On Wed, 30 Jul 2014 17:04:04 -0600, Vincent Davis wrote: > I know this is a general python list and I am asking about pandas but > this question is probably not great for asking on stackoverflow. I have > a list of files (~80 files, ~30,000 rows) I need to process with my > current code it is take minutes for each file.
Hmmm, is that 30,000 rows per file, or 30,000 files in total? Not that it really matters, I shouldn't expect that it makes that much difference either way. > Any suggestions of a fast > way. I am try to stick with pandas for educational purposes. Any > suggestions would be great. If you are curious the can find the data > file I am using below here. > http://www.nber.org/nhamcs/data/nhamcsopd2010.csv For brevity, I've trimmed back the dictionaries to something smaller. That's just for ease of reading. > drugs_current = {'CITALOPRAM': 4332, > 'ESCITALOPRAM': 4812, [...] > 'CLONIDINE': 44, > 'GUANFACINE': 717} > > drugs_98_05 = { 'SERTRALINE': 56635, > 'CITALOPRAM': 59829, [...] > 'CLONIDINE2': 70357, > 'GUANFACINE': 52498 > } > > df = pd.read_csv('nhamcsopd2010.csv' , index_col='PATCODE', > low_memory=False) > col_init = list(df.columns.values) > keep_col = ['PATCODE', 'PATWT', 'VDAY', 'VMONTH', 'VYEAR', 'MED1', > 'MED2', 'MED3', 'MED4', 'MED5'] > for col in col_init: > if col not in keep_col: > del df[col] I expect that this could be your problem here. Deleting from the start or middle of lists is slow, and df may be a list or at least have list-like performance. Suppose you have a list like this: ['a', 'b', 'c', 'd', 'e', ..., 'zzz'] that is, a total of 26 + 26**2 + 26**3 = 18278 items. Now suppose you delete item 0, 'a': => ['b', 'c', 'd', 'e', ..., 'zzz'] Python has to move the remaining 18278 items across one space. Then you delete 'b': => ['c', 'd', 'e', ..., 'zzz'] Python has to move the remaining 18276 items across one space, making a total of 36559 moves. And that's just to delete two items. Roughly speaking, if you end up deleting N items from a list starting from the front, Python may have to move as many as N**2 items into their final positions. If you have 5 or 10 columns, that's not too bad, but if you have (say) 80 columns, and delete 70 of them, that could be *horribly* expensive. If you must *repeatedly* use del on lists, it's best to ensure you're deleting from the end, not the start. But even better, and this applies to anything not just lists, is not to delete at all, but to create a new list, copying the columns you want, rather than deleting the columns you don't want. I'm not familiar with pandas and am not sure about the exact syntax needed, but something like: new_df = [] # Assuming df is a list. for col in df: if col.value in keep_col: new_df.append(col) > if f[-3:] == 'csv' and f[-6:-4] in ('93', '94', '95', '96', '97', '98', > '99', '00', '91', '02', '03', '04', '05'): Where does f come from? You haven't shown the definition of that. -- Steven -- https://mail.python.org/mailman/listinfo/python-list