In reference to the several comments along the lines of "[x for x in read] is basically a copy of the entire list. This isn't necessary." (and the same for list(read)): I had thought there was a problem with passing iterators to takewhile(). I thought I had tested it and it didn't work. It seems I was wrong; it clearly works. I'll make this change and see if it is any faster.
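For reference, here is a minimal sketch of handing the csv reader straight to takewhile() without copying it into a list first. The sample text and its layout are made up for illustration and are not my real files. One thing to watch: takewhile() also consumes the first row that fails the test, so the '[MASKS]' row itself is gone from the reader afterwards.

```python
import csv
import io
from itertools import takewhile

# Hypothetical stand-in for one of the files: data rows, then a [MASKS] marker.
sample = io.StringIO("header\n1\t2\n3\t4\n[MASKS]\nm1\n")
reader = csv.reader(sample, delimiter="\t")

# The reader is an iterator, so no intermediate list is needed.
data_rows = list(takewhile(lambda row: "[MASKS]" not in row, reader))

# takewhile() pulled the [MASKS] row off the reader to test it, so the
# reader now resumes on the row after the marker.
rest = list(reader)

print(data_rows)
print(rest)
```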
I actually don't plan to read them all in at once, only as needed, but I do need each whole file in an array to perform some mathematics on it and to compare different files. So my interest was in making it faster to open them as needed. The files are about 5 MB each, so disk speed may be part of it.

Thanks

*Vincent Davis
720-301-3003 *
vinc...@vincentdavis.net

my blog <http://vincentdavis.net> | LinkedIn <http://www.linkedin.com/in/vincentdavis>

On Fri, Feb 19, 2010 at 2:13 PM, Jonathan Gardner <jgard...@jonathangardner.net> wrote:

> On Fri, Feb 19, 2010 at 10:22 AM, Vincent Davis <vinc...@vincentdavis.net> wrote:
>
>> I have some (~50) text files that have about 250,000 rows each. I am
>> reading them in using the following, which gets me what I want. But it is
>> not fast. Is there something I am missing that should help? This is mostly
>> a question to help me learn more about python. It takes about 4 min right
>> now.
>>
>> def read_data_file(filename):
>>     reader = csv.reader(open(filename, "U"),delimiter='\t')
>>     read = list(reader)
>
> You're slurping the entire file here when it's not necessary.
>
>>     data_rows = takewhile(lambda trow: '[MASKS]' not in trow, [x for x in read])
>
> [x for x in read] is basically a copy of the entire list. This isn't
> necessary.
>
>>     data = [x for x in data_rows][1:]
>
> Again, copying here is unnecessary.
>
> [x for x in y] isn't a paradigm in Python. If you really need a copy of an
> array, x = y[:] is the paradigm.
>
>>     mask_rows = takewhile(lambda trow: '[OUTLIERS]' not in trow,
>>         list(dropwhile(lambda drow: '[MASKS]' not in drow, read)))
>>     mask = [row for row in mask_rows if row][3:]
>
> Here's another unnecessary array copy.
>
>>     outlier_rows = dropwhile(lambda drows: '[OUTLIERS]' not in drows, read)
>>     outlier = [row for row in outlier_rows if row][3:]
>
> And another.
>
> Just because you're using Python doesn't mean you get to be silly in how
> you move data around. Avoid copies as much as possible, and try to avoid
> slurping in large files all at once. Line-by-line processing is best.
>
> I think you should invert this operation into a for loop. Most people tend
> to think of things better that way than chained iterators. It also helps you
> to not duplicate data when it's unnecessary.
>
> --
> Jonathan Gardner
> jgard...@jonathangardner.net
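As I understand the for-loop suggestion, it could look something like the sketch below: one pass over the rows, switching the target list whenever a section marker shows up. The function name read_sections and the sample text are my own placeholders. I am assuming each marker sits in a row of its own, and unlike the original code this sketch does not skip the header rows that the [1:] and [3:] slices dropped.

```python
import csv
import io


def read_sections(lines):
    """Split tab-delimited rows into data, mask and outlier lists in one pass."""
    sections = {"data": [], "mask": [], "outlier": []}
    current = "data"  # rows before the first marker are treated as data
    for row in csv.reader(lines, delimiter="\t"):
        if not row:
            continue  # blank lines come back from csv.reader as empty lists
        if "[MASKS]" in row:
            current = "mask"
        elif "[OUTLIERS]" in row:
            current = "outlier"
        else:
            sections[current].append(row)
    return sections


# Made-up sample standing in for a real file; with a file on disk this
# would be read_sections(open(filename, newline="")) instead.
sample = io.StringIO("h1\th2\n1\t2\n[MASKS]\na\tb\n\n[OUTLIERS]\nc\td\n")
result = read_sections(sample)
print(result["data"], result["mask"], result["outlier"])
```

Each row is visited once and stored once, so there is no list(reader) copy and no repeated scanning with dropwhile().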