On 3/18/07, Alex Martelli <[EMAIL PROTECTED]> wrote:
> George Sakkis <[EMAIL PROTECTED]> wrote:
> > On Mar 18, 12:11 pm, "[EMAIL PROTECTED]" <[EMAIL PROTECTED]> wrote:
> > > I need to process a really huge text file (4GB) and this is what I
> > > need to do. It takes forever to complete this. I read somewhere
> > > that "list comprehension" can speed things up. Can you point out
> > > how to do this?
> > >
> > > f = open('file.txt','r')
> > > for line in f:
> > >     db[line.split(' ')[0]] = line.split(' ')[-1]
> > > db.sync()
> >
> > You got several good suggestions; one that has not been mentioned
> > but makes a big (or even the biggest) difference for large/huge
> > files is the buffering parameter of open(). Set it to the largest
> > value you can afford to keep the I/O as low as possible. I'm
> > processing 15-25 GB files (you see, "huge" is really relative ;-))
> > on 2-4GB RAM boxes, and setting a big buffer (1GB or more) reduces
> > the wall time by 30 to 50% compared to the default value. BerkeleyDB
> > should have a buffering [...]
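A minimal sketch of what George's buffering suggestion looks like
applied to the quoted loop, assuming space-separated fields; shelve is
used here only as a hypothetical stand-in for whatever db object is
actually in play, and 256 MB is an arbitrary example size, not a
recommendation:

    import shelve

    BUF = 256 * 1024 * 1024           # buffering argument to open(), in bytes

    db = shelve.open('db_file')       # hypothetical stand-in for the OP's db
    f = open('file.txt', 'r', BUF)    # big stdio buffer -> far fewer reads
    try:
        for line in f:
            fields = line.split(' ')  # split each line once, not twice
            db[fields[0]] = fields[-1]
        db.sync()                     # flush to disk once, after the loop
    finally:
        f.close()
        db.close()

Splitting each line once instead of twice is a separate small win,
independent of the buffer size.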
> Out of curiosity, what OS and FS are you using?

Fedora Core 4 and ext3. Is there something I should do to the FS?

> On a well-tuned FS and OS combo that does "read-ahead" properly, I
> would not expect such improvements for moving from large to huge
> buffering (unless some other pesky process is perking up once in a
> while and sending the disk heads on a quest to never-never land).
> IOW, if I observed this performance behavior on a server machine I'm
> responsible for, I'd look for system-level optimizations (unless I
> know I'm being forced by myopic beancounters to run inappropriate
> OSs/FSs, in which case I'd spend the time polishing my resume
> instead) -- maybe tuning the OS (or mount?) parameters, maybe
> finding a way to satisfy the "other pesky process" without flapping
> disk heads all over the prairie, etc, etc.
>
> The delay of filling a "1 GB or more" buffer before actual
> processing can begin _should_ defeat any gains over, say, a 1 MB
> buffer -- unless, that is, something bad is seriously interfering
> with the normal read-ahead system-level optimization... and in that
> case I'd normally be more interested in finding and squashing the
> "something bad" than in trying to work around it by overprovisioning
> application buffer space!-)

Which should I do? How much buffer should I allocate? I have a box
with 2GB of memory.

Thanks!
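One way to answer the "how much buffer" question empirically, rather
than by guessing: time a plain scan of the same file with a few
candidate sizes and keep the smallest one past which the wall time
stops improving. A rough sketch, assuming the file.txt from the quoted
code:

    import time

    MB = 1024 * 1024
    for size in (1 * MB, 16 * MB, 256 * MB, 1024 * MB):
        t0 = time.time()
        f = open('file.txt', 'r', size)  # third argument: buffer size in bytes
        for line in f:
            pass                         # scan only; real per-line work goes here
        f.close()
        print('%4d MB buffer: %.1f s' % (size // MB, time.time() - t0))

Note that after the first run the OS page cache will make the file
look artificially fast, so use a different file (or drop the page
cache) between runs. And on a 2GB box, a 1GB application buffer leaves
little room for that page cache, which is the gist of Alex's warning.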