I have a program that essentially loops through a text file that's
about 800 MB in size containing tab-separated data. My program
parses this file and stores its fields in a dictionary of lists.

f = open('data.txt', 'r')  # placeholder filename
for line in f:
  split_values = line.strip().split('\t')
  # do stuff with split_values

Currently, this is very slow in Python, even if all I do is break up
each line using split() and store its values in a dictionary, indexed
by one of the tab-separated values in the file.
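To make the dictionary part concrete, it looks roughly like this (for
illustration, assume the first column is the key and 'data.txt' is a
placeholder name; the real key column varies):

from collections import defaultdict

table = defaultdict(list)  # key column -> list of rows sharing that key
f = open('data.txt', 'r')
for line in f:
  fields = line.rstrip('\n').split('\t')
  # index by the first field; keep the remaining fields as one row
  table[fields[0]].append(fields[1:])
f.close()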

I'm not sure what the situation is, but I regularly skim through tab-delimited files of a similar size and haven't noticed any problems like the ones you describe. You might try tweaking the optional (and infrequently specified) bufsize parameter of the open()/file() call:

  bufsize = 4 * 1024 * 1024 # buffer 4 megs at a time
  f = file('in.txt', 'r', bufsize)
  for line in f:
    split_values = line.strip().split('\t')
    # do stuff with split_values

If not specified, you're at the mercy of the system default (possibly OS-specific). You can read more at [1], along with the associated warning about setvbuf().
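
If you want to check whether the buffer size actually makes a difference on your box, a quick-and-dirty timing loop along these lines (the file name and sizes below are just placeholders) should tell you:

  import time

  # placeholder sizes; the last matches the 4 MB buffer above
  for bufsize in (8 * 1024, 256 * 1024, 4 * 1024 * 1024):
    start = time.time()
    f = file('in.txt', 'r', bufsize)
    for line in f:
      split_values = line.strip().split('\t')
    f.close()
    print bufsize, time.time() - start  # elapsed seconds for this bufsize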

-tkc


[1] http://docs.python.org/library/functions.html#open
