DJTB wrote:
> I'm trying to manually parse a dataset stored in a file. The
> data should be converted into Python objects.
>
> Here is an example of a single line of a (small) dataset:
>
> 3 13 17 19 -626177023 -1688330994 -834622062 -409108332 297174549
> 955187488 589884464 -1547848504 857311165 585616830 -749910209
> 194940864 -1102778558 -1282985276 -1220931512 792256075 -340699912
> 1496177106 1760327384 -1068195107 95705193 1286147818 -416474772
> 745439854 1932457456 -1266423822 -1150051085 1359928308 129778935
> 1235905400 532121853
>
> The first integer specifies the length of a tuple object. In this
> case, the tuple has three elements: (13, 17, 19).
> The other values (-626177023 to 532121853) are elements of a Set.
>
> I use the following code to process a file:
>
> from time import time
> from sets import Set
> from string import split
>
> file = 'pathtable_ht.dat'
> result = []
> start_time = time()
> f = open(file, 'r')
> for line in f:
>     splitres = line.split()
>     tuple_size = int(splitres[0]) + 1
>     path_tuple = tuple(splitres[1:tuple_size])
>     conflicts = Set(map(int, splitres[tuple_size:-1]))
>     # do something with 'path_tuple' and 'conflicts'
>     # ... do some processing ...
>     result.append((path_tuple, conflicts))
>
> f.close()
> print time() - start_time
>
> The elements (integer objects) in these Sets are shared between the
> sets; in fact, there are as many distinct elements as there are lines
> in the file (e.g. 1000 lines -> 1000 distinct set elements). AFAIK,
> the elements are stored only once and each Set contains a pointer to
> the actual object.
>
> This works fine with relatively small datasets, but it doesn't work
> at all with large datasets (4500 lines, 45000 chars per line).
>
> After a few seconds of loading, all main memory is consumed by the
> Python process and the computer starts swapping. After a few more
> seconds, CPU usage drops from 99% to 1% and all swap memory is
> consumed:
>
> Mem:   386540k total,   380848k used,     4692k free,      796k buffers
> Swap:  562232k total,   562232k used,        0k free,    27416k cached
>
> At this point, my computer becomes unusable.
>
> I'd like to know if I should buy some more memory (a few GB?) or if
> it is possible to make my code more memory efficient.
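One assumption in the code above is worth checking: Set(map(int, ...)) builds
a brand-new int object for every token on every line, so equal values on
different lines are not automatically shared (CPython only interns small
integers). A minimal sketch of one way to actually get the sharing the post
assumes, using an ordinary dict as a cache; the names shared_int and
_int_cache are illustrative, not from the original code:

    # Illustrative sketch: keep one int object per distinct value so that
    # every Set referring to that value shares a single object.
    _int_cache = {}

    def shared_int(token):
        """Convert a string token to an int, reusing a cached object
        for values that have been seen before."""
        value = int(token)
        return _int_cache.setdefault(value, value)

    # e.g. inside the parsing loop:
    #     conflicts = Set(map(shared_int, splitres[tuple_size:]))

If the number of distinct values really is on the order of the number of
lines, this collapses millions of duplicate int objects into a few thousand
shared ones.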
The first question I would ask is: what are you doing with "result", and can
the consumption of "result" be done iteratively?


Robert Brewer
System Architect
Amor Ministries
[EMAIL PROTECTED]
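For what "iteratively" might look like in practice, here is a minimal sketch
that yields one record at a time instead of accumulating everything in
result. It assumes the downstream processing can consume records one by one;
the function name iter_records is a placeholder, and the slice is
[tuple_size:] rather than the original [tuple_size:-1], which silently drops
the last value on each line:

    from sets import Set

    def iter_records(path):
        """Yield one (path_tuple, conflicts) pair per line of the file,
        so only the current line's data has to be in memory at once."""
        f = open(path, 'r')
        for line in f:
            fields = line.split()
            tuple_size = int(fields[0]) + 1
            path_tuple = tuple(fields[1:tuple_size])
            conflicts = Set(map(int, fields[tuple_size:]))
            yield path_tuple, conflicts
        f.close()

    # Process each record as it is produced instead of building 'result':
    for path_tuple, conflicts in iter_records('pathtable_ht.dat'):
        pass  # ... do some processing ...

Whether this helps depends on the answer to the question above: if every
record really does have to stay in memory for the processing step, an
iterator only postpones the problem.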