alex23 wrote on Monday, 17 September 2012 at 11:25:06 UTC+8:
> On Sep 17, 12:32 pm, "Jadhav, Alok" <alok.jad...@credit-suisse.com> wrote:
> > - As you have seen, the line separator is not '\n' but it's '|\n'.
> > Sometimes the data itself has '\n' characters in the middle of the
> > line, and the only way to find the true end of the line is that the
> > previous character should be a bar '|'. I was not able to specify the
> > end of line using the readlines() function, but I could do it using
> > the split() function. (One hack would be to readlines and combine
> > them until I find '|\n'. Is there a cleaner way to do this?)
>
> You can use a generator to take care of your readlines requirements:
>
>     def readlines(f):
>         lines = []
>         while "f is not empty":
>             line = f.readline()
>             if not line:
>                 break
>             if len(line) > 2 and line[-2:] == '|\n':
>                 lines.append(line)
>                 yield ''.join(lines)
>                 lines = []
>             else:
>                 lines.append(line)
>
> > - Reading the whole file at once and processing it line by line was
> > much faster. Though speed is not a very important issue here, I think
> > the time it took to parse the complete file was reduced to one third
> > of the original time.
>
> With the readlines generator above, it'll read lines from the file
> until it has a complete "line" by your requirement, at which point
> it'll yield it. If you don't need the entire file in memory for the
> end result, you'll be able to process each "line" one at a time and
> perform whatever you need against it before asking for the next.
>
>     with open(u'infile.txt', 'r') as infile:
>         for line in readlines(infile):
>             ...
>
> Generators are a very efficient way of processing large amounts of
> data. You can chain them together very easily:
>
>     real_lines = readlines(infile)
>     marker_lines = (l for l in real_lines if l.startswith('#'))
>     every_second_marker = (l for i, l in enumerate(marker_lines)
>                            if (i + 1) % 2 == 0)
>     map(some_function, every_second_marker)
>
> The real_lines generator returns your definition of a line. The
> marker_lines generator filters out everything that doesn't start with
> '#', while every_second_marker returns only half of those. (Yes, these
> could all be written as a single generator, but this is very useful
> for more complex pipelines.)
>
> The big advantage of this approach is that nothing is read from the
> file into memory until map is called, and given the way they're
> chained together, only one of your lines should be in memory at any
> given time.
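As a side note on the quoted question about a cleaner way to combine lines
until '|\n': a chunk-based reader is another option, since it avoids
readline() entirely and splits directly on the record terminator. The sketch
below is only an illustration, not from the original posts; read_records and
the chunk size are made-up names/values, and unlike the quoted generator it
strips the '|\n' terminator from each record.

    def read_records(f, chunk_size=64 * 1024):
        """Yield '|\n'-terminated records, reading the file in fixed-size chunks."""
        buf = ''
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            buf += chunk
            parts = buf.split('|\n')
            # The last piece may be an incomplete record; keep it for the next chunk.
            buf = parts.pop()
            for record in parts:
                yield record
        if buf:
            # Trailing data with no final '|\n' terminator.
            yield buf
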
The basic problem is whether the output items really need all lines of the input text file to be buffered before the results can be produced.
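If each output item depends only on its own record, nothing beyond the
current record ever has to be buffered and the whole chain stays lazy. A
minimal sketch, assuming the readlines generator quoted above and a
hypothetical process_record function standing in for the real per-record
work:

    def process_record(record):
        # Stand-in for whatever work turns one input record into one output item.
        return len(record)

    with open('infile.txt', 'r') as infile:
        for record in readlines(infile):
            result = process_record(record)
            # Write out or aggregate each result immediately; only the
            # current record is held in memory.
            print(result)

Only when an output item genuinely needs information from many records at
once (a global sort, say) does the full file have to be held in memory.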