On Mon, 31 Mar 2008 22:27:39 -0700 (PDT), Paddy <[EMAIL PROTECTED]> wrote:
> On Mar 31, 11:47 pm, Jorgen Grahn <[EMAIL PROTECTED]> wrote:
>> On 31 Mar 2008 06:54:29 GMT, Marc 'BlackJack' Rintsch <[EMAIL PROTECTED]> wrote:
>>
>> > On Sun, 30 Mar 2008 21:02:44 +0000, Jorgen Grahn wrote:
>>
>> >> I realize this has to do with the extra read-ahead buffering documented
>> >> for file.next() and that I can work around it by using file.readline()
>> >> instead.
>> > You can use ``for line in lines:`` and pass ``iter(sys.stdin.readline,'')``
>> > as iterable for `lines`.
>>
>> Thanks. I wasn't aware that building an iterator was that easy. The
>> tiny example program then becomes
>>
>> By the way, I timed the three solutions given so far using 5 million
>> lines of standard input. It went like this:
>>
>>   for s in file      : 1
>>   iter(readline, '') : 1.30  (i.e. 30% worse than for s in file)
>>   while 1            : 1.45  (i.e. 45% worse than for s in file)
>>   Perl while(<>)     : 0.65
>>
>> I suspect most of the slowdown comes from the interpreter having to
>> execute more user code, not from lack of extra heavy input buffering.

> Hi Juergen,
> From the python manpage:
>        -u     Force stdin, stdout and stderr to be totally unbuffered.
>               On systems where it matters, also put stdin, stdout and
>               stderr in binary mode.  Note that there is internal
>               buffering in xreadlines(), readlines() and file-object
>               iterators ("for line in sys.stdin") which is not influenced
>               by this option.  To work around this, you will want to use
>               "sys.stdin.readline()" inside a "while 1:" loop.
> Maybe try adding the python -u option?

Doesn't help when the code is in a module, unfortunately.

> Buffering is supposed to help when processing large amounts of I/O,
> but gives the 'many lines in before any output' that you saw
> originally.

"Is supposed to help", yes. I suspect (but cannot prove) that the kind
of buffering done here doesn't buy more than 10% or so even in
artificial tests, if you consider the fact that "for s in f" is in
itself a faster construct than my workarounds in user code.

Note that even with buffering, there seems to be one system call per
line when used interactively, and lines are of course passed to user
code one by one.

Lastly, there is still the question of having to press Ctrl-D twice to
end the loop, which I mentioned in the original posting. That still
feels very wrong.

> If the program is to be mainly used to handle millions of
> lines from a pipe or file, then why not leave the buffering in?
> If you need both interactive and batch friendly I/O modes you might
> need to add the ability to switch between two modes for your program.

That is exactly the tradeoff I am dealing with right now, and I think I
have come to the conclusion that I want no buffering. My source data set
can be huge (gigabytes of text), but in reality it is boiled down to at
most 50000 lines by a Perl script further to the left in my pipeline:

  zcat foo.gz | perl | python > bar

The Perl script takes ~100 times longer to execute, and both are
designed as filters, which means a modest increase in CPU time for the
Python script isn't visible to the end user.

/Jorgen

-- 
  // Jorgen Grahn <grahn@       Ph'nglui mglw'nafh Cthulhu
\X/  snipabacken.se>            R'lyeh wgah'nagl fhtagn!
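For reference, here is a minimal sketch of the two reading styles
discussed in this thread, written as a simple pass-through filter on
standard input. The function names (filter_lines, filter_lines_buffered)
and the pass-through behaviour are illustrative, not taken from the
original posts:

  import sys

  def filter_lines(process):
      # iter() with a sentinel keeps calling sys.stdin.readline() until it
      # returns '' at end-of-file, so every line reaches user code as soon
      # as it is read: no read-ahead buffer, and a single Ctrl-D on an
      # empty line ends the loop when run interactively.
      for line in iter(sys.stdin.readline, ''):
          process(line)

  def filter_lines_buffered(process):
      # The plain file-object iterator uses the internal read-ahead
      # buffering mentioned in the manpage excerpt above: a bit faster on
      # millions of lines from a pipe, but it holds lines back when the
      # script is used interactively.
      for line in sys.stdin:
          process(line)

  if __name__ == '__main__':
      # e.g.  zcat foo.gz | perl ... | python filter.py > bar
      filter_lines(sys.stdout.write)

Either variant drops into a pipeline like the one above; per the timings
quoted earlier, the unbuffered version trades a modest amount of extra
CPU time for interactive-friendly, line-at-a-time behaviour.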