On Sun, 26 Dec 2010 12:53:29 -0500 Michel Fortin <[email protected]> wrote:
> On 2010-12-26 12:13:41 -0500, Andrei Alexandrescu > <[email protected]> said: > > > On 12/26/10 10:12 AM, bearophile wrote: > >> This is related to this (closed) issue, but this time I prefer to > >> discuss the topic on the newsgroup: > >> http://d.puremagic.com/issues/show_bug.cgi?id=4474 > >> > >> To show why this enhancement request is useful I use a little scripting > >> task. A D2 program has to read a file that contains one word in each > >> line, and print all the words with the highest length. > >> > >> A first naive functional style implementation (if you keep in mind that > >> File opens files in binary mode on default, this default is different > >> from the Python one, I am not sure what's the best default. I think > >> most files I open are text ones, so I think a bit better default is to > >> open in text mode on default), it doesn't work, it produces an access > >> violation: > >> > >> // #1 > >> import std.stdio, std.array, std.algorithm; > >> void main() { > >> auto lazyWords = File("words.txt", "r").byLine(); > >> auto maxLen = reduce!max(map!"a.length"(lazyWords)); > >> writeln(filter!((w){ return w.length == maxLen; })(lazyWords)); > >> } > >> > >> > >> Finding the max length consumes the lazy iterable, so if the text file > >> is small a possible functional style solution is to convert the lazy > >> iterable into an array (other functional style solutions are possible, > >> like reading the file twice, etc): > >> > >> // #2 > >> import std.stdio, std.array, std.algorithm; > >> void main() { > >> auto lazyWords = File("words.txt", "r").byLine(); > >> auto words = array(lazyWords); > >> auto maxLen = reduce!max(map!"a.length"(words)); > >> writeln(filter!((w){ return w.length == maxLen; })(words)); > >> } > >> > >> But #2 doesn't work, because while "words" is an array of the words, > >> byLine() uses the same buffer, so words is a sequence of useless stuff. > >> > >> To debug that code you need to dup each array item: > >> > >> // #3 > >> import std.stdio, std.array, std.algorithm; > >> void main() { > >> auto lazyWords = File("words.txt", "r").byLine(); > >> auto words = array(map!"a.dup"(lazyWords)); > >> auto maxLen = reduce!max(map!"a.length"(words)); > >> writeln(filter!((w){ return w.length == maxLen; })(words)); > >> } > >> > >> > >> #3 works, but for scripting-line programs it's not the first version I > >> write, I have had to debug the code. I think other programmers will > >> have the same troubles. So I think byLine() default behavour is a bit > >> bug-prone. Re-using the same buffer gives a good performance boost, but > >> it's often a premature optimization in scripts. I prefer a language and > >> library to use unsafe optimizations only on request. > >> > >> My suggested solution is to split File.byLine() in two member > >> functions, one returns a lazy iterable that reuses the line buffer and > >> one that doesn't reuse them. And my preferred solution is to give the > >> shorter name (so it becomes the "default") to the method that dups the > >> line, because this is the safer (less bug prone) behaviour. This is > >> also in accord with D Zen. They may be named byLine() and byFastLine() > >> or something similar. > >> > >> If you prefer the "default" one to be the less safe version, then the > >> safe version may be named byDupLine(): > >> > >> // #4 > >> import std.stdio, std.array, std.algorithm; > >> void main() { > >> auto lazyWords = File("words.txt", "r").byDupLine(); > >> auto lines = array(lazyWords); > >> auto maxLen = reduce!max(map!"a.length"(words)); > >> writeln(filter!((w){ return w.length == maxLen; })(words)); > >> } > >> > >> > >> Using the "maxs" (from > >> http://d.puremagic.com/issues/show_bug.cgi?id=4705 ), the code becomes: > >> > >> // #5 > >> import std.stdio, std.algorithm; > >> void main() { > >> auto lazyWords = File("words.txt", "r").byDupLine(); > >> writeln(maxs!"a.length"(lazyWords)); > >> } > > > > But you don't need a new string for each line to evaluate max over line > > lengths; the current byLine works. > > That's true. > > > > Generally I think buffer reuse in byLine() is too valuable to let go. > > I also agree it's wasteful. That's true, indeed. But here the point, I guess, is that you need the reuse the line sequence (whatever it is), as argument to filter. > But I think bearophile's experiment has illustrated two noteworthy > problems. The first issue is that calling filter! on the > already-consumed result of byLine() gives you a seg fault. I reproduced > this locally, but haven't pinpointed the problem. > > The second one is this: > > array(file.byline()) > > which gives a wrong result because of the buffer reuse. Either it > should not compile or it should idup every line (both of which are not > entirely satisfactory, but they're better than getting wrong results). I tried it myself, and the result is very strange: unittest { // text.txt holds: "abc\ndefgh\nijk\nlmn\nopqrs\ntuvwx\nyz\n" auto lazyWords = File("test.txt").byLine(); //~ auto words = Array(map!((ll){return ll.dup;})(lazyWords)); auto words = Array(lazyWords); writeln(words); auto lengths = map!((l){return l.length;})(words); writeln(lengths); auto maxLength = reduce!max(lengths); writeln(maxLength); auto longWords = filter!((l){return l.length==maxLength;})(words); writeln(longWords); } ==> [yz , yz wx, yz , yz , yz wx, yz wx, yz] [3, 5, 3, 3, 5, 5, 2] 5 [yz wx, yz wx, yz wx] I don't understand why/how, asp that lengths is correct ;-) When replacing the "auto words = ..." statements, we get as expected: [abc, defgh, ijk, lmn, opqrs, tuvwx, yz] [3, 5, 3, 3, 5, 5, 2] 5 [defgh, opqrs, tuvwx] > I think a range should be able to express if the value can be reused or > not. If the value can't be reused, then the algorithm should either not > instantiate, or in some cases it might create a copy. Makes sense. I also like Bearophile's proposal. But would let 'byLines' for the lazy/single-iteration-only version and call the dupping method simply 'lines'. For me (is it just me?), the latter correctly expresses an array-like collection of elements one can safely (re)use. Denis -- -- -- -- -- -- -- vit esse estrany ☣ spir.wikidot.com
