Re: Phobos usability with text files

Michel Fortin Sun, 26 Dec 2010 09:55:26 -0800

On 2010-12-26 12:13:41 -0500, Andrei Alexandrescu<[email protected]> said:

On 12/26/10 10:12 AM, bearophile wrote:
This is related to this (closed) issue, but this time I prefer todiscuss the topic on the newsgroup:
http://d.puremagic.com/issues/show_bug.cgi?id=4474
To show why this enhancement request is useful I use a little scriptingtask. A D2 program has to read a file that contains one word in eachline, and print all the words with the highest length.
A first naive functional style implementation (if you keep in mind thatFile opens files in binary mode on default, this default is differentfrom the Python one, I am not sure what's the best default. I thinkmost files I open are text ones, so I think a bit better default is toopen in text mode on default), it doesn't work, it produces an accessviolation:
// #1
import std.stdio, std.array, std.algorithm;
void main() {
     auto lazyWords = File("words.txt", "r").byLine();
     auto maxLen = reduce!max(map!"a.length"(lazyWords));
     writeln(filter!((w){ return w.length == maxLen; })(lazyWords));
}
Finding the max length consumes the lazy iterable, so if the text fileis small a possible functional style solution is to convert the lazyiterable into an array (other functional style solutions are possible,like reading the file twice, etc):
// #2
import std.stdio, std.array, std.algorithm;
void main() {
     auto lazyWords = File("words.txt", "r").byLine();
     auto words = array(lazyWords);
     auto maxLen = reduce!max(map!"a.length"(words));
     writeln(filter!((w){ return w.length == maxLen; })(words));
}
But #2 doesn't work, because while "words" is an array of the words,byLine() uses the same buffer, so words is a sequence of useless stuff.
To debug that code you need to dup each array item:

// #3
import std.stdio, std.array, std.algorithm;
void main() {
     auto lazyWords = File("words.txt", "r").byLine();
     auto words = array(map!"a.dup"(lazyWords));
     auto maxLen = reduce!max(map!"a.length"(words));
     writeln(filter!((w){ return w.length == maxLen; })(words));
}
#3 works, but for scripting-line programs it's not the first version Iwrite, I have had to debug the code. I think other programmers willhave the same troubles. So I think byLine() default behavour is a bitbug-prone. Re-using the same buffer gives a good performance boost, butit's often a premature optimization in scripts. I prefer a language andlibrary to use unsafe optimizations only on request.
My suggested solution is to split File.byLine() in two memberfunctions, one returns a lazy iterable that reuses the line buffer andone that doesn't reuse them. And my preferred solution is to give theshorter name (so it becomes the "default") to the method that dups theline, because this is the safer (less bug prone) behaviour. This isalso in accord with D Zen. They may be named byLine() and byFastLine()or something similar.
If you prefer the "default" one to be the less safe version, then thesafe version may be named byDupLine():
// #4
import std.stdio, std.array, std.algorithm;
void main() {
     auto lazyWords = File("words.txt", "r").byDupLine();
     auto lines = array(lazyWords);
     auto maxLen = reduce!max(map!"a.length"(words));
     writeln(filter!((w){ return w.length == maxLen; })(words));
}
Using the "maxs" (fromhttp://d.puremagic.com/issues/show_bug.cgi?id=4705 ), the code becomes:
// #5
import std.stdio, std.algorithm;
void main() {
     auto lazyWords = File("words.txt", "r").byDupLine();
     writeln(maxs!"a.length"(lazyWords));
}
But you don't need a new string for each line to evaluate max over linelengths; the current byLine works.


That's true.

Generally I think buffer reuse in byLine() is too valuable to let go.


I also agree it's wasteful.

But I think bearophile's experiment has illustrated two noteworthyproblems. The first issue is that calling filter! on thealready-consumed result of byLine() gives you a seg fault. I reproducedthis locally, but haven't pinpointed the problem.


The second one is this:

        array(file.byline())

which gives a wrong result because of the buffer reuse. Either itshould not compile or it should idup every line (both of which are notentirely satisfactory, but they're better than getting wrong results).

I think a range should be able to express if the value can be reused ornot. If the value can't be reused, then the algorithm should either notinstantiate, or in some cases it might create a copy.



--
Michel Fortin
[email protected]
http://michelf.com/

Re: Phobos usability with text files

Reply via email to