Phobos usability with text files

bearophile Sun, 26 Dec 2010 08:15:25 -0800

This is related to this (closed) issue, but this time I prefer to discuss the 
topic on the newsgroup:
http://d.puremagic.com/issues/show_bug.cgi?id=4474


To show why this enhancement request is useful I use a little scripting task. A 
D2 program has to read a file that contains one word in each line, and print 
all the words with the highest length.

A first naive functional style implementation (if you keep in mind that File 
opens files in binary mode on default, this default is different from the 
Python one, I am not sure what's the best default. I think most files I open 
are text ones, so I think a bit better default is to open in text mode on 
default), it doesn't work, it produces an access violation:

// #1
import std.stdio, std.array, std.algorithm;
void main() {
    auto lazyWords = File("words.txt", "r").byLine();
    auto maxLen = reduce!max(map!"a.length"(lazyWords));
    writeln(filter!((w){ return w.length == maxLen; })(lazyWords));
}


Finding the max length consumes the lazy iterable, so if the text file is small 
a possible functional style solution is to convert the lazy iterable into an 
array (other functional style solutions are possible, like reading the file 
twice, etc):

// #2
import std.stdio, std.array, std.algorithm;
void main() {
    auto lazyWords = File("words.txt", "r").byLine();
    auto words = array(lazyWords);
    auto maxLen = reduce!max(map!"a.length"(words));
    writeln(filter!((w){ return w.length == maxLen; })(words));
}

But #2 doesn't work, because while "words" is an array of the words, byLine() 
uses the same buffer, so words is a sequence of useless stuff.

To debug that code you need to dup each array item:

// #3
import std.stdio, std.array, std.algorithm;
void main() {
    auto lazyWords = File("words.txt", "r").byLine();
    auto words = array(map!"a.dup"(lazyWords));
    auto maxLen = reduce!max(map!"a.length"(words));
    writeln(filter!((w){ return w.length == maxLen; })(words));
}


#3 works, but for scripting-line programs it's not the first version I write, I 
have had to debug the code. I think other programmers will have the same 
troubles. So I think byLine() default behavour is a bit bug-prone. Re-using the 
same buffer gives a good performance boost, but it's often a premature 
optimization in scripts. I prefer a language and library to use unsafe 
optimizations only on request.

My suggested solution is to split File.byLine() in two member functions, one 
returns a lazy iterable that reuses the line buffer and one that doesn't reuse 
them. And my preferred solution is to give the shorter name (so it becomes the 
"default") to the method that dups the line, because this is the safer (less 
bug prone) behaviour. This is also in accord with D Zen. They may be named 
byLine() and byFastLine() or something similar.

If you prefer the "default" one to be the less safe version, then the safe 
version may be named byDupLine():

// #4
import std.stdio, std.array, std.algorithm;
void main() {
    auto lazyWords = File("words.txt", "r").byDupLine();
    auto lines = array(lazyWords);
    auto maxLen = reduce!max(map!"a.length"(words));
    writeln(filter!((w){ return w.length == maxLen; })(words));
}


Using the "maxs" (from http://d.puremagic.com/issues/show_bug.cgi?id=4705 ), 
the code becomes:

// #5
import std.stdio, std.algorithm;
void main() {
    auto lazyWords = File("words.txt", "r").byDupLine();
    writeln(maxs!"a.length"(lazyWords));
}

Bye,
bearophile

Phobos usability with text files

Reply via email to