Andrei Alexandrescu:
> This makes me think we should have a range that detects and replaces 
> patterns lazily and on the fly.

In Python there is a helper module:
http://docs.python.org/library/fileinput.html
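
For reference, here is a minimal D2 sketch of that lazy replacement idea, at line granularity (the "foo"/"bar" needle and the use of std.array.replace are just my assumptions for the example, not a real design):

--------------------------

// D2 sketch: lazily replace a pattern in the lines of a file
import std.algorithm: map;
import std.array: replace;
import std.stdio: File, writeln;

void main(string[] args) {
    auto file = File(args[1]);
    // map() is lazy: each line is read, copied and rewritten only
    // when the consumer asks for it.
    auto fixed = file.byLine()
                     .map!(line => line.idup.replace("foo", "bar"));
    foreach (line; fixed)
        writeln(line);
}

--------------------------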


> I think it's worth targeting D2 to tasks that are usually handled by
> scripting languages. I've done a lot of that and it beats the hell out
> of rewriting in D a script that's grown out of control

Dynamic languages are handy, but they require some rigour from the programmer. 
Python is probably unfit for writing programs a million lines long, but if you 
train yourself a little and keep your code clean, you can usually write clean 
largish programs in Python.


> That would be great so we can tune our approach. Thanks!

In my dlibs I have an xio module that reads by lines efficiently; it was faster 
than iterating on the lines of BufferedFile.
There are tons of different benchmarks you may use, but it's better to start 
with a simple one that just iterates over the file lines. See below.

Related: experiments have shown that the (oldish) Java GC improves its 
performance if it is able to keep strings (which are immutable) in a separate 
memory pool _and_ to recognize duplicated strings, keeping only one string for 
each equality set. It would be worthwhile to try a similar experiment with the 
D GC, but first you need applications that use the GC, to test whether this 
idea is an improvement :-)
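
The idea, sketched at user level in D (a real GC would do this during 
collection; dedup and pool are hypothetical names):

--------------------------

// Sketch of string pooling: keep one canonical copy per equality set.
import std.stdio: writeln;

string[string] pool;  // content -> canonical copy

string dedup(string s) {
    // Return the pooled copy if one exists, else make s canonical.
    if (auto canonical = s in pool)
        return *canonical;
    pool[s] = s;
    return s;
}

void main() {
    auto a = dedup("abc".idup);
    auto b = dedup("abc".idup);
    writeln(a is b);  // true: both share the same pooled storage
}

--------------------------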


So I have used a minimal benchmark:

--------------------------

# Python 2.7 code
from sys import argv

def process(file_name):
    total = 0
    for line in open(file_name):
        total += len(line)
    return total

print "Total:", process(argv[1])

--------------------------

// D2 code
import std.stdio: File, writeln;

int process(string fileName) {
    int total = 0;

    auto file = File(fileName);
    foreach (rawLine; file.byLine()) {
        // copy the line out of byLine's reusable buffer
        string line = rawLine.idup;
        total += line.length;
    }
    file.close();

    return total;
}

void main(string[] args) {
    if (args.length == 2)
        writeln("Total: ", process(args[1]));
}

--------------------------

In the D code I have added an idup to make the comparison fairer, because in 
the Python code each "line" is a truly newly allocated string: you can safely 
use it, for example, as a dictionary key.
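
A small sketch of why the copy matters: byLine() recycles its internal buffer, 
so the char[] it yields is overwritten on the next iteration, and storing the 
raw slice (for example as an associative array key) would be wrong:

--------------------------

// Counting distinct lines: each key must survive the iteration,
// so a stable immutable copy of the recycled buffer is needed.
import std.stdio: File, writeln;

void main(string[] args) {
    int[string] counts;
    auto file = File(args[1]);
    foreach (rawLine; file.byLine())
        counts[rawLine.idup]++;  // .idup: stable key, not the buffer
    writeln(counts.length, " distinct lines");
}

--------------------------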

I have used Python 2.7 without the Psyco JIT (http://psyco.sourceforge.net/ ) 
to speed up the Python code, because Psyco is not yet available for Python 2.7. 
The D code was compiled with dmd 2.047, optimized build.

As test data I have used a concatenation of all the text files here (they are 
copyrighted, but freely usable):
http://gnosis.cx/TPiP/

The result on Windows is a file of 1_116_552 bytes.

I have then appended the file to itself several times, doubling its length each 
time; the result is a file of 71_459_328 bytes (this is not a fully realistic 
case, because you usually have many small files to read instead of one very 
large file).

The timings are taken with a warm disk cache, so the file is essentially read 
from RAM. This is not fully realistic either, but if you want a repeatable 
benchmark you have to do it this way, because on Windows it's very hard for me 
to make sure that the disk cache is fully empty. So it's better to do the 
opposite and benchmark a warm file.


The output of the Python code is:
Total: 69789888
Runtime: 0.88 seconds (best of 6 runs, minimal variance).


The output of the D code is:
Total: 69789888
Runtime: 1.28 seconds (best of 6 runs, minimal variance).


If in the D2 code I remove the idup, like this:

    foreach (rawLine; file.byLine()) {
        total += rawLine.length;
    }

The output of the D code without the idup is:
Total: 69789888
Runtime: 0.75 seconds (best of 6 runs, minimal variance).

As you can see, it's partly a matter of GC efficiency too.

Besides the GC, the higher performance of the Python code comes from a tuned 
design; see the function getline_via_fgets here:
http://svn.python.org/view/python/trunk/Objects/fileobject.c?revision=81275&view=markup
It also uses a "stack buffer" (char buf[MAXBUFSIZE], where MAXBUFSIZE is 300).
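
For illustration, the same stack-buffer technique can be sketched in D through 
the C stdio functions (countBytes and MY_BUF_SIZE are made-up names; a complete 
version would also need a heap fallback for lines longer than the buffer, as 
the Python code has):

--------------------------

// D sketch of line reading through a fixed stack buffer:
// no GC heap allocation per line.
import core.stdc.stdio: FILE, fopen, fclose, fgets;
import core.stdc.string: strlen;
import std.stdio: writeln;
import std.string: toStringz;

size_t countBytes(const(char)* fileName) {
    enum MY_BUF_SIZE = 300;
    char[MY_BUF_SIZE] buf = void;  // stack buffer, like Python's buf
    size_t total = 0;
    FILE* f = fopen(fileName, "r");
    if (f is null)
        return 0;
    // fgets fills the stack buffer; lines longer than the buffer
    // arrive split in pieces here, but the byte total stays correct.
    while (fgets(buf.ptr, MY_BUF_SIZE, f) !is null)
        total += strlen(buf.ptr);
    fclose(f);
    return total;
}

void main(string[] args) {
    writeln("Total: ", countBytes(args[1].toStringz));
}

--------------------------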

Bye,
bearophile
