I'm writing a tokeniser for a legacy programming language in Haskell.
I started with the obvious
        main = getContents >>= print . tokenise
where tokenise maps its way down a list of characters.  This is very
simple, very pleasant, and worked like a charm.

However, the language has an INCLUDE directive, so I'm going to have
to call readFile or something in the middle of tokenising, so the
main tokeniser loop can't be a pure String -> [Token] function any
more.  It occurred to me to wonder whether it would be better to try
to stick with the list of characters style as much as possible (and
use readFile) or to switch over to character at a time processing
(with openFile, hIsEOF, hGetChar, hClose).

So I decided on an experiment:  count the lines both ways.

Method 1A (pure list processing)

    main = getContents >>= print . doit 0

    doit n ('\n':cs) = doit (n+1) cs
    doit n ( _  :cs) = doit  n    cs
    doit n []        = n

Method 1B (doit is in the IO monad)

    main = getContents >>= doit 0

    doit n ('\n':cs) = doit (n+1) cs
    doit n ( _  :cs) = doit  n    cs
    doit n []        = print n

Method 2 (character at a time)

    main = doit 0

    doit n = isEOF >>= \eof ->
             if eof then print n else
             getChar >>= \c ->
             doit (if c == '\n' then n+1 else n)

I tried this using GHC 6.4 on a 2.8 GHz PC running linux 2.6.20-1.2933.fc6 I don't know whether gcc was involved at all, but in case it was, the gcc
version was 4.1.1.  All test programs were compiled using
        ghc -o foobar -O foobar.hs

The input was a 3.8 MB file.

wc:         0.08 seconds total
Method 1A:  0.30 seconds total
Method 1B:  0.47 seconds total
Method 2:   ran out of stack
Method 2':  2.21 seconds total.

When I got the stack overflow, I suspected that this might be due to
lazy evaluation of the integer argument to doit, so I changed the last
line to
             doit $! (if c == '\n' then n+1 else n)
and that worked.  In *retrospect*, it is really obvious why this was
necessary, but I must say that in *prospect* I wasn't expecting it.

Other things being equal, I would have expected Method 2' to be
faster.  I'm now wondering if there is some other consequence of
laziness that I am just not seeing.

I'm also a little puzzled that there is a measurable difference
between method 1A and method 1B.

Things that I am not asking about:
 - ghc version; that's what's installed on the machine and if I want
   something different I may have to wait many days.
 - style comments on the list processing versions; yes I know how to
   write getContents >>= print . length . filter (=='\n') and that
   really has no relevance to the full tokeniser and in any case is
   slower than Method 1A.



_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

Reply via email to