Strings are slow

2002-11-19 Thread Lauri Alanko
Hello. Here's a grep program:


module Main where

import Text.Regex
import System.IO
import System.Environment
import Control.Monad
import Data.Maybe

main = do [re] - getArgs
  let rx = mkRegex re
  let loop = do line - getLine
when (isJust (matchRegex rx line)) (putStrLn line)
eof - isEOF
unless eof loop
  loop


It turned out that this is remarkably slow. The first problem was with
inlining. If this is compiled with ghc-5.04 -O -ddump-simpl, I get:

  case GHC.IOBase.unsafePerformIO
 @ (Data.Maybe.Maybe
(GHC.Base.String,
 GHC.Base.String,
 GHC.Base.String,
 [GHC.Base.String]))
 (Text.Regex.Posix.regexec (Text.Regex.mkRegex re) a731)
  of wild4 {

Ie. the regex is compiled anew every time a string is matched. A bug?

Anyway, without optimization the code produced is reasonable, but still
horrendously slow. Testing with a simple word as a pattern from a 7.3MB,
800kline file, the running time was 37.5 seconds. For comparison, a similar
program in mzscheme (interpreted!) took 7.3 seconds while the system
grep, of course, took 0.4 seconds.

I did some profiling by creating new top-level bindings for matchRegex
and getLine (is there a better way?):


total time  =   53.34 secs   (2667 ticks @ 20 ms)
total alloc = 1,172,482,496 bytes  (excludes profiling overheads)

COST CENTREMODULE   %time %alloc

match  Main  69.7   56.8
getl   Main  23.9   40.2
main   Main   6.33.0


So it seems like all the time is spent just converting ByteArrays to
char lists to C arrays. This makes me wonder how sensible it really is
to represent strings by char lists. Yes, it's nice and uniform and lazy,
but...

How can I get this faster, then? PackedStrings are not very useful
because they just don't support enough operations (getline, matchregex)
alone, and having to convert them to Strings sort of defeats their
purpose. _Any_ operation that provides only a String interface condemns
us to a gazillion allocations. And by default all char*-based foreign
interfaces are represented with Strings on the Haskell side.

Maybe a generic Textual class with at least String and PackedString (and
ByteArray?) as instances would help? Then the common string-based
operations could all have separate implementations for separate
representations. With heavy specialization, of course. This would be
especially useful if the FFI (especially withCString) supported it.

Or alternatively, maybe the foldr/build rewriting trick could be used to
eliminate some redundant conversions between representations?

Just throwing ideas in the air here.


Lauri Alanko
[EMAIL PROTECTED]

___
Glasgow-haskell-users mailing list
[EMAIL PROTECTED]
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users



RE: Strings are slow

2002-11-19 Thread Simon Marlow
 module Main where
 
 import Text.Regex
 import System.IO
 import System.Environment
 import Control.Monad
 import Data.Maybe
 
 main = do [re] - getArgs
 let rx = mkRegex re
 let loop = do line - getLine
   when (isJust (matchRegex rx line)) 
 (putStrLn line)
   eof - isEOF
   unless eof loop
 loop
 
 
 It turned out that this is remarkably slow. The first problem was with
 inlining. If this is compiled with ghc-5.04 -O -ddump-simpl, I get:
 
 case GHC.IOBase.unsafePerformIO
@ (Data.Maybe.Maybe
   (GHC.Base.String,
GHC.Base.String,
GHC.Base.String,
[GHC.Base.String]))
(Text.Regex.Posix.regexec 
 (Text.Regex.mkRegex re) a731)
 of wild4 {

This is indeed an optimiser bug, but it's the result of a design
decision: GHC is a bit laid back about inlining things inside the state
lambda in the IO monad, because it often enables important
optimisations.  However, we're experimenting with modifying this
optimisation so that it will be less likely to kill performance in the
way it did in your example.

In the meantime, you can add rx as an argument to loop, that will be
enough to fool GHC into not inlining rx.

 Ie. the regex is compiled anew every time a string is matched. A bug?
 
 Anyway, without optimization the code produced is reasonable, 
 but still
 horrendously slow. Testing with a simple word as a pattern 
 from a 7.3MB,
 800kline file, the running time was 37.5 seconds. For 
 comparison, a similar
 program in mzscheme (interpreted!) took 7.3 seconds while the system
 grep, of course, took 0.4 seconds.
 
 I did some profiling by creating new top-level bindings for matchRegex
 and getLine (is there a better way?):
 
 
 total time  =   53.34 secs   (2667 ticks @ 20 ms)
 total alloc = 1,172,482,496 bytes  (excludes 
 profiling overheads)
 
 COST CENTREMODULE   %time %alloc
 
 match  Main  69.7   56.8
 getl   Main  23.9   40.2
 main   Main   6.33.0
 
 
 So it seems like all the time is spent just converting ByteArrays to
 char lists to C arrays. This makes me wonder how sensible it really is
 to represent strings by char lists. Yes, it's nice and 
 uniform and lazy, but...

String processing in Haskell is very slow, due to the list-of-characters
representation.  A more complete PackedString library with better
integration with other libraries (like Text.Regex) would help a lot for
these kind of examples.

Cheers,
Simon
___
Glasgow-haskell-users mailing list
[EMAIL PROTECTED]
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users