On 18:25 Tue 23 Mar , Iustin Pop wrote: > On Tue, Mar 23, 2010 at 01:21:49PM -0400, Nick Bowler wrote: > > On 18:11 Tue 23 Mar , Iustin Pop wrote: > > > I agree with the principle of correctness, but let's be honest - it's > > > (many) orders of magnitude between ByteString and String and Text, not > > > just a few percentage points… > > > > > > I've been struggling with this problem too and it's not nice. Every time > > > one uses the system readFile & friends (anything that doesn't read via > > > ByteStrings), it hell slow. > > > > > > Test: read a file and compute its size in chars. Input text file is > > > ~40MB in size, has one non-ASCII char. The test might seem stupid but it > > > is a simple one. ghc 6.12.1. > > > > > > Data.ByteString.Lazy (bytestring readFile + length) - < 10 miliseconds, > > > incorrect length (as expected). > > > > > > Data.ByteString.Lazy.UTF8 (system readFile + fromString + length) - 11 > > > seconds, correct length. > > > > > > Data.Text.Lazy (system readFile + pack + length) - 26s, correct length. > > > > > > String (system readfile + length) - ~1 second, correct length. > > > > Is this a mistake? Your own report shows String & readFile being an > > order of magnitude faster than everything else, contrary to your earlier > > claim. > > No, it's not a mistake. String is faster than pack to Text and length, but > it's > 100 times slower than ByteString.
Only if you don't care about obtaining the correct answer, in which case you may as well just say const 42 or somesuch, which is even faster. > My whole point is that difference between byte processing and char processing > in Haskell is not a few percentage points, but order of magnitude. I would > really like to have only the 6x penalty that Python shows, for example. Hang on a second... less than 10 milliseconds to read 40 megabytes from disk? Something's fishy here. I ran my own tests with a 400M file (419430400 bytes) consisting almost exclusively of the letter 'a' with two Japanese characters placed at every multiple of 40 megabytes (UTF-8 encoded). With Prelude.readFile/length and 5 runs, I see 10145ms, 10087ms, 10223ms, 10321ms, 10216ms. with approximately 10% of that time spent performing GC each run. With Data.Bytestring.Lazy.readFile/length and 5 runs, I see 8223ms, 8192ms, 8077ms, 8091ms, 8174ms. with approximately 20% of that time spent performing GC each run. Maybe there's some magic command line options to tune the GC for our purposes, but I only managed to make things slower. Thus, I'll handwave a bit and just shave off the GC time from each result. Prelude: 9178ms mean with a standard deviation of 159ms. Data.ByteString.Lazy: 6521ms mean with a standard deviation of 103ms. Therefore, we managed a throughput of 43 MB/s with the Prelude (and got the right answer), while we managed 61 MB/s with lazy ByteStrings (and got the wrong answer). My disk won't go much, if at all, faster than the second result, so that's good. So that's a 30% reduction in throughput. I'd say that's a lot worse than a few percentage points, but certainly not orders of magnitude. On the other hand, using Data.ByteString.Lazy.readFile and Data.ByteString.Lazy.UTF8.length, we get results of around 12000ms with approximately 5% of that time spent in GC, which is rather worse than the Prelude. Data.Text.Lazy.IO.readFile and Data.Text.Lazy.length are even worse, with results of around 25 *seconds* (!!) and 2% of that time spent in GC. GNU wc computes the correct answer as quickly as lazy bytestrings compute the wrong answer. With perl 5.8, slurping the entire file as UTF-8 computes the correct answer just as slowly as Prelude. In my first ever Python program (with python 2.6), I tried to read the entire file as a unicode string and it quickly crashes due to running out of memory (yikes!), so it earns a DNF. So, for computing the right answer with this simple test, it looks like the Prelude is the best option. We tie with Perl and lose only to GNU wc (which is written in C). Really, though, it would be nice to close that gap. -- Nick Bowler, Elliptic Technologies (http://www.elliptictech.com/) _______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe