Beginners Digest, Vol 22, Issue 17

beginners-request Mon, 12 Apr 2010 00:38:44 -0700

Send Beginners mailing list submissions to
        [email protected]

To subscribe or unsubscribe via the World Wide Web, visit
        http://www.haskell.org/mailman/listinfo/beginners
or, via email, send a message with subject or body 'help' to
        [email protected]


You can reach the person managing the list at
        [email protected]

When replying, please edit your Subject line so it is more specific
than "Re: Contents of Beginners digest..."


Today's Topics:

   1. Re:  Re: Re: When to use ByteString rather than   [Char] ... ?
      (Daniel Fischer)
   2.  Re: Re: Re: When to use ByteString rather than   [Char] ... ?
      (Maciej Piechotka)
   3. Re:  Re: Re: Re: When to use ByteString rather    than [Char]
      ... ? (Daniel Fischer)
   4.  hmatrix, and Float (MAN)
   5.  Space leak in a fold even after the usual fixes (Travis Erdman)


----------------------------------------------------------------------

Message: 1
Date: Sun, 11 Apr 2010 22:07:53 +0200
From: Daniel Fischer <[email protected]>
Subject: Re: [Haskell-beginners] Re: Re: When to use ByteString rather
        than    [Char] ... ?
To: [email protected]
Message-ID: <[email protected]>
Content-Type: text/plain;  charset="utf-8"

Am Sonntag 11 April 2010 18:04:14 schrieb Maciej Piechotka:
> On Sun, 2010-04-11 at 17:17 +0200, Daniel Fischer wrote:
> > > I *guess* that in most cases the overhead on I/O will be
> >
> > sufficiently
> >
> > > great to make the difference insignificant. However:
> >
> > ? which difference?

I meant: difference between ByteString-IO and [Char]-IO or which 
difference?

> >
> > Try reading large files.
>
> Well - while large files are not not-important IIRC most files are small
> (< 4 KiB) - at least on *nix file systems (at least that's the core
> 'idea' of reiserfs/reiser4 filesystems).

Well, sometimes one has to process large files even though most are small.
If the processing itself is simple, IO-speed is important then.

>
> I guess that for large strings something like text (I think I mentioned
> it) is better
>

Unless you know you only have to deal with one-byte characters, when plain 
ByteStrings are the simplest and fastest method.

But those are special cases, in general I agree.

> > Count the lines or something else, as long as it's
> > simple. The speed difference between ByteString-IO and [Char]-IO is
> > enormous.
> > When you do something more complicated the difference in IO-speed may
> > become insignificant.
>
> Hmm. As newline is a single-byte character in most encodings it is
> believable.

You can measure it yourself :)
cat-ing together a few copies of /usr/share/dict/words should give a large 
enough file.

> However what is the difference in counting chars (not bytes
> - chars)? I wouldn't be surprise is difference was smaller.

Nor would I. In fact I'd be surprised if it wasn't smaller. [see below]

This example was meant to illustrate the difference in IO-speed, so an 
extremely simple processing was appropriate. The combination of doing IO 
and processing is something different. If you're doing complicated things, 
IO time has a good chance to become negligible.

>
> Of course:
>  - I haven't done any tests. I guessed (which I written)

I just have done a test.
Input file: "big.txt" from Norvig's spelling checker (6488666 bytes, no 
characters outside latin1 range) and the same with
('\n':map toEnum [256 .. 10000] ++ "\n") appended.

Code:

main = A.readFile "big.txt" >>= print . B.length

where (A,B) is a suitable combination of 
- Data.ByteString[.Lazy][.Char8][.UTF8]
- Data.Text[.IO]
- Prelude

Times:
Data.ByteString[.Lazy]: 0.00s
Data.ByteString.UTF8: 0.14s
Prelude:  0.21s
Data.ByteString.Lazy.UTF8: 0.56s
Data.Text:  0.66s

Of course Data.ByteString didn't count characters but bytes, so for the 
modified file, those printed larger numbers than the others (well, it's 
BYTEString, isn't it?).

It's a little unfair, though, as the ByteString[.Lazy] variants don't need 
to look at each individual byte, so I also let them and Prelude.String 
count newlines to see how fast they can inspect each character/byte,

BS[.Lazy]: 0.02s
Prelude: 0.23s

both take 0.02s to inspect each item.

To summarise:
* ByteString-IO is blazingly fast, since all it has to do is get a sequence 
of bytes from disk into memory.
* [Char]-IO is much slower because it has to transform the sequence of 
bytes to individual characters as they come.
* counting utf-8 encoded characters in a ByteString is - unsurprisingly - 
slow. I'm a bit surprised *how* slow it is for lazy ByteStrings.
(Caveat: I've no idea whether Data.ByteString.UTF8 would suffer from more 
multi-byte characters to the point where String becomes faster. My guess is 
no, not for single traversal. For multiple traversal, String has to 
identify each individual character only once, while BS.UTF8 must do it each 
time, so then String may be faster.)
* Data.Text isn't very fast for that one.

>  - It wasn't written what is the typical case

Aren't there several quite different typical cases?
One fairly typical case is big ASCII or latin1 files (e.g. fasta files, 
numerical data). For those, usually ByteString is by far the best choice.

Another fairly typical case is *text* processing, possibly with text in 
different scripts (latin, hebrew, kanji, ...). Depending on what you want 
to do (and the encoding), any of Prelude.String, Data.Text and 
Data.ByteString[.Lazy].UTF8 may be a good choice, vanilla ByteStrings 
probably aren't. String and Text also have the advantage that you aren't 
tied to utf-8.

Choose your datatype according to your problem, not one size fits all.

>  - What is 'significant' difference

Depends of course. For a task performed once, who cares whether it takes 
one second or three? One hour or three, however, is a significant 
difference (assuming approximately equal times to write the code).
Sometimes 10% difference in performance is important, sometimes a factor of 
10 isn't.

The point is that you should be aware of the performance differences when 
making your choice.

>
> Regards

Cheers,
Daniel



------------------------------

Message: 2
Date: Mon, 12 Apr 2010 01:01:36 +0200
From: Maciej Piechotka <[email protected]>
Subject: [Haskell-beginners] Re: Re: Re: When to use ByteString rather
        than    [Char] ... ?
To: [email protected]
Message-ID: <1271026893.5385.131.ca...@picard>
Content-Type: text/plain; charset="utf-8"

On Sun, 2010-04-11 at 22:07 +0200, Daniel Fischer wrote:
> Am Sonntag 11 April 2010 18:04:14 schrieb Maciej Piechotka:
> >
> > Of course:
> >  - I haven't done any tests. I guessed (which I written)
> 
> I just have done a test.
> Input file: "big.txt" from Norvig's spelling checker (6488666 bytes, no 
> characters outside latin1 range) and the same with
> ('\n':map toEnum [256 .. 10000] ++ "\n") appended.
> 

Converted myspell polish dictonary (a few % of non-ascii chars) added
twice (6531616 bytes).

> Code:
> 
> main = A.readFile "big.txt" >>= print . B.length
> 

{-# LANGUAGE BangPatterns #-}
import Control.Applicative
import qualified Data.ByteString as S
import qualified Data.ByteString.UTF8 as SU
import qualified Data.ByteString.Lazy as L
import qualified Data.ByteString.Lazy.UTF8 as LU
import qualified Data.Text as T
import qualified Data.Text.IO as T
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.IO as TL
import Data.List hiding (find)
import Data.Time.Clock
import System.Mem
import System.IO hiding (readFile)
import Text.Printf
import Prelude hiding (readFile)

readFile :: String -> IO String
readFile p = do h <- openFile p ReadMode
                hSetEncoding h utf8
                hGetContents h
              

measure :: IO a -> IO (NominalDiffTime)
measure a = do performGC
               start <- getCurrentTime
               !_ <- a
               end <- getCurrentTime
               return $! end `diffUTCTime` start

find !x v | fromEnum v == 32 = x + 1
          | otherwise        = x

find' !x 'Ä' = x + 1
find' !x 'Ä' = x + 1
find' !x  _  = x

main = printMeasure "Length - ByteString" (S.length <$> S.readFile
"dict") >>
       printMeasure "Length - Lazy ByteString" (L.length <$> L.readFile
"dict") >>
       printMeasure "Length - String" (length <$> readFile "dict") >>
       printMeasure "Length - UTF8 ByteString" (SU.length <$> S.readFile
"dict") >>
       printMeasure "Length - UTF8 Lazy ByteString" (LU.length <$>
L.readFile "dict") >>
       printMeasure "Length - Text" (T.length <$> T.readFile "dict") >>
       printMeasure "Length - Lazy Text" (TL.length <$> TL.readFile
"dict") >>
       printMeasure "Searching - ByteString" (S.foldl' find 0 <$>
S.readFile "dict") >>
       printMeasure "Searching - ByteString" (L.foldl' find 0 <$>
L.readFile "dict") >>
       printMeasure "Searching - String" (foldl' find 0 <$> readFile
"dict") >>
       printMeasure "Searching - UTF8 ByteString" (SU.foldl find 0 <$>
S.readFile "dict") >>
       printMeasure "Searching - UTF8 Lazy ByteString" (LU.foldl find 0
<$> L.readFile "dict") >>
       printMeasure "Searching - Text" (T.foldl' find 0 <$> T.readFile
"dict") >>
       printMeasure "Searching - Lazy Text" (TL.foldl' find 0 <$>
TL.readFile "dict") >>
       printMeasure "Searching Ä - String" (foldl' find' 0 <$> readFile
"dict") >>
       printMeasure "Searching Ä - UTF8 ByteString" (SU.foldl find' 0 <
$> S.readFile "dict") >>
       printMeasure "Searching Ä - UTF8 Lazy ByteString" (LU.foldl find'
0 <$> L.readFile "dict") >>
       printMeasure "Searching Ä - Text" (T.foldl' find' 0 <$>
T.readFile "dict") >>
       printMeasure "Searching Ä - Lazy Text" (TL.foldl' find' 0 <$>
TL.readFile "dict")

printMeasure :: String -> IO a -> IO ()
printMeasure s a = measure a >>= \v -> printf "%-40s %8.5f s\n" (s ++
":") (realToFrac v :: Float)

> where (A,B) is a suitable combination of 
> - Data.ByteString[.Lazy][.Char8][.UTF8]
> - Data.Text[.IO]
> - Prelude
> 
> Times:
> Data.ByteString[.Lazy]: 0.00s
> Data.ByteString.UTF8: 0.14s
> Prelude:  0.21s
> Data.ByteString.Lazy.UTF8: 0.56s
> Data.Text:  0.66s
> 

                       Optimized:

Length - ByteString:                      0.01223 s
Length - Lazy ByteString:                 0.00328 s
Length - String:                          0.15474 s
Length - UTF8 ByteString:                 0.19945 s
Length - UTF8 Lazy ByteString:            0.30123 s
Length - Text:                            0.70438 s
Length - Lazy Text:                       0.62137 s

String seems to be fastest correct

Searching - ByteString:                   0.04604 s
Searching - ByteString:                   0.04424 s
Searching - String:                       0.18178 s
Searching - UTF8 ByteString:              0.32606 s
Searching - UTF8 Lazy ByteString:         0.42984 s
Searching - Text:                         0.26599 s
Searching - Lazy Text:                    0.37320 s

While ByteString is clear winner String is actually good compared to
others.

Searching Ä - String:                     0.18557 s
Searching Ä - UTF8 ByteString:            0.32752 s
Searching Ä - UTF8 Lazy ByteString:       0.43811 s
Searching Ä - Text:                       0.28401 s
Searching Ä - Lazy Text:                  0.37612 

String is fastest? Hmmm.

                       Compiled:

Length - ByteString:                      0.00861 s
Length - Lazy ByteString:                 0.00409 s
Length - String:                          0.16059 s
Length - UTF8 ByteString:                 0.20165 s
Length - UTF8 Lazy ByteString:            0.31885 s
Length - Text:                            0.70891 s
Length - Lazy Text:                       0.65553 s

ByteString is also clear winner but String once again wins in 'correct'
section.

Searching - ByteString:                   1.27414 s
Searching - ByteString:                   1.27303 s
Searching - String:                       0.56831 s
Searching - UTF8 ByteString:              0.68742 s
Searching - UTF8 Lazy ByteString:         0.75883 s
Searching - Text:                         1.16121 s
Searching - Lazy Text:                    1.76678 s

I mean... what? I may be doing something wrong 

Searching Ä - String:                     0.32612 s
Searching Ä - UTF8 ByteString:            0.41564 s
Searching Ä - UTF8 Lazy ByteString:       0.52919 s
Searching Ä - Text:                       0.87463 s
Searching Ä - Lazy Text:                  1.52369 s

No comment.

                       Intepreted

Length - ByteString:                      0.00511 s
Length - Lazy ByteString:                 0.00378 s
Length - String:                          0.16657 s
Length - UTF8 ByteString:                 0.21639 s
Length - UTF8 Lazy ByteString:            0.33952 s
Length - Text:                            0.79771 s
Length - Lazy Text:                       0.65320 s

As with others.

Searching - ByteString:                   9.12051 s
Searching - ByteString:                   8.94038 s
Searching - String:                       8.57391 s
Searching - UTF8 ByteString:              7.71766 s
Searching - UTF8 Lazy ByteString:         7.79422 s
Searching - Text:                         8.34435 s
Searching - Lazy Text:                    9.07538 s

Now they are pretty much equal.

Searching Ä - String:                     3.17010 s
Searching Ä - UTF8 ByteString:            3.94399 s
Searching Ä - UTF8 Lazy ByteString:       3.92382 s
Searching Ä - Text:                       3.32901 s
Searching Ä - Lazy Text:                  4.18038 s

Hmm. Still the best?

Your test:
                        Optimized  Compiled  Interpreted
ByteString:             0.011      0.011     0.421
ByteString Lazy:        0.006      0.006     0.535
String:                 0.237      0.240     0.650
Text:                   0.767      0.720     1.192
Text Lazy:              0.661      0.614     1.061
ByteString UTF8:        0.204      0.204     0.631
ByteString Lazy UTF8:   0.386      0.309     0.744

System:
Core 2 Duo T9600 2.80 GHz, 2 GiB RAM
Gentoo Linux x86-64.
Linux 2.6.33 + gentoo patches + ck.
Glibc 2.11
GHC 6.12.1
base 4.2.0.0
bytestring 0.9.1.5
text 0.7.1.0
utf8-string 0.3.6

PS. Tests were repeated a few times and each gave similar results.

> 
> >  - It wasn't written what is the typical case
> 
> Aren't there several quite different typical cases?
> One fairly typical case is big ASCII or latin1 files (e.g. fasta files, 
> numerical data). For those, usually ByteString is by far the best choice.
> 

On the other hand - if you load the numerical data it is likely that:
- It will have some labels. The labels can happen to need non-ascii or
non-latin elements
- Biggest time will be spent on operating on numbers then strings.

> Another fairly typical case is *text* processing, possibly with text in 
> different scripts (latin, hebrew, kanji, ...). Depending on what you want 
> to do (and the encoding), any of Prelude.String, Data.Text and 
> Data.ByteString[.Lazy].UTF8 may be a good choice, vanilla ByteStrings 
> probably aren't. String and Text also have the advantage that you aren't 
> tied to utf-8.
> 
> Choose your datatype according to your problem, not one size fits all.
> 

My measurements seems to prefer String but they are probably wrong.

Regards
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 836 bytes
Desc: This is a digitally signed message part
Url : 
http://www.haskell.org/pipermail/beginners/attachments/20100411/5663b187/attachment-0001.bin

------------------------------

Message: 3
Date: Mon, 12 Apr 2010 02:56:40 +0200
From: Daniel Fischer <[email protected]>
Subject: Re: [Haskell-beginners] Re: Re: Re: When to use ByteString
        rather  than [Char] ... ?
To: [email protected]
Cc: Maciej Piechotka <[email protected]>
Message-ID: <[email protected]>
Content-Type: text/plain;  charset="utf-8"

Am Montag 12 April 2010 01:01:36 schrieb Maciej Piechotka:
> On Sun, 2010-04-11 at 22:07 +0200, Daniel Fischer wrote:
> > Am Sonntag 11 April 2010 18:04:14 schrieb Maciej Piechotka:
> > > Of course:
> > >  - I haven't done any tests. I guessed (which I written)
> >
> > I just have done a test.
> > Input file: "big.txt" from Norvig's spelling checker (6488666 bytes,
> > no characters outside latin1 range) and the same with
> > ('\n':map toEnum [256 .. 10000] ++ "\n") appended.
>
> Converted myspell polish dictonary (a few % of non-ascii chars) added
> twice (6531616 bytes).

>
>                        Optimized:
>
> Length - ByteString:                      0.01223 s
> Length - Lazy ByteString:                 0.00328 s
> Length - String:                          0.15474 s
> Length - UTF8 ByteString:                 0.19945 s
> Length - UTF8 Lazy ByteString:            0.30123 s
> Length - Text:                            0.70438 s
> Length - Lazy Text:                       0.62137 s
>
> String seems to be fastest correct

For me, strict UTF8 ByteString was faster than String, but as for you, both 
played in the same league.

>
> Searching - ByteString:                   0.04604 s
> Searching - ByteString:                   0.04424 s
> Searching - String:                       0.18178 s
> Searching - UTF8 ByteString:              0.32606 s
> Searching - UTF8 Lazy ByteString:         0.42984 s
> Searching - Text:                         0.26599 s
> Searching - Lazy Text:                    0.37320 s
>
> While ByteString is clear winner String is actually good compared to
> others.

The ByteStrings should be faster if you use count instead of foldl' find 0.
Anyway, where applicable, ByteStrings are much faster for certain tasks - I 
suspect they wouldn't do so well if one had a (map function . filter 
predicate) on the input.

I'm surprised here that
a) searching takes so much longer than calculating the length for BS.UTF8
b) searching is *much faster* than calculating the length for Text
c) both, UTF8 BS and Text, are slower than String

>
> Searching Ä - String:                     0.18557 s
> Searching Ä - UTF8 ByteString:            0.32752 s
> Searching Ä - UTF8 Lazy ByteString:       0.43811 s
> Searching Ä - Text:                       0.28401 s
> Searching Ä - Lazy Text:                  0.37612
>
> String is fastest? Hmmm.

Not much difference to the previous.

>
>                        Compiled:

Compiled means no optimisations? That's a *bad* idea for ByteStrings and 
Text. You only get the fusion magic with optimisations turned on, without, 
well...

>
> Length - ByteString:                      0.00861 s
> Length - Lazy ByteString:                 0.00409 s
> Length - String:                          0.16059 s
> Length - UTF8 ByteString:                 0.20165 s
> Length - UTF8 Lazy ByteString:            0.31885 s
> Length - Text:                            0.70891 s
> Length - Lazy Text:                       0.65553 s
>
> ByteString is also clear winner but String once again wins in 'correct'
> section.
>
> Searching - ByteString:                   1.27414 s
> Searching - ByteString:                   1.27303 s
> Searching - String:                       0.56831 s
> Searching - UTF8 ByteString:              0.68742 s
> Searching - UTF8 Lazy ByteString:         0.75883 s
> Searching - Text:                         1.16121 s
> Searching - Lazy Text:                    1.76678 s
>
> I mean... what? I may be doing something wrong

Yes, using ByteString and Text without optimisations :)


>
> PS. Tests were repeated a few times and each gave similar results.
>
> > >  - It wasn't written what is the typical case
> >
> > Aren't there several quite different typical cases?
> > One fairly typical case is big ASCII or latin1 files (e.g. fasta
> > files, numerical data). For those, usually ByteString is by far the
> > best choice.
>
> On the other hand - if you load the numerical data it is likely that:
> - It will have some labels. The labels can happen to need non-ascii or
> non-latin elements

Possible. But label-free formats of n columns of numbers are fairly common 
(and easier to handle).

> - Biggest time will be spent on operating on numbers then strings.
>

In that case, it is of course less important which type you use for IO.

> > Another fairly typical case is *text* processing, possibly with text
> > in different scripts (latin, hebrew, kanji, ...). Depending on what
> > you want to do (and the encoding), any of Prelude.String, Data.Text
> > and Data.ByteString[.Lazy].UTF8 may be a good choice, vanilla
> > ByteStrings probably aren't. String and Text also have the advantage
> > that you aren't tied to utf-8.
> >
> > Choose your datatype according to your problem, not one size fits all.
>
> My measurements seems to prefer String but they are probably wrong.

Yes, measurements always are wrong ;)
More seriously, you measured a couple of tasks on one system. For different 
tasks and other systems, different results should be expected.

The best choice depends on the task. For SPOJ problems, ByteStrings are 
what you'll want. For text processing, probably not.
If you want to find a pattern in a long ASCII string, however, they likely 
are.

>
> Regards

Cheers


------------------------------

Message: 4
Date: Sun, 11 Apr 2010 22:37:22 -0300
From: MAN <[email protected]>
Subject: [Haskell-beginners] hmatrix, and Float
To: [email protected]
Message-ID: <1271036242.2166.20.ca...@dy-book>
Content-Type: text/plain; charset="UTF-8"

I noticed that hmatrix defines the auxiliary type 'Element', and
provides instances for Double and Complex, but not Float. Then Vector
and Matrix are defined in terms of Element... is there any reason why
Float is intentionally left out.

I'd like to use Numeric.LinearAlgebra with both single and double
precision so as to compare results and times... could I?



------------------------------

Message: 5
Date: Mon, 12 Apr 2010 00:38:32 -0700 (PDT)
From: Travis Erdman <[email protected]>
Subject: [Haskell-beginners] Space leak in a fold even after the usual
        fixes
To: [email protected]
Message-ID: <[email protected]>
Content-Type: text/plain; charset="iso-8859-1"

The driver of my algorithm looks like this:

foldl' processfn nullarray (take arg infinitelist)

where processfn takes an array and the next set of inputs,
processes, and delivers a new updated array (using the
Data.Vector library).
 
Apparently, I have a space leak ... the "maximum residency"
varies linearly with the size of "arg" supplied, garbage
collection consumes ~75% of CPU time, and, if the arg is too
big, the whole thing crashes with an out of memory
error.  The algorithm should operate in constant
space.

As you can see, I'm using a strict left-fold and also have
made the accumulating array strict in the processfn
definition with a bang pattern.  So, I'm sort of at a
loss as to how to resolve this.

The help provided on this list has been outstanding, thanks
to all of you; hope you have something left in the tank for
this one!
 


      
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 
http://www.haskell.org/pipermail/beginners/attachments/20100412/30e5381e/attachment.html

------------------------------

_______________________________________________
Beginners mailing list
[email protected]
http://www.haskell.org/mailman/listinfo/beginners


End of Beginners Digest, Vol 22, Issue 17
*****************************************

Beginners Digest, Vol 22, Issue 17

Reply via email to