Beginners Digest, Vol 22, Issue 16

beginners-request Sun, 11 Apr 2010 09:04:26 -0700

Today's Topics:

   1. Re: When to use ByteString rather than [Char] ...?
      (Stephen Tetley)
   2. Re: When to use ByteString rather than [Char] ...?
      (Maciej Piechotka)
   3. Re: When to use ByteString rather than [Char] ...?
      (Daniel Fischer)
   4. Re: When to use ByteString rather than [Char] ...?
      (Felipe Lessa)
   5. Re: When to use ByteString rather than [Char] ...?
      (Stephen Tetley)
   6. Re: Re: When to use ByteString rather than [Char] ...?
      (Daniel Fischer)
   7. Re: When to use ByteString rather than [Char] ...?
      (Daniel Fischer)
   8. Re: Re: When to use ByteString rather than [Char] ...?
      (Maciej Piechotka)


----------------------------------------------------------------------

Message: 1
Date: Sun, 11 Apr 2010 14:29:38 +0100
From: Stephen Tetley <[email protected]>
Subject: Re: [Haskell-beginners] When to use ByteString rather than
        [Char] ...      ?
Cc: [email protected]
Message-ID:
        <[email protected]>
Content-Type: text/plain; charset=ISO-8859-1

Hi James

There's a paper describing the implementation of ByteStrings here:

http://www.cse.unsw.edu.au/~dons/papers/CSL06.html
http://www.cse.unsw.edu.au/~dons/papers/fusion.pdf

For my own work, I generally need short immutable strings and haven't
found ByteStrings compelling, though the results presented in the
above suggest [Char] is better at nothing and worse at many things.
[Hmm - insert emoticon here]

Best wishes

Stephen


------------------------------

Message: 2
Date: Sun, 11 Apr 2010 15:31:52 +0200
From: Maciej Piechotka <[email protected]>
Subject: [Haskell-beginners] Re: When to use ByteString rather than
        [Char]  ... ?
To: [email protected]
Message-ID: <1270992712.5565.43.ca...@picard>
Content-Type: text/plain; charset="utf-8"

On Sun, 2010-04-11 at 12:07 +0100, James Fisher wrote:
> Hi,
> 
> 
> After working through a few Haskell tutorials, I've come across
> numerous recommendations to use the Data.ByteString library rather
> than standard [Char], for reasons of "performance".  I'm having
> trouble swallowing this -- presumably the standard String is default
> for good reasons.  Nothing has answered this question: in what case is
> it better to use [Char]?  
> 

In most cases, I believe, you need an actual String and the code is not
time-critical. A ByteString is... well, a string of bytes, not chars: you
have no idea whether they are encoded as utf-8, ucs-2, ascii, iso-8859-1
(or as jpeg ;) ). If you want the next char, you don't know how many bytes
you need to read (1? 2? 3? Does it depend on the contents?).
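
To make the "how many bytes is the next char" point concrete, here is a
minimal sketch. It assumes UTF-8 as the encoding and uses the 'text'
package to do the encoding; the helper name utf8Widths is made up for
this illustration.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- Number of bytes each character of the input occupies once it is
-- encoded as UTF-8 (hypothetical helper, not a library function).
utf8Widths :: String -> [Int]
utf8Widths = map (B.length . TE.encodeUtf8 . T.singleton)
```

In GHCi, utf8Widths "a\233\8364" (the letter a, e-acute and the euro
sign) gives [1,2,3], so a reader working on raw bytes cannot find the
next char without looking at the contents.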

String ([Char]) has a defined representation. A read/write function might
encode or decode it incorrectly (up to GHC 6.12, System.IO assumed ascii
encoding on read, IIRC), but that is the function's error.

> Could anyone point me to a good resource showing the differences
> between how [Char] and ByteString are implemented, and giving good a
> heuristic for me to decide which is better in any one case?
> 

A ByteString is a pointer with an offset and a length. A lazy ByteString is
a linked list of ByteStrings (with the additional condition that none of
the inner ByteStrings is empty).
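
The chunk structure can be observed from the outside with the public
fromChunks/toChunks functions of Data.ByteString.Lazy. This is only a
small sketch; the name roundTripChunks is made up for this illustration.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Lazy as BL

-- Build a lazy ByteString from strict chunks and read the chunks back.
-- Empty chunks in the input do not appear in the result.
roundTripChunks :: [B.ByteString] -> [B.ByteString]
roundTripChunks = BL.toChunks . BL.fromChunks
```

For example, roundTripChunks [B.pack [1,2], B.empty, B.pack [3]] returns
only the two non-empty chunks.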

In theory String is [Char] i.e. [a] i.e.

data [a] = [] | a:[a]

In other words, it is a linked list of characters. For long strings that
may be inefficient (because of cache behaviour, O(n) random access and the
necessity of checking for errors while evaluating further[1]).
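
As a rough sketch of why random access is O(n), here is a hand-written
copy of the list type with an indexing function. The names List, Nil,
Cons, fromList and index are made up for this illustration; the built-in
type uses [] and (:) instead.

```haskell
-- A user-defined copy of the built-in list type.
data List a = Nil | Cons a (List a)

-- Convert an ordinary list into the hand-written one.
fromList :: [a] -> List a
fromList = foldr Cons Nil

-- Indexing follows one cell per step, so reaching element n
-- takes n steps. Out-of-range indices give Nothing.
index :: List a -> Int -> Maybe a
index Nil _ = Nothing
index (Cons x rest) n
  | n < 0     = Nothing
  | n == 0    = Just x
  | otherwise = index rest (n - 1)
```

With this, index (fromList "abc") 2 has to walk past 'a' and 'b' before
it can return Just 'c'.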

I heard somewhere that actual implementations optimize it to arrays
when that is possible (i.e. when it can be detected and does not mess with
non-strict semantics). However, I don't know if that is true.

I *guess* that in most cases the overhead on I/O will be sufficiently
great to make the difference insignificant. However:

- If you need the exact byte representation, for example for compression,
digital signatures etc., you need ByteString.
- If you need to operate on text rather than bytes, use String or
specialized data structures such as Data.Text & co.
- If you don't care about performance and need ease of use (pattern
matching etc.), use String.
- If you have no special requirements, then you can use ByteString.

While some languages (for example C, Python, Ruby) mix the text and
its representation, I guess that is not always the best way. With such a
separation, String is text while ByteString is a binary representation
of something (it can be text, a picture, compressed data etc.).

> 
> Best,
> 
> 
> James Fisher

Regards

[1] However, the O(n) access time and the error checking are still
introduced by decoding the string. So if you need UTF-8 you will still get
the O(n) access time ;)


------------------------------

Message: 3
Date: Sun, 11 Apr 2010 16:42:43 +0200
From: Daniel Fischer <[email protected]>
Subject: Re: [Haskell-beginners] When to use ByteString rather than
        [Char] ...?
To: [email protected]
Message-ID: <[email protected]>
Content-Type: text/plain;  charset="iso-8859-1"

On Sunday 11 April 2010 15:29:38, Stephen Tetley wrote:
> Hi James
>
> There's a paper describing the implementation of ByteStrings here:
>
> http://www.cse.unsw.edu.au/~dons/papers/CSL06.html
> http://www.cse.unsw.edu.au/~dons/papers/fusion.pdf
>
> For my own work, I generally need short immutable strings and haven't
> found ByteStrings compelling,

ByteStrings shine for long strings.
When you're using long strings, ByteStrings almost certainly are *much* 
faster (utf8-ByteStrings are probably significantly slower, but should 
still beat [Char] comfortably).

I've found ByteStrings better than [Char] when dealing with short strings 
only for a few things (e.g. as keys of Maps, ByteStrings tend to be better 
[at least if using ByteStrings there doesn't introduce too much packing and 
unpacking], things like edit-distance are faster on ByteStrings;
UArray Int Char is slower than ByteString [in my measurements] for these 
tasks, but it can also be used for characters > toEnum 255 and isn't too 
much slower).
Other things [see below] were faster for short [Char] than for short 
ByteStrings.
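
As an illustration of the Map-keys case, here is a minimal word-count
sketch with ByteString keys. It assumes 8-bit text (Data.ByteString.Char8)
and the 'containers' package; the name wordCounts is made up for this
illustration, and it says nothing about the timings mentioned above.

```haskell
import qualified Data.ByteString.Char8 as BC
import qualified Data.Map.Strict as M

-- Count how often each whitespace-separated word occurs,
-- using the words themselves (as ByteStrings) as Map keys.
wordCounts :: BC.ByteString -> M.Map BC.ByteString Int
wordCounts = M.fromListWith (+) . map (\w -> (w, 1)) . BC.words
```

The keys are slices of the input, so no packing or unpacking happens
after the initial read.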

When dealing with short strings, in my experience there are rarely 
compelling reasons to choose one over the other.

> though the results presented in the
> above suggest [Char] is better at nothing

[Char] is (far) better at sorting short Strings; it often is better for map 
and filter.

> and worse at many things.

[Char]-IO is abysmally slow in comparison, [Char] uses much more memory, 
random access is horrible for lists.

> [Hmm - insert emoticon here]
>
> Best wishes
>
> Stephen


------------------------------

Message: 4
Date: Sun, 11 Apr 2010 12:08:18 -0300
From: Felipe Lessa <[email protected]>
Subject: Re: [Haskell-beginners] When to use ByteString rather than
        [Char]  ... ?
To: James Fisher <[email protected]>
Cc: [email protected]
Message-ID: <[email protected]>
Content-Type: text/plain; charset=us-ascii

On Sun, Apr 11, 2010 at 12:07:34PM +0100, James Fisher wrote:
> use the Data.ByteString library rather than standard [Char]

These are different.

IN THE PAST:
  - We used String for everything, both strings and binary data.

IN RECENT PAST:
  - We used String for treating... strings of characters.
  - We used ByteString for binary data.

  - To read a UTF-8 string we used a package like utf8-string:
     1) Read file as ByteString.
     2) Convert UTF-8 bytes into String, a list of Chars.

TODAY:
  - ByteString is used for binary data.
  - String is used for text when performance isn't critical.
  - Data.Text (from package 'text') is used for text when time
    and/or space efficiency is needed.

Data.Text uses the same 'tricks' as ByteString, but while the
latter encodes bytes, the former encodes Chars.
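
The "read bytes, then decode" steps might look like this today with the
'text' package instead of utf8-string. This is a sketch only: the names
decodeLenient and readUtf8 are made up, and choosing lenientDecode
(invalid bytes become U+FFFD) is an assumption, not a recommendation.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE
import Data.Text.Encoding.Error (lenientDecode)

-- Step 2: interpret raw bytes as UTF-8 text.
decodeLenient :: B.ByteString -> T.Text
decodeLenient = TE.decodeUtf8With lenientDecode

-- Steps 1 and 2 together: read the file as bytes, then decode.
readUtf8 :: FilePath -> IO T.Text
readUtf8 path = fmap decodeLenient (B.readFile path)
```

The four bytes 0x68 0x69 0xC3 0xA9 decode to a three-character Text,
because the last two bytes together encode a single e-acute.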

HTH,

--
Felipe.


------------------------------

Message: 5
Date: Sun, 11 Apr 2010 16:15:22 +0100
From: Stephen Tetley <[email protected]>
Subject: Re: [Haskell-beginners] When to use ByteString rather than
        [Char]  ...?
Cc: [email protected]
Message-ID:
        <[email protected]>
Content-Type: text/plain; charset=ISO-8859-1

On 11 April 2010 15:42, Daniel Fischer <[email protected]> wrote:
[SNIP]
> When dealing with short strings, in my experience there are rarely
> compelling reasons to choose one over the other.


Hi Daniel

Thanks - I was slightly surprised at the results in the paper because
the 'cons' test came out equal. I thought bytestrings had to do a bit
more work for a 'cons': looking at the code, lazy bytestring uses one
constructor and a bit of C memory poking, and the C memory poking is
more than I'd expect the [Char] version to do.

The only 'determinant' I've found for choosing which type for short
strings is if I'm using a library that forces one or the other on me,
otherwise I'm swayed by the simplicity of [Char].

Best wishes

Stephen


------------------------------

Message: 6
Date: Sun, 11 Apr 2010 17:17:00 +0200
From: Daniel Fischer <[email protected]>
Subject: Re: [Haskell-beginners] Re: When to use ByteString rather
        than [Char]     ... ?
To: [email protected]
Message-ID: <[email protected]>
Content-Type: text/plain;  charset="utf-8"

On Sunday 11 April 2010 15:31:52, Maciej Piechotka wrote:
> On Sun, 2010-04-11 at 12:07 +0100, James Fisher wrote:
> > Hi,
> >
> >
> > After working through a few Haskell tutorials, I've come across
> > numerous recommendations to use the Data.ByteString library rather
> > than standard [Char], for reasons of "performance".  I'm having
> > trouble swallowing this -- presumably the standard String is default
> > for good reasons.

But performance is not one of those reasons.
The choice to make String a synonym for [Char] instead of a specialised 
datatype allows you to easily manipulate strings with the plethora of list-
processing functions from the standard libraries.
However, that means some things can't be fast and there's a significant 
space overhead for string handling.
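
A small sketch of what that convenience looks like in practice: pattern
matching and ordinary list functions work on String directly. The names
capitalize, titleCase and lettersOnly are made up for this illustration.

```haskell
import Data.Char (isAlpha, toUpper)

-- Pattern matching on the list structure of a String.
capitalize :: String -> String
capitalize []       = []
capitalize (c : cs) = toUpper c : cs

-- words, map and unwords are plain list/String functions from the Prelude.
titleCase :: String -> String
titleCase = unwords . map capitalize . words

-- filter works on a String exactly as on any other list.
lettersOnly :: String -> String
lettersOnly = filter isAlpha
```

For instance, titleCase "hello list world" yields "Hello List World".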

> > Nothing has answered this question: in what case is
> > it better to use [Char]?
>
> In most cases, I believe, you need an actual String and the code is not
> time-critical. A ByteString is... well, a string of bytes, not chars: you
> have no idea whether they are encoded as utf-8, ucs-2, ascii, iso-8859-1
> (or as jpeg ;) ). If you want the next char, you don't know how many bytes
> you need to read (1? 2? 3? Does it depend on the contents?).

And
- sorting short strings
- for using map or filter, [Char] is often superior

>
> String ([Char]) has a defined representation. A read/write function might
> encode or decode it incorrectly (up to GHC 6.12, System.IO assumed ascii
> encoding on read, IIRC), but that is the function's error.
>
> > Could anyone point me to a good resource showing the differences
> > between how [Char] and ByteString are implemented, and giving good a
> > heuristic for me to decide which is better in any one case?
>
> A ByteString is a pointer with an offset and a length. A lazy ByteString is
> a linked list of ByteStrings (with the additional condition that none of
> the inner ByteStrings is empty).

And it's a head-strict list, not the usual lazy Haskell list.

>
> In theory String is [Char] i.e. [a] i.e.
>
> data [a] = [] | a:[a]
>
> In other words, it is a linked list of characters. For long strings that
> may be inefficient (because of cache behaviour, O(n) random access and the
> necessity of checking for errors while evaluating further[1]).
>
> I heard somewhere that actual implementations optimize it to arrays
> when that is possible (i.e. when it can be detected and does not mess with
> non-strict semantics). However, I don't know if that is true.

I've never heard of that before, so I'm skeptical.

>
> I *guess* that in most cases the overhead on I/O will be sufficiently
> great to make the difference insignificant. However:

Which difference?

Try reading large files. Count the lines or something else, as long as it's 
simple. The speed difference between ByteString-IO and [Char]-IO is 
enormous.
When you do something more complicated the difference in IO-speed may 
become insignificant.
On the other hand, when you're appending a lot of short lines to a file one 
by one, there's a good chance that [Char]-IO is actually faster.
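
If you want to try the line-counting experiment yourself, a minimal
sketch of the two variants could look like this. The function names are
made up, no timings are claimed here, and note that the two counts differ
by one when the last line has no trailing newline.

```haskell
import qualified Data.ByteString.Lazy.Char8 as BLC

-- Count newline bytes in a lazy ByteString.
countLinesBS :: BLC.ByteString -> Int
countLinesBS = fromIntegral . BLC.count '\n'

-- Count lines of a String with the Prelude's lines function.
countLinesStr :: String -> Int
countLinesStr = length . lines

-- The same two counts applied to a file on disk.
countFileBS :: FilePath -> IO Int
countFileBS path = fmap countLinesBS (BLC.readFile path)

countFileStr :: FilePath -> IO Int
countFileStr path = fmap countLinesStr (readFile path)
```

Running countFileBS and countFileStr on the same large file, for example
under GHCi's :set +s, gives a feel for the difference on your own data.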

>
> - If you need the exact byte representation, for example for compression,
> digital signatures etc., you need ByteString.
> - If you need to operate on text rather than bytes, use String or
> specialized data structures such as Data.Text & co.
> - If you don't care about performance and need ease of use (pattern
> matching etc.), use String.
> - If you have no special requirements, then you can use ByteString.
>
> While some languages (for example C, Python, Ruby) mix the text and
> its representation, I guess that is not always the best way. With such a
> separation, String is text while ByteString is a binary representation
> of something (it can be text, a picture, compressed data etc.).
>
> > Best,
> >
> >
> > James Fisher
>
> Regards
>
> [1] However, the O(n) access time and the error checking are still
> introduced by decoding the string. So if you need UTF-8 you will still get
> the O(n) access time ;)

It might then be a good idea to use a UArray Int Char if you need repeated 
random access.
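
A minimal sketch of that suggestion, using Data.Array.Unboxed from the
'array' package. The names toCharArray and charAt are made up for this
illustration.

```haskell
import Data.Array.Unboxed (UArray, bounds, listArray, (!))

-- Copy a String into an unboxed array indexed from 0.
toCharArray :: String -> UArray Int Char
toCharArray s = listArray (0, length s - 1) s

-- Constant-time lookup with a bounds check; Nothing when out of range.
charAt :: UArray Int Char -> Int -> Maybe Char
charAt arr i
  | i >= lo && i <= hi = Just (arr ! i)
  | otherwise          = Nothing
  where
    (lo, hi) = bounds arr
```

Building the array is O(n) once; after that each charAt is a direct
lookup, and characters above 255 (such as '\955') are stored unchanged.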



------------------------------

Message: 7
Date: Sun, 11 Apr 2010 17:55:14 +0200
From: Daniel Fischer <[email protected]>
Subject: Re: [Haskell-beginners] When to use ByteString rather than
        [Char] ...?
To: [email protected]
Message-ID: <[email protected]>
Content-Type: text/plain;  charset="iso-8859-1"

On Sunday 11 April 2010 17:15:22, Stephen Tetley wrote:
> On 11 April 2010 15:42, Daniel Fischer <[email protected]> wrote:
> [SNIP]
>
> > When dealing with short strings, in my experience there are rarely
> > compelling reasons to choose one over the other.
>
> Hi Daniel
>
> Thanks - I was slightly surprised at the results in the paper because
> the 'cons' test came out equal. I thought bytestrings had to do a bit
> more work for a 'cons': looking at the code, lazy bytestring uses one
> constructor and a bit of C memory poking, and the C memory poking is
> more than I'd expect the [Char] version to do.

Well, I guess it depends on what actually happens with fusion (a single 
cons doesn't take significant time for either). If repeated conses lead to 
a chain of one-element chunks, I'd expect that to be significantly slower 
than [Char], but if it's rewritten to

- allocate a new chunk,
- write from end and decrement offset counter,

it shouldn't be slower.
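
One way to look at the chunk question without relying on fusion is to
count the chunks after repeated conses. This sketch assumes the behaviour
of cons and cons' in Data.ByteString.Lazy as I understand it from the
bytestring sources (cons starts a new chunk, cons' may extend a small
first chunk); the helper names are made up.

```haskell
import qualified Data.ByteString.Lazy as BL
import Data.Word (Word8)

-- Number of chunks after consing the bytes one by one with cons.
chunksAfterCons :: [Word8] -> Int
chunksAfterCons = length . BL.toChunks . foldr BL.cons BL.empty

-- The same with cons', which may coalesce onto the first chunk.
chunksAfterCons' :: [Word8] -> Int
chunksAfterCons' = length . BL.toChunks . foldr BL.cons' BL.empty
```

Comparing chunksAfterCons [1,2,3] with chunksAfterCons' [1,2,3] shows
whether a chain of one-element chunks is being built.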

>
> The only 'determinant' I've found for choosing which type for short
> strings is if I'm using a library that forces one or the other on me,

Sure, that's pretty compelling - as long as you don't need two libraries 
with different choices :)

> otherwise I'm swayed by the simplicity of [Char].
>
> Best wishes
>
> Stephen


------------------------------

Message: 8
Date: Sun, 11 Apr 2010 18:04:14 +0200
From: Maciej Piechotka <[email protected]>
Subject: [Haskell-beginners] Re: Re: When to use ByteString rather
        than    [Char] ... ?
To: [email protected]
Message-ID: <1271001853.6703.19.ca...@picard>
Content-Type: text/plain; charset="utf-8"

On Sun, 2010-04-11 at 17:17 +0200, Daniel Fischer wrote:
> 
> >
> > I *guess* that in most cases the overhead on I/O will be sufficiently
> > great to make the difference insignificant. However:
> 
> ? which difference?
> 
> Try reading large files.

Well, large files are not unimportant, but IIRC most files are small
(< 4 KiB), at least on *nix file systems (at least that's the core
'idea' of the reiserfs/reiser4 filesystems).

I guess that for large strings something like text (I think I mentioned
it) is better.

> Count the lines or something else, as long as it's 
> simple. The speed difference between ByteString-IO and [Char]-IO is 
> enormous.
> When you do something more complicated the difference in IO-speed may 
> become insignificant.

Hmm. As newline is a single-byte character in most encodings, that is
believable. However, what is the difference when counting chars (not bytes,
chars)? I wouldn't be surprised if the difference was smaller.

Of course:
 - I haven't done any tests. I guessed (as I wrote).
 - Nobody said what the typical case is.
 - Nor what a 'significant' difference is.

Regards

------------------------------

_______________________________________________
Beginners mailing list
[email protected]
http://www.haskell.org/mailman/listinfo/beginners


End of Beginners Digest, Vol 22, Issue 16
*****************************************