Re: [Haskell-cafe] Copying Arrays

2008-05-30 Thread Ketil Malde
Duncan Coutts [EMAIL PROTECTED] writes:

 Because I'm writing the Unicode-friendly ByteString =p

 He's designing a proper Unicode type along the lines of ByteString.

So - storing 22.5 bit code points instead of 8-bit quantities?  Or
storing whatever representation from the input, and providing a nice
interface on top?

 Perhaps I'm not understanding.  Why wouldn't you use ByteString for I/O,

Like everybody else, my first reaction is to put a layer (like Char8)
on top of lazy bytestrings.  For variable-length encodings, you lose
direct indexing, but I think this is not very common, and if you need
it, you should convert to a fixed length encoding instead.  Since a BS
is basically a (pointer to array,offset,length) triple, it should be
relatively easy to ensure that you don't break a wide char between
chunks by adjusting the length (which doesn't have to match the
actual array length).

 The reason we do not want to re-use ByteString as the underlying
 representation is because they're not good for short strings and we
 expect that for Unicode text (more than arbitrary blobs of binary data)
 people will want efficient short strings.

I guess this is where I don't follow: why would you need more short
strings for Unicode text than for ASCII or 8-bit latin text?

-k
-- 
If I haven't seen further, it is by standing in the footprints of giants
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re[2]: [Haskell-cafe] Copying Arrays

2008-05-30 Thread Bulat Ziganshin
Hello Ketil,

Friday, May 30, 2008, 11:00:10 AM, you wrote:

 I guess this is where I don't follow: why would you need more short
 strings for Unicode text than for ASCII or 8-bit latin text?

BS package require too much memory for storing short strings.
alternative package that minimizes memory would be useful


-- 
Best regards,
 Bulatmailto:[EMAIL PROTECTED]

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Copying Arrays

2008-05-30 Thread Johan Tibell
On Fri, May 30, 2008 at 9:00 AM, Ketil Malde [EMAIL PROTECTED] wrote:
 Duncan Coutts [EMAIL PROTECTED] writes:
 The reason we do not want to re-use ByteString as the underlying
 representation is because they're not good for short strings and we
 expect that for Unicode text (more than arbitrary blobs of binary data)
 people will want efficient short strings.

 I guess this is where I don't follow: why would you need more short
 strings for Unicode text than for ASCII or 8-bit latin text?

But ByteStrings are neither ASCII nor 8-bit Latin text! The latter
might be internally represented using an 8-bit encoding but saying
that they are the same would be to confuse representation with
intended use. The use case for ByteString is representing a sequence
of bytes used in e.g. fast binary socket and file I/O. The intent of
the not-yet-existing Unicode string is to represent text not bytes.
These are different use cases so having to make different trade-offs
shouldn't come as a surprise. To give just one example, short
(Unicode) strings are common as keys in associative data structures
like maps where fast allocation, small memory footprint and fast
comparison is important. That being said it's entirely possible that I
didn't get your point.

Can I also here insert a plea for keeping lazy I/O out of the new
Unicode module?

-- Johan
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Copying Arrays

2008-05-30 Thread Ketil Malde
Johan Tibell [EMAIL PROTECTED] writes:

 I guess this is where I don't follow: why would you need more short
 strings for Unicode text than for ASCII or 8-bit latin text?

 But ByteStrings are neither ASCII nor 8-bit Latin text! 
  [...] 
 The intent of the not-yet-existing Unicode string is to represent
 text not bytes. 

Right, so this will replace the .Char8 modules as well?  What confused
me was my misunderstanding Duncan to mean that Unicode text would
somehow imply shorter strings than non-Unicode (i.e. 8-bit) text.

 To give just one example, short (Unicode) strings are common as keys
 in associative data structures like maps

I guess typically, you'd break things down to words, so strings of
lenght 4-10 or so.  BS uses three words and LBS four (IIRC), so the
cost of sharing typically outweighs the benefit.

 Can I also here insert a plea for keeping lazy I/O out of the new
 Unicode module?

I use ByteString.Lazy almost exclusively.  I realize it there's a
penalty in time and space, but the ability to write applications that
stream over multi-Gb files is essential.

Of course, these applications couldn't care less about Unicode, so
perhaps the usage is different.

-k
-- 
If I haven't seen further, it is by standing in the footprints of giants
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Copying Arrays

2008-05-30 Thread Johan Tibell
Hi!

On Fri, May 30, 2008 at 10:38 AM, Ketil Malde [EMAIL PROTECTED] wrote:
 Johan Tibell [EMAIL PROTECTED] writes:
 The intent of the not-yet-existing Unicode string is to represent
 text not bytes.

 Right, so this will replace the .Char8 modules as well?  What confused
 me was my misunderstanding Duncan to mean that Unicode text would
 somehow imply shorter strings than non-Unicode (i.e. 8-bit) text.

Yes.

 To give just one example, short (Unicode) strings are common as keys
 in associative data structures like maps

 I guess typically, you'd break things down to words, so strings of
 lenght 4-10 or so.  BS uses three words and LBS four (IIRC), so the
 cost of sharing typically outweighs the benefit.

I'm not sure if you would have much sharing in a map as the keys will be unique.

 Can I also here insert a plea for keeping lazy I/O out of the new
 Unicode module?

 I use ByteString.Lazy almost exclusively.  I realize it there's a
 penalty in time and space, but the ability to write applications that
 stream over multi-Gb files is essential.

Lazy I/O comes with a penalty in terms of correctness! Pretending that
I/O and the underlying resource allocations (e.g. file handles) aren't
observable is bad. Lazy I/O is kinda, maybe usable for small scripts
that reads a file or two an spits out a result but for servers it
doesn't work at all. Lazy I/O requires unsafe* functions and is
therefore, unsafe. The finalizers required can be arbitrary complex
depending on what kind of resources need to be allocated. The simple
case is a file handle but there's no reason we might need sockets,
locks, etc to create the lazy ByteString. Here are two possible
interfaces for safe I/O. One isstream based one with explicit close
and the other fold based one (i.e. inversion of control):

 import qualified Data.ByteString as S

 -- Stream based I/O.
 class InputStream s where
   read :: s - IO Word8
   readN :: s - Int - IO S.ByteString  -- efficient block reads
   close :: s - IO ()

 openBinaryFile :: InputStream s = FilePath - IO s

or a left fold over the file's content. The 'foldBytes' function can
close the file at EOF.

 -- Left fold/callback based I/O.
 foldBytes :: FilePath - (seed - Word8 - Either seed seed) - seed - IO 
 seed
 -- Efficient block reads.
 foldChunks :: FilePath - (seed - S.ByteString - Either seed seed) - seed 
 - IO seed

on top of this you might want monadic versions of the above two
functions. The case for a Unicode type are analogous.

 Of course, these applications couldn't care less about Unicode, so
 perhaps the usage is different.

The issue of lazy I/O is orthogonal to ByteString vs Unicode(String).

Cheers,

Johan
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Copying Arrays

2008-05-30 Thread Duncan Coutts

On Fri, 2008-05-30 at 10:38 +0200, Ketil Malde wrote:
 Johan Tibell [EMAIL PROTECTED] writes:
 
  I guess this is where I don't follow: why would you need more short
  strings for Unicode text than for ASCII or 8-bit latin text?
 
  But ByteStrings are neither ASCII nor 8-bit Latin text! 
   [...] 
  The intent of the not-yet-existing Unicode string is to represent
  text not bytes. 
 
 Right, so this will replace the .Char8 modules as well? 

No, there is still a use-case for that, namely all these network
protocols and file formats that mix binary and ASCII data.

The only bad thing about the .Char8 modules is how people have picked
them up to represent text exclusively which is a step back from [Char]
in terms of Unicode stuff.

 What confused me was my misunderstanding Duncan to mean that Unicode
 text would somehow imply shorter strings than non-Unicode (i.e. 8-bit)
 text.

ByteString is already not a good choice for short ASCII text.

Duncan

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Copying Arrays

2008-05-30 Thread Isaac Dupree
it makes me wonder: can we support concatenation with sharing (e.g. the 
rope data structure)

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Copying Arrays

2008-05-30 Thread Ketil Malde
Johan Tibell [EMAIL PROTECTED] writes:

 Lazy I/O comes with a penalty in terms of correctness! 

Is there a page describing this in more detail?  I believe my programs
to be correct, but would like to know if there are additional
pitfalls, beyond hClosing a handle with outstanding (unevaluated)
reads. 

 Lazy I/O is kinda, maybe usable for small scripts that reads a file
 or two and spits out a result but for servers it doesn't work at
 all.

So don't use it for servers, then?

 Here are two possible interfaces for safe I/O. One isstream based
 one with explicit close and the other fold based one (i.e. inversion
 of control):

Say I have two files with variable-size records, that need to be
merged.  In other words, I can do something like:

  do
xs - readFile X
ys - readFile Y
return $ zipWith combine (parseX xs) (parseY ys)

Clearly in the small script category and not servers, but is it
still unsafe?

 -- Stream based I/O.
 -- Left fold/callback based I/O.

Perhaps I don't quite understand the concepts, but I don't see how to
implement the example as elegantly and simply with these approaches?

-k
-- 
If I haven't seen further, it is by standing in the footprints of giants
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Copying Arrays

2008-05-30 Thread Bryan O'Sullivan
Ketil Malde wrote:

 I guess this is where I don't follow: why would you need more short
 strings for Unicode text than for ASCII or 8-bit latin text?

Because ByteString is optimised for dealing with big blobs of binary
data, its performance when you split a big ByteString into a pile of
words is quite poor.  Also, people use ByteStrings for dealing with text
mostly because their native character sets let them get away with it,
not because it's an intrinsically good idea.

b
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


[Haskell-cafe] Copying Arrays

2008-05-29 Thread Tom Harper
I'm trying to implement some file I/O where I can read in a file to an
array, but do so without having to know how much space I will need.
(Before you suggest it, lists are too slow/space consuming.)  I was
thinking that one possible solution is to use a strategy used in
dynamic arrays, where everytime you run out of space you double the
capacity of the array.  Trouble is, most arrays in Haskell would
require you to traverse the entire array and copy each individual
cell, which would make it worthless.

Are there any low-level array types (MutableByteArray#, for example)
that support a memcpy-like function where I could copy the entire
array into the new one instead of copying it one value at a time?  Is
there another solution that I'm missing?


-- 
Tom Harper
MSc Computer Science '08
University of Oxford
Mobile: +44 (0)7533 998 591
Skype: +1 949 273 4627 (harpertom)
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Copying Arrays

2008-05-29 Thread Jefferson Heard
Define too slow, time-consuming?  I've dealt with this problem a lot
at this point, and I've been able to slurp up CSV files of several
gigabytes in the same amount of time as I generally do in C.  I have a
feeling an array solution is just going to bog you down more, however
if you insist, I would also suggest writing your I/O in C and
returning a ForeignPtr to something and work from there.

-- Jeff

On Thu, May 29, 2008 at 12:21 PM, Tom Harper
[EMAIL PROTECTED] wrote:
 I'm trying to implement some file I/O where I can read in a file to an
 array, but do so without having to know how much space I will need.
 (Before you suggest it, lists are too slow/space consuming.)  I was
 thinking that one possible solution is to use a strategy used in
 dynamic arrays, where everytime you run out of space you double the
 capacity of the array.  Trouble is, most arrays in Haskell would
 require you to traverse the entire array and copy each individual
 cell, which would make it worthless.

 Are there any low-level array types (MutableByteArray#, for example)
 that support a memcpy-like function where I could copy the entire
 array into the new one instead of copying it one value at a time?  Is
 there another solution that I'm missing?


 --
 Tom Harper
 MSc Computer Science '08
 University of Oxford
 Mobile: +44 (0)7533 998 591
 Skype: +1 949 273 4627 (harpertom)
 ___
 Haskell-Cafe mailing list
 Haskell-Cafe@haskell.org
 http://www.haskell.org/mailman/listinfo/haskell-cafe




-- 
I try to take things like a crow; war and chaos don't always ruin a
picnic, they just mean you have to be careful what you swallow.

-- Jessica Edwards
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Copying Arrays

2008-05-29 Thread Bulat Ziganshin
Hello Tom,

Thursday, May 29, 2008, 8:21:25 PM, you wrote:

 Are there any low-level array types (MutableByteArray#, for example)
 that support a memcpy-like function

yes, MBA# allows to use memcpy. you can find examples of its
usage right in the array package, for example:

thawSTUArray :: Ix i = UArray i e - ST s (STUArray s i e)
thawSTUArray (UArray l u arr#) = ST $ \s1# -
case sizeofByteArray# arr#  of { n# -
case newByteArray# n# s1#   of { (# s2#, marr# #) -
case unsafeCoerce# memcpy marr# arr# n# s2# of { (# s3#, () #) -
(# s3#, STUArray l u marr# #) }}}

foreign import ccall unsafe memcpy
memcpy :: MutableByteArray# RealWorld - ByteArray# - Int# - IO ()

i can give you further explanations if required

-- 
Best regards,
 Bulatmailto:[EMAIL PROTECTED]

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Copying Arrays

2008-05-29 Thread Adrian Neumann

Isn't fast IO what ByteStrings where invented for?

Adrian

Tom Harper schrieb:

I'm trying to implement some file I/O where I can read in a file to an
array, but do so without having to know how much space I will need.
(Before you suggest it, lists are too slow/space consuming.)  I was
thinking that one possible solution is to use a strategy used in
dynamic arrays, where everytime you run out of space you double the
capacity of the array.  Trouble is, most arrays in Haskell would
require you to traverse the entire array and copy each individual
cell, which would make it worthless.

Are there any low-level array types (MutableByteArray#, for example)
that support a memcpy-like function where I could copy the entire
array into the new one instead of copying it one value at a time?  Is
there another solution that I'm missing?







signature.asc
Description: OpenPGP digital signature
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Copying Arrays

2008-05-29 Thread Jefferson Heard
Exactly.  Someone on the list gave me this example awhile back for
reading CSV files.  I can process a gigabyte (simple unpack and print
to dev/null for IO testing purposes) in about two and a half seconds
using this code.

import Data.ByteString.Lazy.Char8 as C

-- | Read a datafile and turn it into lists of columns
readDatafileAndTranspose name = do
sheet - (transpose . map (C.split '\t') . C.lines) `fmap` C.readFile name
return $ foldl' go M.empty sheet
  where go m (x:xs) = M.insert (C.unpack x) xs m


2008/5/29 Adrian Neumann [EMAIL PROTECTED]:
 Isn't fast IO what ByteStrings where invented for?

 Adrian

 Tom Harper schrieb:

 I'm trying to implement some file I/O where I can read in a file to an
 array, but do so without having to know how much space I will need.
 (Before you suggest it, lists are too slow/space consuming.)  I was
 thinking that one possible solution is to use a strategy used in
 dynamic arrays, where everytime you run out of space you double the
 capacity of the array.  Trouble is, most arrays in Haskell would
 require you to traverse the entire array and copy each individual
 cell, which would make it worthless.

 Are there any low-level array types (MutableByteArray#, for example)
 that support a memcpy-like function where I could copy the entire
 array into the new one instead of copying it one value at a time?  Is
 there another solution that I'm missing?





 ___
 Haskell-Cafe mailing list
 Haskell-Cafe@haskell.org
 http://www.haskell.org/mailman/listinfo/haskell-cafe





-- 
I try to take things like a crow; war and chaos don't always ruin a
picnic, they just mean you have to be careful what you swallow.

-- Jessica Edwards
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Copying Arrays

2008-05-29 Thread Bryan O'Sullivan
Tom Harper wrote:
 I'm trying to implement some file I/O where I can read in a file to an
 array, but do so without having to know how much space I will need.

Why not just read it into a lazy ByteString?  Are you looking to use an
array with elements of a different type?  You could then convert it to a
strict ByteString.

b
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Copying Arrays

2008-05-29 Thread Tom Harper
 Why not just read it into a lazy ByteString?  Are you looking to use an
 array with elements of a different type?  You could then convert it to a
 strict ByteString.

b


Because I'm writing the Unicode-friendly ByteString =p


-- 
Tom Harper
MSc Computer Science '08
University of Oxford
Mobile: +44 (0)7533 998 591
Skype: +1 949 273 4627 (harpertom)
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Copying Arrays

2008-05-29 Thread Bryan O'Sullivan
Tom Harper wrote:

 Because I'm writing the Unicode-friendly ByteString =p

Perhaps I'm not understanding.  Why wouldn't you use ByteString for I/O,
even if you're writing a different library?  After all, ByteString's own
internals use memcpy.

b
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Copying Arrays

2008-05-29 Thread Jed Brown
On Thu 2008-05-29 19:19, Tom Harper wrote:
  Why not just read it into a lazy ByteString?  Are you looking to use an
  array with elements of a different type?  You could then convert it to a
  strict ByteString.
 
 Because I'm writing the Unicode-friendly ByteString =p

Uh, ByteString is Unicode-agnostic.  ByteString.Char8 is not.  So why not do IO
with lazy ByteString and parse into your own representation (which might look a
lot like StorableVector)?

Jed


pgpCfBIxeQ4l0.pgp
Description: PGP signature
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Copying Arrays

2008-05-29 Thread Duncan Coutts

On Thu, 2008-05-29 at 11:25 -0700, Bryan O'Sullivan wrote:
 Tom Harper wrote:
 
  Because I'm writing the Unicode-friendly ByteString =p
 
 Perhaps I'm not understanding.  Why wouldn't you use ByteString for I/O,
 even if you're writing a different library?  After all, ByteString's own
 internals use memcpy.

Just to clarify Tom's question...

He's designing a proper Unicode type along the lines of ByteString.
However it does not use a ByteString underneath so we cannot re-use the
IO operations from ByteString, though obviously we can steal ideas.

The reason we do not want to re-use ByteString as the underlying
representation is because they're not good for short strings and we
expect that for Unicode text (more than arbitrary blobs of binary data)
people will want efficient short strings.

So instead of using a ForeignPtr to pinned heap memory (read: slow
allocation, lots of fragmentation) we're looking at using unpinned heap
arrays. So that means MutableByteArr# (or STUArray for the first
prototype).

We cannot use memcpy because it operates on raw pointers but that's no
good for a movable heap object like a ByteArr#.

So that's the background to the question.

I think the answer is either that there's a primitive to do this copying
in the GHC.* libs or we'll need to add one.

Duncan

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re[2]: [Haskell-cafe] Copying Arrays

2008-05-29 Thread Bulat Ziganshin
Hello Duncan,

Thursday, May 29, 2008, 10:57:02 PM, you wrote:

 We cannot use memcpy because it operates on raw pointers but that's no
 good for a movable heap object like a ByteArr#.

it's false assumption - memcpy is used for copying non-pinned arrays
as in the code from Data.Array.Base i've citated. the idea is that
GC shouldn't occur between getting array address and actual memcpy
call. you may consult Simon Marlow

-- 
Best regards,
 Bulatmailto:[EMAIL PROTECTED]

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Copying Arrays

2008-05-29 Thread Don Stewart
duncan.coutts:
 
 On Thu, 2008-05-29 at 11:25 -0700, Bryan O'Sullivan wrote:
  Tom Harper wrote:
  
   Because I'm writing the Unicode-friendly ByteString =p
  
  Perhaps I'm not understanding.  Why wouldn't you use ByteString for I/O,
  even if you're writing a different library?  After all, ByteString's own
  internals use memcpy.
 
 Just to clarify Tom's question...
 
 He's designing a proper Unicode type along the lines of ByteString.
 However it does not use a ByteString underneath so we cannot re-use the
 IO operations from ByteString, though obviously we can steal ideas.
 
 The reason we do not want to re-use ByteString as the underlying
 representation is because they're not good for short strings and we
 expect that for Unicode text (more than arbitrary blobs of binary data)
 people will want efficient short strings.
 
 So instead of using a ForeignPtr to pinned heap memory (read: slow
 allocation, lots of fragmentation) we're looking at using unpinned heap
 arrays. So that means MutableByteArr# (or STUArray for the first
 prototype).
 
 We cannot use memcpy because it operates on raw pointers but that's no
 good for a movable heap object like a ByteArr#.
 
 So that's the background to the question.
 
 I think the answer is either that there's a primitive to do this copying
 in the GHC.* libs or we'll need to add one.

There are these guys,

foreign import ccall unsafe __hscore_memcpy_dst_off
   memcpy_baoff_ba :: RawBuffer - CInt - RawBuffer - CSize - IO (Ptr ())

foreign import ccall unsafe __hscore_memcpy_dst_off
  memcpy_baoff_ptr :: RawBuffer - CInt - Ptr a - CSize - IO (Ptr ())

type RawBuffer = MutableByteArray# RealWorld
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe