Re: [Haskell-cafe] Copying Arrays
Duncan Coutts [EMAIL PROTECTED] writes: Because I'm writing the Unicode-friendly ByteString =p He's designing a proper Unicode type along the lines of ByteString. So - storing 22.5 bit code points instead of 8-bit quantities? Or storing whatever representation from the input, and providing a nice interface on top? Perhaps I'm not understanding. Why wouldn't you use ByteString for I/O, Like everybody else, my first reaction is to put a layer (like Char8) on top of lazy bytestrings. For variable-length encodings, you lose direct indexing, but I think this is not very common, and if you need it, you should convert to a fixed length encoding instead. Since a BS is basically a (pointer to array,offset,length) triple, it should be relatively easy to ensure that you don't break a wide char between chunks by adjusting the length (which doesn't have to match the actual array length). The reason we do not want to re-use ByteString as the underlying representation is because they're not good for short strings and we expect that for Unicode text (more than arbitrary blobs of binary data) people will want efficient short strings. I guess this is where I don't follow: why would you need more short strings for Unicode text than for ASCII or 8-bit latin text? -k -- If I haven't seen further, it is by standing in the footprints of giants ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re[2]: [Haskell-cafe] Copying Arrays
Hello Ketil, Friday, May 30, 2008, 11:00:10 AM, you wrote: I guess this is where I don't follow: why would you need more short strings for Unicode text than for ASCII or 8-bit latin text? BS package require too much memory for storing short strings. alternative package that minimizes memory would be useful -- Best regards, Bulatmailto:[EMAIL PROTECTED] ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Copying Arrays
On Fri, May 30, 2008 at 9:00 AM, Ketil Malde [EMAIL PROTECTED] wrote: Duncan Coutts [EMAIL PROTECTED] writes: The reason we do not want to re-use ByteString as the underlying representation is because they're not good for short strings and we expect that for Unicode text (more than arbitrary blobs of binary data) people will want efficient short strings. I guess this is where I don't follow: why would you need more short strings for Unicode text than for ASCII or 8-bit latin text? But ByteStrings are neither ASCII nor 8-bit Latin text! The latter might be internally represented using an 8-bit encoding but saying that they are the same would be to confuse representation with intended use. The use case for ByteString is representing a sequence of bytes used in e.g. fast binary socket and file I/O. The intent of the not-yet-existing Unicode string is to represent text not bytes. These are different use cases so having to make different trade-offs shouldn't come as a surprise. To give just one example, short (Unicode) strings are common as keys in associative data structures like maps where fast allocation, small memory footprint and fast comparison is important. That being said it's entirely possible that I didn't get your point. Can I also here insert a plea for keeping lazy I/O out of the new Unicode module? -- Johan ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Copying Arrays
Johan Tibell [EMAIL PROTECTED] writes: I guess this is where I don't follow: why would you need more short strings for Unicode text than for ASCII or 8-bit latin text? But ByteStrings are neither ASCII nor 8-bit Latin text! [...] The intent of the not-yet-existing Unicode string is to represent text not bytes. Right, so this will replace the .Char8 modules as well? What confused me was my misunderstanding Duncan to mean that Unicode text would somehow imply shorter strings than non-Unicode (i.e. 8-bit) text. To give just one example, short (Unicode) strings are common as keys in associative data structures like maps I guess typically, you'd break things down to words, so strings of lenght 4-10 or so. BS uses three words and LBS four (IIRC), so the cost of sharing typically outweighs the benefit. Can I also here insert a plea for keeping lazy I/O out of the new Unicode module? I use ByteString.Lazy almost exclusively. I realize it there's a penalty in time and space, but the ability to write applications that stream over multi-Gb files is essential. Of course, these applications couldn't care less about Unicode, so perhaps the usage is different. -k -- If I haven't seen further, it is by standing in the footprints of giants ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Copying Arrays
Hi! On Fri, May 30, 2008 at 10:38 AM, Ketil Malde [EMAIL PROTECTED] wrote: Johan Tibell [EMAIL PROTECTED] writes: The intent of the not-yet-existing Unicode string is to represent text not bytes. Right, so this will replace the .Char8 modules as well? What confused me was my misunderstanding Duncan to mean that Unicode text would somehow imply shorter strings than non-Unicode (i.e. 8-bit) text. Yes. To give just one example, short (Unicode) strings are common as keys in associative data structures like maps I guess typically, you'd break things down to words, so strings of lenght 4-10 or so. BS uses three words and LBS four (IIRC), so the cost of sharing typically outweighs the benefit. I'm not sure if you would have much sharing in a map as the keys will be unique. Can I also here insert a plea for keeping lazy I/O out of the new Unicode module? I use ByteString.Lazy almost exclusively. I realize it there's a penalty in time and space, but the ability to write applications that stream over multi-Gb files is essential. Lazy I/O comes with a penalty in terms of correctness! Pretending that I/O and the underlying resource allocations (e.g. file handles) aren't observable is bad. Lazy I/O is kinda, maybe usable for small scripts that reads a file or two an spits out a result but for servers it doesn't work at all. Lazy I/O requires unsafe* functions and is therefore, unsafe. The finalizers required can be arbitrary complex depending on what kind of resources need to be allocated. The simple case is a file handle but there's no reason we might need sockets, locks, etc to create the lazy ByteString. Here are two possible interfaces for safe I/O. One isstream based one with explicit close and the other fold based one (i.e. inversion of control): import qualified Data.ByteString as S -- Stream based I/O. class InputStream s where read :: s - IO Word8 readN :: s - Int - IO S.ByteString -- efficient block reads close :: s - IO () openBinaryFile :: InputStream s = FilePath - IO s or a left fold over the file's content. The 'foldBytes' function can close the file at EOF. -- Left fold/callback based I/O. foldBytes :: FilePath - (seed - Word8 - Either seed seed) - seed - IO seed -- Efficient block reads. foldChunks :: FilePath - (seed - S.ByteString - Either seed seed) - seed - IO seed on top of this you might want monadic versions of the above two functions. The case for a Unicode type are analogous. Of course, these applications couldn't care less about Unicode, so perhaps the usage is different. The issue of lazy I/O is orthogonal to ByteString vs Unicode(String). Cheers, Johan ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Copying Arrays
On Fri, 2008-05-30 at 10:38 +0200, Ketil Malde wrote: Johan Tibell [EMAIL PROTECTED] writes: I guess this is where I don't follow: why would you need more short strings for Unicode text than for ASCII or 8-bit latin text? But ByteStrings are neither ASCII nor 8-bit Latin text! [...] The intent of the not-yet-existing Unicode string is to represent text not bytes. Right, so this will replace the .Char8 modules as well? No, there is still a use-case for that, namely all these network protocols and file formats that mix binary and ASCII data. The only bad thing about the .Char8 modules is how people have picked them up to represent text exclusively which is a step back from [Char] in terms of Unicode stuff. What confused me was my misunderstanding Duncan to mean that Unicode text would somehow imply shorter strings than non-Unicode (i.e. 8-bit) text. ByteString is already not a good choice for short ASCII text. Duncan ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Copying Arrays
it makes me wonder: can we support concatenation with sharing (e.g. the rope data structure) ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Copying Arrays
Johan Tibell [EMAIL PROTECTED] writes: Lazy I/O comes with a penalty in terms of correctness! Is there a page describing this in more detail? I believe my programs to be correct, but would like to know if there are additional pitfalls, beyond hClosing a handle with outstanding (unevaluated) reads. Lazy I/O is kinda, maybe usable for small scripts that reads a file or two and spits out a result but for servers it doesn't work at all. So don't use it for servers, then? Here are two possible interfaces for safe I/O. One isstream based one with explicit close and the other fold based one (i.e. inversion of control): Say I have two files with variable-size records, that need to be merged. In other words, I can do something like: do xs - readFile X ys - readFile Y return $ zipWith combine (parseX xs) (parseY ys) Clearly in the small script category and not servers, but is it still unsafe? -- Stream based I/O. -- Left fold/callback based I/O. Perhaps I don't quite understand the concepts, but I don't see how to implement the example as elegantly and simply with these approaches? -k -- If I haven't seen further, it is by standing in the footprints of giants ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Copying Arrays
Ketil Malde wrote: I guess this is where I don't follow: why would you need more short strings for Unicode text than for ASCII or 8-bit latin text? Because ByteString is optimised for dealing with big blobs of binary data, its performance when you split a big ByteString into a pile of words is quite poor. Also, people use ByteStrings for dealing with text mostly because their native character sets let them get away with it, not because it's an intrinsically good idea. b ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
[Haskell-cafe] Copying Arrays
I'm trying to implement some file I/O where I can read in a file to an array, but do so without having to know how much space I will need. (Before you suggest it, lists are too slow/space consuming.) I was thinking that one possible solution is to use a strategy used in dynamic arrays, where everytime you run out of space you double the capacity of the array. Trouble is, most arrays in Haskell would require you to traverse the entire array and copy each individual cell, which would make it worthless. Are there any low-level array types (MutableByteArray#, for example) that support a memcpy-like function where I could copy the entire array into the new one instead of copying it one value at a time? Is there another solution that I'm missing? -- Tom Harper MSc Computer Science '08 University of Oxford Mobile: +44 (0)7533 998 591 Skype: +1 949 273 4627 (harpertom) ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Copying Arrays
Define too slow, time-consuming? I've dealt with this problem a lot at this point, and I've been able to slurp up CSV files of several gigabytes in the same amount of time as I generally do in C. I have a feeling an array solution is just going to bog you down more, however if you insist, I would also suggest writing your I/O in C and returning a ForeignPtr to something and work from there. -- Jeff On Thu, May 29, 2008 at 12:21 PM, Tom Harper [EMAIL PROTECTED] wrote: I'm trying to implement some file I/O where I can read in a file to an array, but do so without having to know how much space I will need. (Before you suggest it, lists are too slow/space consuming.) I was thinking that one possible solution is to use a strategy used in dynamic arrays, where everytime you run out of space you double the capacity of the array. Trouble is, most arrays in Haskell would require you to traverse the entire array and copy each individual cell, which would make it worthless. Are there any low-level array types (MutableByteArray#, for example) that support a memcpy-like function where I could copy the entire array into the new one instead of copying it one value at a time? Is there another solution that I'm missing? -- Tom Harper MSc Computer Science '08 University of Oxford Mobile: +44 (0)7533 998 591 Skype: +1 949 273 4627 (harpertom) ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe -- I try to take things like a crow; war and chaos don't always ruin a picnic, they just mean you have to be careful what you swallow. -- Jessica Edwards ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Copying Arrays
Hello Tom, Thursday, May 29, 2008, 8:21:25 PM, you wrote: Are there any low-level array types (MutableByteArray#, for example) that support a memcpy-like function yes, MBA# allows to use memcpy. you can find examples of its usage right in the array package, for example: thawSTUArray :: Ix i = UArray i e - ST s (STUArray s i e) thawSTUArray (UArray l u arr#) = ST $ \s1# - case sizeofByteArray# arr# of { n# - case newByteArray# n# s1# of { (# s2#, marr# #) - case unsafeCoerce# memcpy marr# arr# n# s2# of { (# s3#, () #) - (# s3#, STUArray l u marr# #) }}} foreign import ccall unsafe memcpy memcpy :: MutableByteArray# RealWorld - ByteArray# - Int# - IO () i can give you further explanations if required -- Best regards, Bulatmailto:[EMAIL PROTECTED] ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Copying Arrays
Isn't fast IO what ByteStrings where invented for? Adrian Tom Harper schrieb: I'm trying to implement some file I/O where I can read in a file to an array, but do so without having to know how much space I will need. (Before you suggest it, lists are too slow/space consuming.) I was thinking that one possible solution is to use a strategy used in dynamic arrays, where everytime you run out of space you double the capacity of the array. Trouble is, most arrays in Haskell would require you to traverse the entire array and copy each individual cell, which would make it worthless. Are there any low-level array types (MutableByteArray#, for example) that support a memcpy-like function where I could copy the entire array into the new one instead of copying it one value at a time? Is there another solution that I'm missing? signature.asc Description: OpenPGP digital signature ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Copying Arrays
Exactly. Someone on the list gave me this example awhile back for reading CSV files. I can process a gigabyte (simple unpack and print to dev/null for IO testing purposes) in about two and a half seconds using this code. import Data.ByteString.Lazy.Char8 as C -- | Read a datafile and turn it into lists of columns readDatafileAndTranspose name = do sheet - (transpose . map (C.split '\t') . C.lines) `fmap` C.readFile name return $ foldl' go M.empty sheet where go m (x:xs) = M.insert (C.unpack x) xs m 2008/5/29 Adrian Neumann [EMAIL PROTECTED]: Isn't fast IO what ByteStrings where invented for? Adrian Tom Harper schrieb: I'm trying to implement some file I/O where I can read in a file to an array, but do so without having to know how much space I will need. (Before you suggest it, lists are too slow/space consuming.) I was thinking that one possible solution is to use a strategy used in dynamic arrays, where everytime you run out of space you double the capacity of the array. Trouble is, most arrays in Haskell would require you to traverse the entire array and copy each individual cell, which would make it worthless. Are there any low-level array types (MutableByteArray#, for example) that support a memcpy-like function where I could copy the entire array into the new one instead of copying it one value at a time? Is there another solution that I'm missing? ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe -- I try to take things like a crow; war and chaos don't always ruin a picnic, they just mean you have to be careful what you swallow. -- Jessica Edwards ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Copying Arrays
Tom Harper wrote: I'm trying to implement some file I/O where I can read in a file to an array, but do so without having to know how much space I will need. Why not just read it into a lazy ByteString? Are you looking to use an array with elements of a different type? You could then convert it to a strict ByteString. b ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Copying Arrays
Why not just read it into a lazy ByteString? Are you looking to use an array with elements of a different type? You could then convert it to a strict ByteString. b Because I'm writing the Unicode-friendly ByteString =p -- Tom Harper MSc Computer Science '08 University of Oxford Mobile: +44 (0)7533 998 591 Skype: +1 949 273 4627 (harpertom) ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Copying Arrays
Tom Harper wrote: Because I'm writing the Unicode-friendly ByteString =p Perhaps I'm not understanding. Why wouldn't you use ByteString for I/O, even if you're writing a different library? After all, ByteString's own internals use memcpy. b ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Copying Arrays
On Thu 2008-05-29 19:19, Tom Harper wrote: Why not just read it into a lazy ByteString? Are you looking to use an array with elements of a different type? You could then convert it to a strict ByteString. Because I'm writing the Unicode-friendly ByteString =p Uh, ByteString is Unicode-agnostic. ByteString.Char8 is not. So why not do IO with lazy ByteString and parse into your own representation (which might look a lot like StorableVector)? Jed pgpCfBIxeQ4l0.pgp Description: PGP signature ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Copying Arrays
On Thu, 2008-05-29 at 11:25 -0700, Bryan O'Sullivan wrote: Tom Harper wrote: Because I'm writing the Unicode-friendly ByteString =p Perhaps I'm not understanding. Why wouldn't you use ByteString for I/O, even if you're writing a different library? After all, ByteString's own internals use memcpy. Just to clarify Tom's question... He's designing a proper Unicode type along the lines of ByteString. However it does not use a ByteString underneath so we cannot re-use the IO operations from ByteString, though obviously we can steal ideas. The reason we do not want to re-use ByteString as the underlying representation is because they're not good for short strings and we expect that for Unicode text (more than arbitrary blobs of binary data) people will want efficient short strings. So instead of using a ForeignPtr to pinned heap memory (read: slow allocation, lots of fragmentation) we're looking at using unpinned heap arrays. So that means MutableByteArr# (or STUArray for the first prototype). We cannot use memcpy because it operates on raw pointers but that's no good for a movable heap object like a ByteArr#. So that's the background to the question. I think the answer is either that there's a primitive to do this copying in the GHC.* libs or we'll need to add one. Duncan ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re[2]: [Haskell-cafe] Copying Arrays
Hello Duncan, Thursday, May 29, 2008, 10:57:02 PM, you wrote: We cannot use memcpy because it operates on raw pointers but that's no good for a movable heap object like a ByteArr#. it's false assumption - memcpy is used for copying non-pinned arrays as in the code from Data.Array.Base i've citated. the idea is that GC shouldn't occur between getting array address and actual memcpy call. you may consult Simon Marlow -- Best regards, Bulatmailto:[EMAIL PROTECTED] ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Copying Arrays
duncan.coutts: On Thu, 2008-05-29 at 11:25 -0700, Bryan O'Sullivan wrote: Tom Harper wrote: Because I'm writing the Unicode-friendly ByteString =p Perhaps I'm not understanding. Why wouldn't you use ByteString for I/O, even if you're writing a different library? After all, ByteString's own internals use memcpy. Just to clarify Tom's question... He's designing a proper Unicode type along the lines of ByteString. However it does not use a ByteString underneath so we cannot re-use the IO operations from ByteString, though obviously we can steal ideas. The reason we do not want to re-use ByteString as the underlying representation is because they're not good for short strings and we expect that for Unicode text (more than arbitrary blobs of binary data) people will want efficient short strings. So instead of using a ForeignPtr to pinned heap memory (read: slow allocation, lots of fragmentation) we're looking at using unpinned heap arrays. So that means MutableByteArr# (or STUArray for the first prototype). We cannot use memcpy because it operates on raw pointers but that's no good for a movable heap object like a ByteArr#. So that's the background to the question. I think the answer is either that there's a primitive to do this copying in the GHC.* libs or we'll need to add one. There are these guys, foreign import ccall unsafe __hscore_memcpy_dst_off memcpy_baoff_ba :: RawBuffer - CInt - RawBuffer - CSize - IO (Ptr ()) foreign import ccall unsafe __hscore_memcpy_dst_off memcpy_baoff_ptr :: RawBuffer - CInt - Ptr a - CSize - IO (Ptr ()) type RawBuffer = MutableByteArray# RealWorld ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe