If you used Data.Enumerator.Text, you might benefit from the "lines" function:
    lines :: Monad m => Enumeratee Text Text m b

But there is something I don't get with that signature: why isn't it

    lines :: Monad m => Enumeratee Text [Text] m b

instead? (Some usage sketches follow after the quoted thread below.)

2011/7/23 Eric Rasmussen <[email protected]>

> Hi Felipe,
>
> Thank you for the very detailed explanation and help. Regarding the first
> point, for this particular use case it's fine if the user-specified file
> size is extended by the length of a partial line (it's a compact csv file,
> so if the user breaks a big file into 100mb chunks, each chunk would only
> ever be about 100mb + up to 80 bytes, which is fine for the user).
>
> I'm intrigued by the idea of making the bulk copy function with EB.isolate
> and EB.iterHandle, but I couldn't find a way to fit these into the larger
> context of writing to multiple file handles. I'll keep working on it and
> see if I can address the concerns you brought up.
>
> Thanks again!
> Eric
>
> On Fri, Jul 22, 2011 at 6:00 PM, Felipe Almeida Lessa <
> [email protected]> wrote:
>
>> There is one problem with your algorithm. If the user asks for 4 GiB,
>> then the program will create files with *at least* 4 GiB. So the user
>> would need to ask for less, maybe 3.9 GiB. Even so there's some
>> danger, because there could be a 0.11 GiB line in the file.
>>
>> Now, the biggest problem is that your code won't run in constant memory.
>> 'EB.take' does not lazily return a lazy ByteString; it strictly
>> returns a lazy ByteString [1]. The lazy ByteString is used to avoid
>> copying data (as it is basically the same as a linked list of strict
>> bytestrings). So if the user asked for 4 GiB files, this program
>> would need at least 4 GiB of memory, probably more due to overheads.
>>
>> If you want to use lazy lazy ByteStrings (lazy ByteStrings with lazy
>> I/O, as opposed to lazy ByteStrings with strict I/O), the enumerator
>> package doesn't really buy you anything. You should just use the
>> bytestring package's lazy I/O functions.
>>
>> If you want the guarantee of no leaks that enumerator gives, then you
>> have to use another way of constructing your program. One safe way of
>> doing it is something like:
>>
>> takeNextLine :: E.Iteratee B.ByteString m (Maybe L.ByteString)
>> takeNextLine = ...
>>
>> go :: Monad m => Handle -> Int64
>>    -> E.Iteratee B.ByteString m (Maybe L.ByteString)
>> go h n = do
>>   mline <- takeNextLine
>>   case mline of
>>     Nothing -> return Nothing
>>     Just line
>>       | L.length line <= n -> L.hPut h line >>
>>                               go h (n - L.length line)
>>       | otherwise -> return mline
>>
>> So 'go h n' is the iteratee that saves at most 'n' bytes to handle 'h'
>> and returns the leftover data. The driver code needs to check its
>> result: in the 'Nothing' case the program finishes; in the 'Just line'
>> case, it saves the line to a new file and calls 'go h2
>> (n - L.length line)'. It isn't efficient, because lines could be
>> small, resulting in many small hPuts (bad). But it is correct and
>> will never use more than 'n' bytes (great). You could also have some
>> compromise where the user says that he'll never have lines longer
>> than 'x' bytes (say, 1 MiB). Then you call a bulk copy function for
>> 'n - x' bytes, and then call 'go h x'. I think you can make the bulk
>> copy function with EB.isolate and EB.iterHandle.
>>
>> Cheers, =)
>>
>> [1]
>> http://hackage.haskell.org/packages/archive/enumerator/0.4.13.1/doc/html/src/Data-Enumerator-Binary.html#take
>>
>> --
>> Felipe.
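On the signature question: an Enumeratee's output is itself a stream, so ET.lines simply re-chunks a stream of arbitrary Text pieces into a stream whose elements are whole lines; there is no need to wrap them in [Text]. A minimal sketch of how it composes, assuming enumerator 0.4.x (the chunk contents are made up for illustration, and EL.consume is only used to collect the inner stream for printing):

{-# LANGUAGE OverloadedStrings #-}

import qualified Data.Enumerator      as E
import qualified Data.Enumerator.List as EL
import qualified Data.Enumerator.Text as ET

-- Feed Text chunks whose boundaries do not line up with newlines.
-- ET.lines re-chunks the stream so that each element reaching the
-- inner iteratee is exactly one line.
main :: IO ()
main = do
  ls <- E.run_ (E.enumList 1 ["foo\nba", "r\nbaz\n"]
                  E.$$ E.joinI (ET.lines E.$$ EL.consume))
  mapM_ print ls   -- the three lines: "foo", "bar", "baz"

So the [Text] version would be redundant: the stream structure already provides the "list of lines", and a downstream iteratee sees one line per element.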
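For the file-splitting part of the thread: Felipe's sketch leaves takeNextLine undefined and, as written, needs a liftIO (and a MonadIO constraint instead of Monad) to perform the hPut from inside the Iteratee. Below is one hedged, compilable variant under those assumptions; takeNextLine, splitFiles, and the file names are mine, not Felipe's, and enumerator 0.4.x is assumed:

import           Control.Monad.IO.Class (MonadIO, liftIO)
import qualified Data.ByteString        as B
import qualified Data.ByteString.Lazy   as L
import qualified Data.Enumerator        as E
import qualified Data.Enumerator.Binary as EB
import           Data.Int               (Int64)
import           System.IO              (Handle, IOMode (WriteMode),
                                         hClose, openFile)

-- One possible takeNextLine: read bytes up to (and including) the next
-- '\n', or return Nothing at end of input.  Each line is held in memory,
-- which is fine as long as lines are short.
takeNextLine :: Monad m => E.Iteratee B.ByteString m (Maybe L.ByteString)
takeNextLine = do
  body <- EB.takeWhile (/= 10)      -- 10 == '\n'
  mnl  <- EB.head                   -- consume the newline itself, if any
  case (L.null body, mnl) of
    (True, Nothing) -> return Nothing                    -- end of input
    (_,    Nothing) -> return (Just body)                -- no final newline
    (_,    Just _ ) -> return (Just (body `L.snoc` 10))  -- keep the newline

-- Felipe's 'go', with the liftIO added: write whole lines to 'h' until
-- the next line would exceed the remaining budget 'n', then hand that
-- line back as leftover.
go :: MonadIO m => Handle -> Int64
   -> E.Iteratee B.ByteString m (Maybe L.ByteString)
go h n = do
  mline <- takeNextLine
  case mline of
    Nothing -> return Nothing
    Just line
      | L.length line <= n -> do liftIO (L.hPut h line)
                                 go h (n - L.length line)
      | otherwise -> return mline

-- Driver: open numbered output files and fill each one with 'go'; a
-- line that did not fit becomes the first line of the next file.
splitFiles :: FilePath -> Int64 -> E.Iteratee B.ByteString IO ()
splitFiles base chunkSize = loop (0 :: Int) Nothing
  where
    loop i mcarry = do
      h <- liftIO (openFile (base ++ "." ++ show i) WriteMode)
      budget <- case mcarry of
        Nothing   -> return chunkSize
        Just line -> do liftIO (L.hPut h line)
                        return (chunkSize - L.length line)
      mleft <- go h budget
      liftIO (hClose h)
      case mleft of
        Nothing -> return ()           -- input exhausted
        Just _  -> loop (i + 1) mleft  -- leftover starts the next file

main :: IO ()
main = E.run_ (EB.enumFile "big.csv"
                 E.$$ splitFiles "chunk" (100 * 1024 * 1024))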
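And for Felipe's last suggestion (bulk-copy 'n - x' bytes, then finish with the line-aware iteratee): EB.isolate restricts an inner iteratee to the first so-many bytes of the stream, and EB.iterHandle writes whatever it is fed to a handle, so composing them gives a constant-memory bulk copy. A sketch, again assuming enumerator 0.4.x (bulkCopy is my name for it, not the library's):

import           Control.Monad.IO.Class (MonadIO)
import qualified Data.ByteString        as B
import qualified Data.Enumerator        as E
import qualified Data.Enumerator.Binary as EB
import           Data.Int               (Int64)
import           System.IO              (Handle)

-- Stream at most 'n' bytes from the input straight into 'h', in
-- constant memory and ignoring line boundaries.  Leftover input stays
-- in the stream for whatever iteratee runs next (e.g. Felipe's 'go').
bulkCopy :: MonadIO m => Handle -> Int64 -> E.Iteratee B.ByteString m ()
bulkCopy h n = E.joinI (EB.isolate (fromIntegral n) E.$$ EB.iterHandle h)

With that, filling one output file under the "no line longer than x bytes" assumption is roughly 'bulkCopy h (n - x) >> go h x': copy the first n - x bytes blindly, then let the line-aware iteratee finish the file on a line boundary, which is the compromise Felipe describes.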
