Re: [Haskell-cafe] file splitter with enumerator package
On Tue, Jul 26, 2011 at 12:19 PM, yi huang yi.codepla...@gmail.com wrote:

> Actually, I'm wondering how to do exception handling and resource cleanup in an iteratee, e.g. your `writer` iteratee. I found it difficult, because iteratees are designed to let the enumerator manage resources.

I've found the answer myself: `catchError` and `tryIO` are for this. Here is some example code: http://hpaste.org/49530#a49565

-- http://www.yi-programmer.com/blog/

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe
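A minimal sketch of how `catchError` and `tryIO` might be combined for cleanup inside an iteratee, assuming the enumerator 0.4.x API. This is not the code from the paste linked above; `writeTo` is a hypothetical name chosen for illustration.

```haskell
import qualified Data.ByteString        as B
import qualified Data.Enumerator        as E
import qualified Data.Enumerator.Binary as EB
import           System.IO (IOMode (WriteMode), hClose, openFile)

-- | Hypothetical helper: write the whole stream to @path@ and close the
-- handle whether the inner iteratee finishes normally or fails.
writeTo :: FilePath -> E.Iteratee B.ByteString IO ()
writeTo path = do
  -- tryIO turns an IO exception into an iteratee error.
  h <- E.tryIO (openFile path WriteMode)
  -- If writing fails, close the handle and re-raise the same error.
  EB.iterHandle h `E.catchError` \err -> do
    E.tryIO (hClose h)
    E.throwError err
  -- Normal path: close the handle once the stream has ended.
  E.tryIO (hClose h)
```

The handler only sees errors raised inside the iteratee itself. If the enumerator feeding it fails, the iteratee is simply never resumed and, as far as this sketch goes, the handle is left for the garbage collector, which is the limitation the original question points at.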
Re: [Haskell-cafe] file splitter with enumerator package
Sorry, I'm only beginning to understand iteratees, but then how do you access each line of text output by the enumeratee lines within an iteratee?

2011/7/24 Felipe Almeida Lessa felipe.le...@gmail.com

> It gets arbitrary blocks of text and outputs lines of text.
> [...]
> Lists of lines of text?
Re: [Haskell-cafe] file splitter with enumerator package
    blah = do
      fp <- openFile file ReadMode
      run_ $ (ET.enumHandle fp $= ET.lines) $$ printChunks True

printChunks is super duper simple:

    printChunks printEmpty = continue loop
      where
        loop (Chunks xs) = do
          let hide = null xs && not printEmpty
          CM.unless hide (liftIO (print xs))
          continue loop
        loop EOF = do
          liftIO (putStrLn "EOF")
          yield () EOF

Just replace print with whatever IO action you wanted to perform.

On Mon, Jul 25, 2011 at 4:31 AM, Yves Parès limestr...@gmail.com wrote:

> Sorry, I'm only beginning to understand iteratees, but then how do you access each line of text output by the enumeratee lines within an iteratee?
Re: [Haskell-cafe] file splitter with enumerator package
Okay, so there, the chunks (xs) will be lines of Text, and not just random blocks.

Isn't there a primitive like printChunks in the enumerator library, or are we forced to handle Chunks and EOF by hand?

2011/7/25 David McBride dmcbr...@neondsl.com

> printChunks is super duper simple:
> [...]
> Just replace print with whatever IO action you wanted to perform.
Re: [Haskell-cafe] file splitter with enumerator package
Well I was going to say:

    import qualified Data.Text.IO as T
    import qualified Data.Enumerator.List as EL
    import qualified Data.Enumerator.Text as ET

    run_ $ (ET.enumHandle fp $= ET.lines) $$ EL.mapM_ T.putStrLn

for example. But it turns out this actually concatenates the lines together and prints one single string at the end.

The reason is that ET.enumHandle already gets lines one by one without you asking, and it doesn't add newlines to the end. So ET.lines looks at each chunk, never sees any newlines, and returns the entire thing concatenated together, figuring that was one entire line.

I'm kind of surprised that enumHandle fetches linewise rather than letting you handle it. But if you were to make your own enumHandle that wasn't linewise, that would work.

On Mon, Jul 25, 2011 at 6:26 AM, Yves Parès limestr...@gmail.com wrote:

> Isn't there a primitive like printChunks in the enumerator library, or are we forced to handle Chunks and EOF by hand?
Re: [Haskell-cafe] file splitter with enumerator package
I just found another solution that seems to work, although I don't fully understand why. In my original function I used EB.take to strictly read in a lazy ByteString and then L.hPut to write it out to a handle. I now use this instead (full code in the annotation here: http://hpaste.org/49366):

    EB.isolate bytes =$ EB.iterHandle handle

It now runs at the same speed but in constant memory, which is exactly what I was looking for. Is it recommended to nest iteratees within iteratees like this? I'm surprised that it worked, but I can't see a cleaner way to do it because of the other parts of the program that complicate matters.

At this point I've achieved my original goals, unusual as they are, but since this has been an interesting learning experience I don't want it to stop there if there are more idiomatic ways to write code with the enumerator package.

On Mon, Jul 25, 2011 at 4:06 AM, David McBride dmcbr...@neondsl.com wrote:

> But if you were to make your own enumHandle that wasn't linewise that would work.
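A possible reading of why the `EB.isolate bytes =$ EB.iterHandle handle` line runs in constant memory: `EB.isolate` forwards at most `bytes` bytes to the inner iteratee one chunk at a time, and `EB.iterHandle` writes each chunk as it arrives, so nothing accumulates. The sketch below assumes the enumerator 0.4.x API; `copyPart` and `firstPart` are hypothetical names, not the code from the paste.

```haskell
import qualified Data.ByteString        as B
import qualified Data.ByteString.Lazy   as L
import qualified Data.Enumerator        as E
import           Data.Enumerator ((=$))
import qualified Data.Enumerator.Binary as EB
import           System.IO (Handle, IOMode (WriteMode), withFile)

-- | Hypothetical helper: bulk-copy @bytes@ bytes to the handle, then
-- finish the current line so the part ends on a newline.
copyPart :: Integer -> Handle -> E.Iteratee B.ByteString IO ()
copyPart bytes h = do
  -- Bulk copy: chunks are written as they arrive, never buffered whole.
  EB.isolate bytes =$ EB.iterHandle h
  -- Remainder of the current line (everything before the next '\n').
  restOfLine <- EB.takeWhile (/= 10)
  E.tryIO (L.hPut h restOfLine)
  -- The newline itself, if the input has not ended.
  newline <- EB.head
  case newline of
    Just nl -> E.tryIO (B.hPut h (B.singleton nl))
    Nothing -> return ()

-- | Usage: copy only the first part of @src@ into @dst@.
firstPart :: Integer -> FilePath -> FilePath -> IO ()
firstPart bytes src dst =
  withFile dst WriteMode $ \h ->
    E.run_ (EB.enumFile src E.$$ copyPart bytes h)
```

Only the tail of one line is ever held in memory, so the cost is bounded by the longest line rather than by the part size.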
Re: [Haskell-cafe] file splitter with enumerator package
I feel like there is a slightly better way to code this, by splitting the part that outputs files from the part that counts and checks for newlines, like so:

    run_ $ (EB.enumFile "file.txt" $= toChunksnl 4096) $$ toFiles filelist

    toFiles [] = error "expected infinite file list"
    toFiles (f:fs) = do
      next <- EL.head
      case next of
        Nothing -> return ()
        Just next' -> do
          liftIO $ L.writeFile f next'
          toFiles fs

    toChunksnl n = EL.concatMapAccum (somefunc n) L.empty
      where
        somefunc :: Int -> L.ByteString -> B.ByteString -> (L.ByteString, [L.ByteString])
        somefunc = undefined

Here it has an accumulator that starts empty, gets a new bytestring, then parses the concatenation of those two into as many full chunks that end with a newline as it can and stores those in the second part of the pair. Whatever remains unterminated ends up as the first part.

I tried to write it myself, but I can't seem to hit all the edge cases necessary. It seems like it should be doable for someone who wants to. It would be trivial with strings, but with bytestrings it requires a little elbow grease.

However, as to your question on whether you should use iteratees inside other iteratees: yes, of course. It is all composable.

On Mon, Jul 25, 2011 at 1:38 PM, Eric Rasmussen ericrasmus...@gmail.com wrote:

> Is it recommended to nest iteratees within iteratees like this? I'm surprised that it worked, but I can't see a cleaner way to do it because of the other parts of the program that complicate matters.
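One way the `somefunc` placeholder above could be filled in, as a sketch only. It uses just the bytestring package and follows the signature given in the message; the name `splitAcc` and the exact policy (each emitted piece is at least `n` bytes and is extended to the next newline) are assumptions, not the author's implementation.

```haskell
import qualified Data.ByteString            as B
import qualified Data.ByteString.Lazy       as L
import qualified Data.ByteString.Lazy.Char8 as L8

-- | Hypothetical stand-in for @somefunc@: append the new chunk to the
-- accumulator, emit every complete piece, and keep the rest.
splitAcc :: Int -> L.ByteString -> B.ByteString -> (L.ByteString, [L.ByteString])
splitAcc n acc chunk = go (acc `L.append` L.fromChunks [chunk])
  where
    n' = fromIntegral n
    go buf =
      let (bulk, afterBulk) = L.splitAt n' buf
      in if L.length bulk < n'
           then (buf, [])                     -- fewer than n bytes buffered
           else case L8.elemIndex '\n' afterBulk of
             Nothing -> (buf, [])             -- line not finished yet
             Just i  ->
               let (lineEnd, rest) = L.splitAt (i + 1) afterBulk
                   (acc', out)     = go rest  -- more pieces may be complete
               in (acc', (bulk `L.append` lineEnd) : out)
```

For example, with `n = 4`, an empty accumulator and the chunk `"ab\ncd\nef\ngh"`, this emits `["ab\ncd\n"]` and keeps `"ef\ngh"`. Whatever is still in the accumulator when the input ends has to be flushed separately; how `EL.concatMapAccum` treats its final state is not checked here.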
Re: [Haskell-cafe] file splitter with enumerator package
Actually, I'm wondering how to do exception handling and resource cleanup in an iteratee, e.g. your `writer` iteratee. I found it difficult, because iteratees are designed to let the enumerator manage resources.

On Sat, Jul 23, 2011 at 2:41 AM, Eric Rasmussen ericrasmus...@gmail.com wrote:

> If anyone has time, I'm really interested in knowing if there's a better way to take the incoming stream and output it directly to a file.
> [...]
> The full code is pasted here: http://hpaste.org/49366

-- http://www.yi-programmer.com/blog/
Re: [Haskell-cafe] file splitter with enumerator package
If you used Data.Enumerator.Text, you would maybe benefit from the lines function:

    lines :: Monad m => Enumeratee Text Text m b

But there is something I don't get with that signature: why isn't it:

    lines :: Monad m => Enumeratee Text [Text] m b

??

2011/7/23 Eric Rasmussen ericrasmus...@gmail.com

> I'm intrigued by the idea of making the bulk copy function with EB.isolate and EB.iterHandle, but I couldn't find a way to fit these into the larger context of writing to multiple file handles. I'll keep working on it and see if I can address the concerns you brought up.
Re: [Haskell-cafe] file splitter with enumerator package
On Sun, Jul 24, 2011 at 12:28 PM, Yves Parès limestr...@gmail.com wrote:

> If you used Data.Enumerator.Text, you would maybe benefit the lines function:
>
>     lines :: Monad m => Enumeratee Text Text m b

It gets arbitrary blocks of text and outputs lines of text.

> But there is something I don't get with that signature: why isn't it:
>
>     lines :: Monad m => Enumeratee Text [Text] m b ??

Lists of lines of text?

Cheers, =)

-- Felipe.
Re: [Haskell-cafe] file splitter with enumerator package
Since the program only needs to finish a line after it's made a bulk copy of a potentially large chunk of a file (could be 25-500 MB), I was hoping to find a way to copy the large chunk in constant memory and without inspecting the individual bytes/characters. I'm still having some difficulty with this part if anyone has suggestions.

Thanks again,
Eric

On Sun, Jul 24, 2011 at 10:34 AM, Felipe Almeida Lessa felipe.le...@gmail.com wrote:

> It gets arbitrary blocks of text and outputs lines of text.
[Haskell-cafe] file splitter with enumerator package
Hi everyone,

A friend of mine recently asked if I knew of a utility to split a large file (4gb in his case) into arbitrarily-sized files on Windows. Although there are a number of file-splitting utilities, the catch was it couldn't break in the middle of a line. When the standard "why don't you use Linux?" response proved unhelpful, I took this as an opportunity to write my first program using the enumerator package.

If anyone has time, I'm really interested in knowing if there's a better way to take the incoming stream and output it directly to a file. The basic steps I'm taking are:

1) Data.Enumerator.Binary.take -- grabs the user-specified number of bytes, then (because it returns a lazy ByteString) I use Data.ByteString.Lazy.hPut to output the chunk

2) Data.Enumerator.Binary.head -- after using take for the big chunk, it inspects and outputs individual characters and stops after it outputs the next newline character

3) I close the handle that steps 1 and 2 used to output the data and then repeat 1 and 2 with the next handle (an infinite lazy list of filepaths like part1.csv, part2.csv, and so on)

The full code is pasted here: http://hpaste.org/49366, and while I'd like to get any other feedback on how to make it better, I want to note that I'm not planning to release this as a utility so I wouldn't want anyone to spend extra time performing a full code review.

Thanks!
Eric
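The three steps above can be modelled as a pure function over a lazy ByteString, which makes the splitting rule easy to test in isolation. This is a sketch using only the bytestring package, not the enumerator-based code from the paste; `splitOnLines` and `splitFile` are hypothetical names.

```haskell
import           Data.Int (Int64)
import qualified Data.ByteString.Lazy       as L
import qualified Data.ByteString.Lazy.Char8 as L8

-- | Split the input into pieces of at least @n@ bytes, extending each
-- piece up to and including the next newline so no line is cut in two.
splitOnLines :: Int64 -> L.ByteString -> [L.ByteString]
splitOnLines n input
  | L.null input = []
  | otherwise    = piece : splitOnLines n rest
  where
    -- Step 1: the big chunk of n bytes.
    (bulk, afterBulk) = L.splitAt n input
    -- Step 2: finish the current line, or take what is left at the end.
    (tailOfLine, rest) = case L8.elemIndex '\n' afterBulk of
      Just i  -> L.splitAt (i + 1) afterBulk
      Nothing -> (afterBulk, L.empty)
    piece = bulk `L.append` tailOfLine

-- | Step 3: write each piece to the next name in the (infinite) list.
splitFile :: Int64 -> FilePath -> [FilePath] -> IO ()
splitFile n src names = do
  input <- L.readFile src
  mapM_ (uncurry L.writeFile) (zip names (splitOnLines n input))
```

For instance, `splitOnLines 4` applied to `"ab\ncd\nef\ngh"` yields `["ab\ncd\n", "ef\ngh"]`. Because `splitFile` relies on lazy I/O, its memory use depends on how the runtime evaluates and collects the input, so it offers no constant-memory guarantee.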
Re: [Haskell-cafe] file splitter with enumerator package
There is one problem with your algorithm. If the user asks for 4 GiB, then the program will create files with *at least* 4 GiB. So the user would need to ask for less, maybe 3.9 GiB. Even so there's some danger, because there could be a 0.11 GiB line in the file.

Now, the biggest problem: your code won't run in constant memory. 'EB.take' does not lazily return a lazy ByteString. It strictly returns a lazy ByteString [1]. The lazy ByteString is used to avoid copying data (as it is basically the same as a linked list of strict bytestrings). So if the user asked for 4 GiB files, this program would need at least 4 GiB of memory, probably more due to overheads.

If you want to use lazy lazy ByteStrings (lazy ByteStrings with lazy I/O, as opposed to lazy ByteStrings with strict I/O), the enumerator package doesn't really buy you anything. You should just use the bytestring package's lazy I/O functions.

If you want the guarantee of no leaks that enumerator gives, then you have to use another way of constructing your program. One safe way of doing it is something like:

    takeNextLine :: Monad m => E.Iteratee B.ByteString m (Maybe L.ByteString)
    takeNextLine = ...

    go :: MonadIO m => Handle -> Int64 -> E.Iteratee B.ByteString m (Maybe L.ByteString)
    go h n = do
      mline <- takeNextLine
      case mline of
        Nothing -> return Nothing
        Just line
          | L.length line <= n -> do
              liftIO (L.hPut h line)
              go h (n - L.length line)
          | otherwise -> return mline

So 'go h n' is the iteratee that saves at most 'n' bytes in handle 'h' and returns the leftover data. The driver code needs to check its results. Case 'Nothing', then the program finishes. Case 'Just line', save line on a new file and call 'go h2 (n - L.length line)'.

It isn't efficient because lines could be small, resulting in many small hPuts (bad). But it is correct and will never use more than 'n' bytes (great).

You could also have some compromise where the user says that he'll never have lines longer than 'x' bytes (say, 1 MiB). Then you call a bulk copy function for 'n - x' bytes, and then call 'go h x'. I think you can make the bulk copy function with EB.isolate and EB.iterHandle.

Cheers, =)

[1] http://hackage.haskell.org/packages/archive/enumerator/0.4.13.1/doc/html/src/Data-Enumerator-Binary.html#take

-- Felipe.
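The packing rule behind 'go' (keep taking whole lines while they fit in the remaining budget, then hand the leftover line to the next file) can be checked as a pure function, independent of any iteratee machinery. This is a sketch using only the bytestring package; `linesNL`, `fillFile` and `packLines` are hypothetical names, and treating an oversized line as a file of its own is an assumption.

```haskell
import           Data.Int (Int64)
import qualified Data.ByteString.Lazy       as L
import qualified Data.ByteString.Lazy.Char8 as L8

-- | Split into lines, keeping each trailing newline.
linesNL :: L.ByteString -> [L.ByteString]
linesNL s
  | L.null s  = []
  | otherwise = case L8.elemIndex '\n' s of
      Just i  -> let (l, r) = L.splitAt (i + 1) s in l : linesNL r
      Nothing -> [s]

-- | Take whole lines while they fit into a budget of @n@ bytes.
-- Returns the lines for the current file and the lines left over.
fillFile :: Int64 -> [L.ByteString] -> ([L.ByteString], [L.ByteString])
fillFile _ [] = ([], [])
fillFile n (line : rest)
  | len <= n  = let (kept, left) = fillFile (n - len) rest
                in (line : kept, left)
  | otherwise = ([], line : rest)
  where len = L.length line

-- | Group lines into files of at most @n@ bytes each. A single line
-- longer than @n@ gets a file of its own (assumed policy).
packLines :: Int64 -> [L.ByteString] -> [[L.ByteString]]
packLines _ [] = []
packLines n ls = case fillFile n ls of
  ([], big : rest) -> [big] : packLines n rest
  (kept, left)     -> kept : packLines n left
```

With a budget of 6 bytes, the lines of `"ab\ncd\nefgh\ni\n"` are grouped as `[["ab\n", "cd\n"], ["efgh\n"], ["i\n"]]`.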
Re: [Haskell-cafe] file splitter with enumerator package
Hi Felipe,

Thank you for the very detailed explanation and help. Regarding the first point, for this particular use case it's fine if the user-specified file size is extended by the length of a partial line (it's a compact csv file so if the user breaks a big file into 100mb chunks, each chunk would only ever be about 100mb + up to 80 bytes, which is fine for the user).

I'm intrigued by the idea of making the bulk copy function with EB.isolate and EB.iterHandle, but I couldn't find a way to fit these into the larger context of writing to multiple file handles. I'll keep working on it and see if I can address the concerns you brought up.

Thanks again!
Eric

On Fri, Jul 22, 2011 at 6:00 PM, Felipe Almeida Lessa felipe.le...@gmail.com wrote:

> There is one problem with your algorithm. If the user asks for 4 GiB, then the program will create files with *at least* 4 GiB.
> [...]
> I think you can make the bulk copy function with EB.isolate and EB.iterHandle.