Re: [Haskell-cafe] file splitter with enumerator package

2011-07-26 Thread yi huang
On Tue, Jul 26, 2011 at 12:19 PM, yi huang yi.codepla...@gmail.com wrote:

 Actually, I'm wondering how to do exception handling and resource cleanup
 in an iteratee, e.g. your `writer` iteratee. I found it difficult, because
 iteratees are designed to let the enumerator manage resources.


I've found the answer myself: `catchError` and `tryIO` are for this.
Here is some example code: http://hpaste.org/49530#a49565




-- 
http://www.yi-programmer.com/blog/
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] file splitter with enumerator package

2011-07-25 Thread Yves Parès
Sorry, I'm only beginning to understand iteratees, but then how do you
access each line of text output by the `lines` enumeratee from within an
iteratee?



Re: [Haskell-cafe] file splitter with enumerator package

2011-07-25 Thread David McBride
blah = do
  fp <- openFile file ReadMode
  run_ $ (ET.enumHandle fp $= ET.lines) $$ printChunks True

printChunks is super duper simple:

printChunks printEmpty = continue loop where
    loop (Chunks xs) = do
        let hide = null xs && not printEmpty
        CM.unless hide (liftIO (print xs))
        continue loop

    loop EOF = do
        liftIO (putStrLn "EOF")
        yield () EOF

Just replace print with whatever IO action you wanted to perform.



Re: [Haskell-cafe] file splitter with enumerator package

2011-07-25 Thread Yves Parès
Okay, so there, the chunks (xs) will be lines of Text, and not just random
blocks.
Isn't there a primitive like printChunks in the enumerator library, or are
we forced to handle Chunks and EOF by hand?



Re: [Haskell-cafe] file splitter with enumerator package

2011-07-25 Thread David McBride
Well I was going to say:

import Data.Text.IO as T
import Data.Enumerator.List as EL
import Data.Enumerator.Text as ET

run_ $ (ET.enumHandle fp $= ET.lines) $$ EL.mapM_ T.putStrLn

for example.  But it turns out this actually concatenates the lines
together and prints one single string at the end.  The reason is that
ET.enumHandle already reads the handle line by line without being asked,
and the chunks it produces don't include the trailing newlines.  So
ET.lines looks at each chunk, never sees a newline, and returns the
entire thing concatenated together as if it were a single line.
I'm kind of surprised that enumHandle fetches linewise rather than
letting you handle it.

But if you were to make your own enumHandle that wasn't linewise, that
would work.



Re: [Haskell-cafe] file splitter with enumerator package

2011-07-25 Thread Eric Rasmussen
I just found another solution that seems to work, although I don't
fully understand why. In my original function where I used EB.take to
strictly read in a Lazy ByteString and then L.hPut to write it out to
a handle, I now use this instead (full code in the annotation here:
http://hpaste.org/49366):

EB.isolate bytes =$ EB.iterHandle handle

It now runs at the same speed but in constant memory, which is exactly
what I was looking for. Is it recommended to nest iteratees within
iteratees like this? I'm surprised that it worked, but I can't see a
cleaner way to do it because of the other parts of the program that
complicate matters. At this point I've achieved my original goals,
unusual as they are, but since this has been an interesting learning
experience I don't want it to stop there if there are more idiomatic
ways to write code with the enumerator package.



Re: [Haskell-cafe] file splitter with enumerator package

2011-07-25 Thread David McBride
I feel like there is a little bit better way to code this by splitting
the file outputting part from the part that counts and checks for
newlines like so:

run_ $ (EB.enumFile "file.txt" $= toChunksnl 4096) $$ toFiles filelist

toFiles [] = error "expected infinite file list"
toFiles (f:fs) = do
  next <- EL.head
  case next of
    Nothing -> return ()
    Just next' -> do
      liftIO $ L.writeFile f next'
      toFiles fs

toChunksnl n = EL.concatMapAccum (somefunc n) L.empty
  where
    somefunc :: Int -> L.ByteString -> B.ByteString -> (L.ByteString, [L.ByteString])
    somefunc = undefined

The idea is that it has an accumulator that starts empty.  Each time it
gets a new bytestring, it parses the concatenation of the accumulator and
the new data into as many full chunks ending with a newline as it can.
Those go in the second part of the pair, and whatever remains
unterminated ends up as the first part (the new accumulator).  I tried to
write it myself, but I can't seem to hit all the edge cases necessary.
It seems like it should be doable for someone who wants to.  It would be
trivial with strings, but with bytestrings it requires a little elbow
grease.

However, as to your question on whether you should use iteratees inside
other iteratees: yes, of course.  It is all composable.
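
For what it's worth, here is a minimal sketch of what the step function left as `undefined` above might look like, using only the bytestring package. It is one possible reading of the description, not David's code: the name `chunkStep` is made up, and it takes an Int64 instead of an Int so that it lines up with L.length. Each emitted piece is cut at the first newline found after byte n, so pieces are at least n bytes long plus the rest of a line. Whatever is still pending when the input ends would have to be written out separately.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Lazy as L
import Data.Int (Int64)

-- | Hypothetical step function for an accumulating enumeratee.
-- Appends the new strict chunk to the pending data, then peels off as many
-- pieces as possible. A piece is the first n bytes plus everything up to and
-- including the next newline. Data that cannot yet form a complete piece is
-- returned as the new accumulator.
chunkStep :: Int64 -> L.ByteString -> B.ByteString -> (L.ByteString, [L.ByteString])
chunkStep n acc chunk = go (acc `L.append` L.fromChunks [chunk])
  where
    go pending
      | L.length pending < n = (pending, [])        -- not enough data yet
      | otherwise =
          let (bulk, rest) = L.splitAt n pending
          in case L.elemIndex 10 rest of             -- 10 is the byte for '\n'
               Nothing -> (pending, [])              -- line not finished yet
               Just i  ->
                 let (lineEnd, rest') = L.splitAt (i + 1) rest
                     (acc', pieces)   = go rest'
                 in (acc', L.append bulk lineEnd : pieces)
```

Recomputing L.length on every call is wasteful for large accumulators; a real version would probably track the pending length alongside the data.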


Re: [Haskell-cafe] file splitter with enumerator package

2011-07-25 Thread yi huang
Actually, I'm wondering how to do exception handling and resource cleanup in
an iteratee, e.g. your `writer` iteratee. I found it difficult, because
iteratees are designed to let the enumerator manage resources.


-- 
http://www.yi-programmer.com/blog/


Re: [Haskell-cafe] file splitter with enumerator package

2011-07-24 Thread Yves Parès
If you used Data.Enumerator.Text, you might benefit from the lines
function:

lines :: Monad m => Enumeratee Text Text m b

But there is something I don't get with that signature:
why isn't it:
lines :: Monad m => Enumeratee Text [Text] m b
??




Re: [Haskell-cafe] file splitter with enumerator package

2011-07-24 Thread Felipe Almeida Lessa
On Sun, Jul 24, 2011 at 12:28 PM, Yves Parès limestr...@gmail.com wrote:
 If you used Data.Enumerator.Text, you might benefit from the lines
 function:

 lines :: Monad m => Enumeratee Text Text m b

It gets arbitrary blocks of text and outputs lines of text.

 But there is something I don't get with that signature:
 why isn't it:
 lines :: Monad m => Enumeratee Text [Text] m b
 ??

Lists of lines of text?

Cheers, =)

-- 
Felipe.



Re: [Haskell-cafe] file splitter with enumerator package

2011-07-24 Thread Eric Rasmussen
Since the program only needs to finish a line after it's made a bulk
copy of a potentially large chunk of a file (could be 25-500 MB), I
was hoping to find a way to copy the large chunk in constant memory
and without inspecting the individual bytes/characters. I'm still
having some difficulty with this part, if anyone has suggestions.

Thanks again,
Eric




[Haskell-cafe] file splitter with enumerator package

2011-07-22 Thread Eric Rasmussen
Hi everyone,

A friend of mine recently asked if I knew of a utility to split a
large file (4 GB in his case) into arbitrarily-sized files on Windows.
Although there are a number of file-splitting utilities, the catch was
it couldn't break in the middle of a line. When the standard "why
don't you use Linux?" response proved unhelpful, I took this as an
opportunity to write my first program using the enumerator package.

If anyone has time, I'm really interested in knowing if there's a
better way to take the incoming stream and output it directly to a
file. The basic steps I'm taking are:

1) Data.Enumerator.Binary.take -- grabs the user-specified number of
bytes, then (because it returns a lazy ByteString) I use
Data.ByteString.Lazy.hPut to output the chunk
2) Data.Enumerator.Binary.head -- after using take for the big chunk,
it inspects and outputs individual bytes and stops after it
outputs the next newline character
3) I close the handle that steps 1 and 2 used to output the data and then
repeat 1 and 2 with the next handle (an infinite lazy list of filepaths
like part1.csv, part2.csv, and so on)

The full code is pasted here: http://hpaste.org/49366, and while I'd
like to get any other feedback on how to make it better, I want to
note that I'm not planning to release this as a utility so I wouldn't
want anyone to spend extra time performing a full code review.

Thanks!
Eric



Re: [Haskell-cafe] file splitter with enumerator package

2011-07-22 Thread Felipe Almeida Lessa
There is one problem with your algorithm.  If the user asks for 4 GiB,
then the program will create files with *at least* 4 GiB.  So the user
would need to ask for less, maybe 3.9 GiB.  Even so there's some
danger, because there could be a 0.11 GiB line on the file.

Now, the biggest problem: your code won't run in constant memory.
'EB.take' does not lazily return a lazy ByteString.  It strictly
returns a lazy ByteString [1].  The lazy ByteString is used to avoid
copying data (as it is basically the same as a linked list of strict
bytestrings).  So if the user asked for 4 GiB files, this program
would need at least 4 GiB of memory, probably more due to overheads.

If you want to use "lazy" lazy ByteStrings (lazy ByteStrings with lazy
I/O, as opposed to lazy ByteStrings with strict I/O), the enumerator
package doesn't really buy you anything.  You should just use the
bytestring package's lazy I/O functions.
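
To make the lazy I/O route concrete, here is a rough sketch using only the bytestring package. The names `breakAfterNewline`, `splitParts` and `splitFile` are invented for illustration and are not from the pasted program. It follows the "take n bytes, then finish the current line" approach from the original post, so each part can exceed n by up to one line. No claim is made here about memory behaviour; that depends on how lazily the input ends up being consumed and would need to be measured.

```haskell
import Control.Monad (zipWithM_)
import qualified Data.ByteString.Lazy as L
import Data.Int (Int64)

-- | Split just after the first newline (byte 10). If there is no newline,
-- everything goes into the first component.
breakAfterNewline :: L.ByteString -> (L.ByteString, L.ByteString)
breakAfterNewline bs = case L.elemIndex 10 bs of
  Nothing -> (bs, L.empty)
  Just i  -> L.splitAt (i + 1) bs

-- | Cut the input into parts of n bytes each, extending every part to the
-- end of the line it stops in. Concatenating the parts gives the input back.
splitParts :: Int64 -> L.ByteString -> [L.ByteString]
splitParts n input
  | L.null input = []
  | otherwise    = L.append bulk lineEnd : splitParts n rest'
  where
    (bulk, rest)     = L.splitAt n input
    (lineEnd, rest') = breakAfterNewline rest

-- | Read the source lazily and write one part per output name.
-- If the list of names runs out first, the remaining data is silently
-- dropped, so an infinite list of names is expected.
splitFile :: Int64 -> FilePath -> [FilePath] -> IO ()
splitFile n src names = do
  input <- L.readFile src
  zipWithM_ L.writeFile names (splitParts n input)
```

Something like `splitFile (100 * 1024 * 1024) "big.csv" ["part" ++ show i ++ ".csv" | i <- [1 :: Int ..]]` would then produce part1.csv, part2.csv, and so on (the input file name is a placeholder).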

If you want the guarantee of no leaks that enumerator gives, then you
have to use another way of constructing your program.  One safe way of
doing it is something like:

  takeNextLine :: Monad m => E.Iteratee B.ByteString m (Maybe L.ByteString)
  takeNextLine = ...

  go :: MonadIO m => Handle -> Int64 -> E.Iteratee B.ByteString m (Maybe L.ByteString)
  go h n = do
    mline <- takeNextLine
    case mline of
      Nothing -> return Nothing
      Just line
        | L.length line <= n -> liftIO (L.hPut h line) >> go h (n - L.length line)
        | otherwise -> return mline

So 'go h n' is the iteratee that saves at most 'n' bytes in handle 'h'
and returns the leftover data.  The driver code needs to check its
results.  Case 'Nothing', then the program finishes.  Case 'Just
line', save line on a new file and call 'go h2 (n - L.length line)'.
It isn't efficient because lines could be small, resulting in many
small hPuts (bad).  But it is correct and will never use more than 'n'
bytes (great).  You could also have some compromise where the user
says that he'll never have lines longer than 'x' bytes (say, 1 MiB).
Then you call a bulk copy function for 'n - x' bytes, and then call
'go h x'.  I think you can make the bulk copy function with EB.isolate
and EB.iterHandle.
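
As an illustration of the line-by-line policy described above, here is a small pure model of it: given the lines of the input, it decides which lines land in which output file. It is not enumerator code, and `packLines` is a made-up name. Like the driver described above, it always puts the leftover line first in the next file, so a single line longer than n still gets a file of its own.

```haskell
import qualified Data.ByteString.Lazy.Char8 as L
import Data.Int (Int64)

-- | Group lines into files of at most n bytes each. The first line of a
-- file is always accepted, even if it alone is longer than n.
packLines :: Int64 -> [L.ByteString] -> [[L.ByteString]]
packLines _ []       = []
packLines n (l : ls) = (l : taken) : packLines n rest
  where
    (taken, rest) = fill (n - L.length l) ls

    -- Keep taking lines while they still fit in the remaining budget.
    fill _ [] = ([], [])
    fill budget (x : xs)
      | L.length x <= budget =
          let (t, r) = fill (budget - L.length x) xs
          in (x : t, r)
      | otherwise = ([], x : xs)
```

With a budget of 6 bytes, the lines "ab\n", "cd\n", "efgh\n", "i\n" are grouped as ["ab\n", "cd\n"], ["efgh\n"], ["i\n"].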

Cheers, =)

[1] 
http://hackage.haskell.org/packages/archive/enumerator/0.4.13.1/doc/html/src/Data-Enumerator-Binary.html#take

-- 
Felipe.



Re: [Haskell-cafe] file splitter with enumerator package

2011-07-22 Thread Eric Rasmussen
Hi Felipe,

Thank you for the very detailed explanation and help. Regarding the first
point, for this particular use case it's fine if the user-specified file
size is extended by the length of a partial line (it's a compact csv file so
if the user breaks a big file into 100mb chunks, each chunk would only ever
be about 100mb + up to 80 bytes, which is fine for the user).

I'm intrigued by the idea of making the bulk copy function with EB.isolate
and EB.iterHandle, but I couldn't find a way to fit these into the larger
context of writing to multiple file handles. I'll keep working on it and see
if I can address the concerns you brought up.

Thanks again!
Eric



