Beginners Digest, Vol 45, Issue 18

beginners-request Wed, 14 Mar 2012 04:00:40 -0700

Send Beginners mailing list submissions to
        [email protected]

To subscribe or unsubscribe via the World Wide Web, visit
        http://www.haskell.org/mailman/listinfo/beginners
or, via email, send a message with subject or body 'help' to
        [email protected]

You can reach the person managing the list at
        [email protected]

When replying, please edit your Subject line so it is more specific
than "Re: Contents of Beginners digest..."

Today's Topics:

   1. Re:  populating a bloom filter; stymied by ST monad (Joey Hess)
   2. Re:  Processing a list of files the Haskell way (Michael Schober)
   3. Re:  populating a bloom filter;   stymied by ST monad (Ozgur Akgun)
   4. Re:  Processing a list of files the Haskell way (Chadda? Fouch?)
   5. Re:  Processing a list of files the Haskell way (Michael Schober)

----------------------------------------------------------------------

Message: 1
Date: Tue, 13 Mar 2012 11:16:12 -0400
From: Joey Hess <[email protected]>
Subject: Re: [Haskell-beginners] populating a bloom filter; stymied by
        ST monad
To: Chadda? Fouch? <[email protected]>
Cc: [email protected]
Message-ID: <[email protected]>
Content-Type: text/plain; charset="iso-8859-1"

Chadda? Fouch? wrote:
> getValues update initial = go initial =<< gen
>        where
>                go v [] = return v
>                go v (f:fs) = do
                        x <- val f
> 
> You say that this stream lazily, so I deduce that gen produce a lazy
> IO list. So you should be able to use gen in conjunction with easyList
> to get your bloom filter lazily. I'm not sure what the problem is ?
> How exactly do you get the elements of your bloom filter from gen
> input ?

gen produces a lazy list, but it's then transformed using another IO
operation. I added the relevant line back above. I did it that way to
preserve laziness. An alternate, simpler getvalues suitable for
easyList[1] would be the following, but due to the sequencing done by
mapM, the list does not stream out lazily.

getValues :: IO [v]
getValues update initial = mapM val =<< gen

-- 
see shy jo

[1] If easyList didn't also destroy laziness by running length, anyway..
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 828 bytes
Desc: Digital signature
URL: 
<http://www.haskell.org/pipermail/beginners/attachments/20120313/fe0f0067/attachment-0001.pgp>

------------------------------

Message: 2
Date: Tue, 13 Mar 2012 16:34:10 +0100
From: Michael Schober <[email protected]>
Subject: Re: [Haskell-beginners] Processing a list of files the
        Haskell way
To: [email protected]
Message-ID: <[email protected]>
Content-Type: text/plain; charset=UTF-8; format=flowed

Hi again,

I'm still trying to figure out my earlier problem regarding the 
directory tree and functions on the files.

What I'm really trying to do sounds simple enough: find duplicate files 
for a complete directory, given a root. The original approach I had in 
mind was to compute every file's checksum, which I could feed into a map 
of (checksums,[filepaths]).

On 03/10/2012 01:48 PM, Chadda? Fouch? wrote:
> Note that you solution isn't very "functional-like", but rather
> imperative. On the other hand, making it more functional in this
> particular case come with its own brand of subtle difficulties.

I now wonder how to achieve my goal properly in Haskell. Obviously I 
have some sequential parts of my program (reading directory trees is I/O 
and therefore monadic), but it seems to me that could also be some 
functional part as well (pairing files with checksums, pairing checksums 
with common filepaths, filtering only checksums with more than one 
filepath, maybe even sorting, etc.).

How would you go about to do it? Which parts of the program should be 
monadic, which functional? Would you create a custom-build monad or 
would you use build-ins (which ones?)?

I would really like to have some pointers how an experienced Haskell 
developer would go about such a thing, since I want to acquire a Haskell 
feel of programming rather than keep on programming in my old ways, now 
in Haskell. I would also be happy to be simply pinpointed to other 
applications which source code are exemplary, but not too huge to dig 
through in a sensible amount of time. (I'm currently trying to work 
through the Darcs source code, but that seems a bit too overkill.)

Thanks for any comments and thoughts.

Best,
Michael

------------------------------

Message: 3
Date: Tue, 13 Mar 2012 16:03:53 +0000
From: Ozgur Akgun <[email protected]>
Subject: Re: [Haskell-beginners] populating a bloom filter;     stymied by
        ST monad
To: Joey Hess <[email protected]>
Cc: [email protected]
Message-ID:
        <CALzazPCWXTniJ=qswh5blsxrx4gzdrqewdfwrbhz4xp+0va...@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

Hi,

You can probably use some unsafeInterleaveIO to keep things lazier. Note
that I am not actually suggesting this, an explicit streamy (iteratee,
enumerator, conduit, pipes, ..) solution as suggested by others is probably
superior. However I find that area too complicated for now and waiting for
libraries to settle a bit.

I'll give an example. How to incorporate this into your problem is up to
you, if you ever want to do that.

Try the following:

import Control.Applicative
import System.IO.Unsafe

-- f is a weird function. it prints the result of a computation before
returning it.
f :: Show a => IO a -> IO a
f i = do j <- i; print j; return j

-- let's have an IO [Int] list to use in tests.
xs :: IO [Int]
xs = return [1..10]

ghci> mapM f xs -- prints numbers from 1 to 10, and returns a list: [1..10]
ghci> take 3 <$> mapM f xs -- prints numbers from 1 to 10, and returns a
list: [1..3]

What if we want to be 'lazier', only print those numbers that are in the
output list? Albeit unsafe in certain cases, one way is the following:

g :: Show a => IO a -> IO a
g = unsafeInterleaveIO . f

ghci> take 3 <$> mapM g xs -- prints numbers from 1 to 3 and returns a list
[1..3]

HTH,
Ozgur

On 13 March 2012 15:16, Joey Hess <[email protected]> wrote:

> Chadda? Fouch? wrote:
> > getValues update initial = go initial =<< gen
> >        where
> >                go v [] = return v
> >                go v (f:fs) = do
>                         x <- val f
> >
> > You say that this stream lazily, so I deduce that gen produce a lazy
> > IO list. So you should be able to use gen in conjunction with easyList
> > to get your bloom filter lazily. I'm not sure what the problem is ?
> > How exactly do you get the elements of your bloom filter from gen
> > input ?
>
> gen produces a lazy list, but it's then transformed using another IO
> operation. I added the relevant line back above. I did it that way to
> preserve laziness. An alternate, simpler getvalues suitable for
> easyList[1] would be the following, but due to the sequencing done by
> mapM, the list does not stream out lazily.
>
> getValues :: IO [v]
> getValues update initial = mapM val =<< gen
>
> --
> see shy jo
>
> [1] If easyList didn't also destroy laziness by running length, anyway..
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 
<http://www.haskell.org/pipermail/beginners/attachments/20120313/b89a2fe3/attachment-0001.htm>

------------------------------

Message: 4
Date: Tue, 13 Mar 2012 19:47:19 +0100
From: Chadda? Fouch? <[email protected]>
Subject: Re: [Haskell-beginners] Processing a list of files the
        Haskell way
To: Michael Schober <[email protected]>
Cc: [email protected]
Message-ID:
        <canfjzrya4qpotny-ocjxgag-4ygvgqbq57mokbh_nnxueke...@mail.gmail.com>
Content-Type: text/plain; charset=UTF-8

On Tue, Mar 13, 2012 at 4:34 PM, Michael Schober <[email protected]> wrote:
> Hi again,
>
> I'm still trying to figure out my earlier problem regarding the directory
> tree and functions on the files.
>
> What I'm really trying to do sounds simple enough: find duplicate files for
> a complete directory, given a root. The original approach I had in mind was
> to compute every file's checksum, which I could feed into a map of
> (checksums,[filepaths]).

Is there a problem with the solution I gave you ?

> On 03/10/2012 01:48 PM, Chadda? Fouch? wrote:
>>
>> Note that you solution isn't very "functional-like", but rather
>> imperative. On the other hand, making it more functional in this
>> particular case come with its own brand of subtle difficulties.
>
>
> I now wonder how to achieve my goal properly in Haskell. Obviously I have
> some sequential parts of my program (reading directory trees is I/O and
> therefore monadic), but it seems to me that could also be some functional
> part as well (pairing files with checksums, pairing checksums with common
> filepaths, filtering only checksums with more than one filepath, maybe even
> sorting, etc.).
>
> How would you go about to do it? Which parts of the program should be
> monadic, which functional? Would you create a custom-build monad or would
> you use build-ins (which ones?)?
>
> I would really like to have some pointers how an experienced Haskell
> developer would go about such a thing, since I want to acquire a Haskell
> feel of programming rather than keep on programming in my old ways, now in
> Haskell. I would also be happy to be simply pinpointed to other applications
> which source code are exemplary, but not too huge to dig through in a
> sensible amount of time. (I'm currently trying to work through the Darcs
> source code, but that seems a bit too overkill.)
>

An "experienced" Haskell programmer would probably reuse some of the
"streaming" solutions on Hackage, combine it with md5 and a map :

> module Main where
>
> import Data.Conduit.Filesystem (traverse)
> import qualified Data.Conduit.List as CL
> import Data.Conduit
>
> import Data.Digest.Pure.MD5 (MD5Digest)
> import Crypto.Conduit (hashFile)
>
> import qualified Data.Map as M
> import qualified Filesystem.Path.CurrentOS as FP
> import System.Environment
>
> duplicates :: FilePath -> IO [[FilePath]]
> duplicates dir = runResourceT $ do
>   md5s <- traverse False (FP.decodeString dir) $$ CL.mapM process =$ CL.fold 
> buildMap M.empty
>   return . M.elems . M.filter ((>1).length) $ md5s
>     where
>       process :: FP.FilePath -> IO (MD5Digest, FilePath)
>       process fp = do
>         let strFp = FP.encodeString fp
>         md5 <- hashFile strFp
>         return (md5,strFp)
>       buildMap m (md5,fp) = M.insertWith' (flip (++)) md5 [fp] m
>
> main = do
>   [dir] <- getArgs
>   print =<< duplicates dir

This may be a bit too much and there's a bit of "noise" introduced by
the two FilePath type (the standard one which is a synonym for String,
and the one used in Filesystem.Path and filesystem-conduit which was
introduced to better cope with some exotic filepath and encoding in
filesystems), on the other hand, this is not too inefficient (though
it may be improved by using another structure than Map, and maybe by
using nano-md5 rather than pureMD5) and the memory used is only
proportional to the number of files in the directory which should be
ok.

-- 
Jeda?

------------------------------

Message: 5
Date: Wed, 14 Mar 2012 11:44:53 +0100
From: Michael Schober <[email protected]>
Subject: Re: [Haskell-beginners] Processing a list of files the
        Haskell way
To: Chadda? Fouch? <[email protected]>
Cc: [email protected]
Message-ID: <[email protected]>
Content-Type: text/plain; charset=UTF-8; format=flowed

On 03/13/2012 07:47 PM, Chadda? Fouch? wrote:
> Is there a problem with the solution I gave you ?
No and yes. No in the sense that it most certainly did solve the open 
files problem. Yes in the sense that it still consumed a huge amount of 
memory and, more important to me, left me feeling that I was certainly 
doing something conceptually wrong. That was the reason for my repost, 
since I wanted to learn the 'right' (or at least more suitable) concept.

> An "experienced" Haskell programmer would probably reuse some of the
> "streaming" solutions on Hackage, combine it with md5 and a map :
This seems to be the 'Haskell-way' solution I was looking for. I'm still 
in the progress of digging through a lot of material I found ([1-3], 
links below for other interested readers), but I think that this will be 
very helpful in the future with similar projects as well.

I have yet to test the code you send me last time, but I will give some 
feedback of it as soon as I get the chance.

Thank you very much!
Michael

References:
[1] The original conduit article I found:
http://www.yesodweb.com/book/conduit
[2] Conduits seem to be a development progress originating from 
enumerators/iteratee. Therefore, it seems to be a good idea to read up 
on those prior:
http://www.yesodweb.com/book/enumerator
[3] There's also an article in The Monad.Reader, issue 16 about Iteratee:
http://themonadreader.wordpress.com/2010/05/12/issue-16/

------------------------------

_______________________________________________
Beginners mailing list
[email protected]
http://www.haskell.org/mailman/listinfo/beginners

End of Beginners Digest, Vol 45, Issue 18
*****************************************

Beginners Digest, Vol 45, Issue 18

Reply via email to