Beginners Digest, Vol 28, Issue 22

beginners-request Tue, 12 Oct 2010 10:11:41 -0700

Send Beginners mailing list submissions to
        [email protected]

To subscribe or unsubscribe via the World Wide Web, visit
        http://www.haskell.org/mailman/listinfo/beginners
or, via email, send a message with subject or body 'help' to
        [email protected]


You can reach the person managing the list at
        [email protected]

When replying, please edit your Subject line so it is more specific
than "Re: Contents of Beginners digest..."


Today's Topics:

   1. Re:  Memory tuning (Thorsten Hater)
   2. Re:  Memory tuning (Daniel Fischer)
   3. Re:  Reading Multiple Files and Iterate Function  Application
      (Lorenzo Isella)
   4. Re:  Re: how to read file with locking (Brandon S Allbery KF8NH)
   5. Re:  Reading Multiple Files and Iterate Function  Application
      (Daniel Fischer)
   6.   Word8, Word32, ByteString, Int, etc. conversions. (David McBride)


----------------------------------------------------------------------

Message: 1
Date: Mon, 11 Oct 2010 23:29:30 +0200
From: Thorsten Hater <[email protected]>
Subject: Re: [Haskell-beginners] Memory tuning
To: Daniel Fischer <[email protected]>
Cc: [email protected]
Message-ID: <[email protected]>
Content-Type: text/plain; charset=UTF-8

 On 11.10.2010 17:54, Daniel Fischer wrote:
> On Monday 11 October 2010 16:14:11, Thorsten Hater wrote:
>> Not only is the actual memory consumption quite high, but the
>> GC/computation ratio is almost 1:1.
> You construct a largish map (about 350000 entries), doing a lot (more than 
> a million) of insertWith's.
> That means frequent rebalancing (for new circumferences) and many updates 
> of already present values (which in all likelihood are heap allocated 
> Ints), both give the garbage collector a lot of work. Further, you 
> construct a lot of lists, which need to be GCed too.
> Also, the memory overhead of Maps is nontrivial, I think five or six words 
> per item, so a Map Int Int needs about size*8(or 7)*sizeof(Int) bytes.
>
> You can reduce the GC time by specifying a larger allocation area (e.g.
> +RTS -A16M -RTS, the default size is 512K) thus making GC less frequent.
>
>> I assume that there is something to be done regarding strictness,
> Nothing obvious.
> But for the computation pattern you use, since the ratio of possible values 
> to occurring values is not too high, using accumArray (on an Unboxed array) 
> is far better.
>
>> but a series of experiments with
>> BangPatterns return no yields at all (using foldr instead of foldl'
>> produces stack overflow)
> Yes, using foldr with insert(With)s into Maps is a Bad Ideaâ¢
>
Thank you for pointing me to Unboxed Arrays and accumarray, it works
like a charm.
The new statistics:

      35,180,328 bytes allocated in the heap
          13,608 bytes copied during GC
      12,027,912 bytes maximum residency (2 sample(s))
         594,520 bytes maximum slop
              13 MB total memory in use (0 MB lost due to fragmentation)

  Generation 0:    43 collections,     0 parallel,  0.00s,  0.00s elapsed
  Generation 1:     2 collections,     0 parallel,  0.00s,  0.00s elapsed

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    0.14s  (  0.17s elapsed)
  GC    time    0.00s  (  0.00s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time    0.14s  (  0.17s elapsed)

  %GC time       0.0%  (0.6% elapsed)

  Alloc rate    251,288,057 bytes per MUT second

  Productivity 100.0% of total user, 82.9% of total elapsed

I don't think there is much left to do in terms of optimization.
So I'm left with the question of when to use Data.Map, if accumarray
is so much better even for heavy number crunching as this.
Especially when Data.Map advertises as efficient.
(Even without Unboxed, normal Arrays beat Maps by a factor of 2)
Best regards.


------------------------------

Message: 2
Date: Tue, 12 Oct 2010 00:52:07 +0200
From: Daniel Fischer <[email protected]>
Subject: Re: [Haskell-beginners] Memory tuning
To: Thorsten Hater <[email protected]>
Cc: [email protected]
Message-ID: <[email protected]>
Content-Type: text/plain;  charset="utf-8"

On Monday 11 October 2010 23:29:30, Thorsten Hater wrote:
>
> Thank you for pointing me to Unboxed Arrays and accumarray, it works
> like a charm.
> The new statistics:
>
>       35,180,328 bytes allocated in the heap
>           13,608 bytes copied during GC
>       12,027,912 bytes maximum residency (2 sample(s))
>          594,520 bytes maximum slop
>               13 MB total memory in use (0 MB lost due to fragmentation)
>
>   Generation 0:    43 collections,     0 parallel,  0.00s,  0.00s
> elapsed Generation 1:     2 collections,     0 parallel,  0.00s,  0.00s
> elapsed
>
>   INIT  time    0.00s  (  0.00s elapsed)
>   MUT   time    0.14s  (  0.17s elapsed)
>   GC    time    0.00s  (  0.00s elapsed)
>   EXIT  time    0.00s  (  0.00s elapsed)
>   Total time    0.14s  (  0.17s elapsed)
>
>   %GC time       0.0%  (0.6% elapsed)
>
>   Alloc rate    251,288,057 bytes per MUT second
>
>   Productivity 100.0% of total user, 82.9% of total elapsed

Yes, that's more like it :)

>
> I don't think there is much left to do in terms of optimization.

Generate the primitive triangles by a more efficient method, all those 
gcd's take a lot of time.

> So I'm left with the question of when to use Data.Map, if accumarray
> is so much better even for heavy number crunching as this.

That's not really heavy number crunching, it's a pique-nique.

There are several considerations.

- sparse (irregular) data
If only one in 100 possible data points actually occurs, an array is a huge 
waste of space. You quickly lose cache-locality then, and Maps are often 
faster in those cases.

- unknown bounds
That rules out arrays.

- pattern of computation
For this problem, accumArray is perfectly suited, but that is not always 
the case. Consider problems where you update your Map using previously 
stored data, then accumArray isn't applicable. In such cases Maps give you 
nice code and usually decent performance, although in most cases if your 
data is not sparse, using an unboxed mutable array (STUArray) if possible 
gives you much better performance - but at the cost of uglier code. A boxed 
mutable array (STArray) will in many cases also perform better than a Map, 
but usually not as well as an unboxed one.


Of course, arrays can't be used (or only with great convolutions) for stuff 
like Map String a, where the key can't be mapped easily to an array index.

> Especially when Data.Map advertises as efficient.

It is efficient for what it does (of course, there's always room for 
improvement), but it's not the appropriate data structure for everything.
By default, Haskell's data structures are immutable, thus if you modify a 
Map, you get a new one, which typically requires copying of O(log size) 
nodes (and the bulk of the Map is shared, otherwise performance would be 
abysmal).
But still, that is a nontrivial operation, much more costly than modifying 
a mutable map in an impure language (where such a modification is often 
just relinking a few pointers).
So if you have an algorithm that requires frequent updates, the cost of 
these operations makes itself felt.

But what happens if you use an array instead of a Map?
If the keys map easily to array indices, that is very easy to do, but if 
you use immutable arrays, each update requires copying the entire array (in 
case of boxed arrays only the pointers to the values are copied, but still, 
allocating a new memory block and copying 10 million pointers likely takes 
much longer than copying 20-25 nodes of a Map). If you use a mutable array, 
such a modification is cheap, only one array slot needs to be changed. But 
that ties you to a mutability monad (ST s or IO, mostly), which has its 
downsides too.
And if data is sparse, you again waste a lot of space.

And if the keys don't map easily to indices, well, you can implement a Map 
as an Array Int (Key,Value) or a pair (Array Int Key, Array Int Value), 
keeping the entries sorted by the keys.
In fact, if you build the map once and use it later for a lot of lookups, 
that is often better than a Map (regarding performance).
Both give you O(log size) lookup, but the array solution can have less 
space overhead.
However, inserting a new element or deleting an element becomes an O(size) 
operation, so if you have to do that a lot, this is a terrible solution, 
you should use a Map.


> (Even without Unboxed, normal Arrays beat Maps by a factor of 2)
> Best regards.



------------------------------

Message: 3
Date: Tue, 12 Oct 2010 11:02:47 +0200
From: Lorenzo Isella <[email protected]>
Subject: Re: [Haskell-beginners] Reading Multiple Files and Iterate
        Function        Application
To: Thomas Miedema <[email protected]>
Cc: [email protected]
Message-ID: <[email protected]>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed

On 10/11/2010 07:20 PM, Thomas Miedema wrote:
> Hi,
>
> replace
> let fl = getAllLengths nums
>
> by
> fl <- getAllLengths nums
>
> , since getAllLengths returns a monadic action.
>
> The author of the following book is much better at explaining why this
> is so than I am: http://learnonlineyouahaskell.com/
> <http://learnyouahaskell.com/>. May I suggest you read it cover to
> cover, it's really really good. It is probably the best way to learn
> Haskell at the moment, together with trying out code snippets like
> you're doing right now.
>
> Regards,
> Thomas
>
>
>
> On Mon, Oct 11, 2010 at 6:06 PM, Lorenzo Isella
> <[email protected] <mailto:[email protected]>> wrote:
>
>     Thanks a lot Daniel, but I am a bit lost (up to not long ago I did
>     not even know the existence of a control monad...and some
>     unstructured reading did not help).
>     Some online research about mapM and fmap led me here
>     http://en.wikibooks.org/wiki/Haskell/Category_theory
>     and I think I am a bit astray at this point ;-)
>
>     Why does my "simple" snippet below raise a number of errors?
>
>     Cheers
>
>     Lorenzo
>
>
>     import Data.Ord
>
>     import Data.List
>
>     main :: IO ()
>
>     main = do
>
>       let nums=[1,2]
>
>       let fl = getAllLengths nums
>
>       putStrLn "fl is, "
>       print fl
>
>
>
>     filename :: Int -> FilePath
>     filename i = "file" ++ show i ++ ".dat"
>
>     fileLength :: FilePath -> IO Int
>     fileLength file = fmap length (readFile file)
>
>     getAllLengths :: [Int] -> IO [Int]
>     getAllLengths nums = mapM (fileLength . filename) nums
>
>
>
>     On 10/11/2010 05:21 PM, Daniel Fischer wrote:
>
>         On Monday 11 October 2010 16:56:58, Lorenzo Isella wrote:
>
>             Dear All,
>             Another I/O question.
>             Let us say that you are given a list of files file1.dat,
>             file2.dat...file10.dat and so on (i.e. every file is indexed
>             by a number
>             and every file is a single column where every entry is a
>             string without
>             spaces).
>             In the snippet below I read file1.dat, convert it to a list
>             and then
>             print out its length.
>             Now, how can I iterate the process on file1.dat, file2.dat
>             and file3.dat
>             and store the lengths in a list?
>
>
>         fileLength :: FilePath ->  IO Int
>         fileLength file = fmap length (readFile file)
>
>         filename :: Int ->  FilePath
>         filename i = "file" ++ show i ++ ".dat"
>
>         getAllLengths :: [Int] ->  IO [Int]
>         getAllLengths nums = mapM (fileLength . filename) nums
>
>
>         If you want something other than the character count, instead of
>         fileLength
>         use e.g.
>
>         countLines :: FilePath ->  IO Int
>         countLines file = fmap (length . lines) (readFile file)
>
>         or whatever you're interested in.
>
>         Another nice thing is often forM (from Control.Monad)
>
>         forM nums $ \i ->  do
>             let filename = "file" ++ show i ++ ".dat"
>             contents<- readFile filename
>             let result = function contents
>             doSomethingOrNot
>             return result
>
>             I would like to map the file reading and following
>             operations on the
>             list [1,2.3], but that is giving me a headache.
>             It is relatively easy to create the file name
>
>             filename="file"++(show i)++".dat"   , for i=1,2,3
>
>             but it the the iteration part that is giving me troubles.
>             Any suggestion is appreciated.
>             Cheers
>
>             Lorenzo
>
>
>     _______________________________________________
>     Beginners mailing list
>     [email protected] <mailto:[email protected]>
>     http://www.haskell.org/mailman/listinfo/beginners
>
>

Thanks Thomas. Yep, I do need some extra reading unfortunately.
One question: if I was to apply a function on many files file1, 
file2...using e.g. Python, this would be my pipeline
read file1
do stuff on file 1

read file2
do stuff on file 2

......

Now, due to the laziness of haskell, can I here resort to this approach

read file1, file2... into a single list

map (do-my-stuff) on list

As far as I understand, this should not result e.g. into a huge RAM 
consumptions since files are read and processed only when needed (hence 
one at the time).
Am I on the right track?
Cheers

Lorenzo




------------------------------

Message: 4
Date: Tue, 12 Oct 2010 07:31:53 -0400
From: Brandon S Allbery KF8NH <[email protected]>
Subject: Re: [Haskell-beginners] Re: how to read file with locking
To: [email protected]
Message-ID: <[email protected]>
Content-Type: text/plain; charset=UTF-8

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 10/11/10 01:14 , Jimmy Wylie wrote:
>     I don't think I quite understand. How is withFile exception-safe? Under
> the covers it's using openFile. I was under the impression withFile was just

Under the covers it's also using Control.Exception.bracket, which you should
probably look at.

- -- 
brandon s. allbery     [linux,solaris,freebsd,perl]      [email protected]
system administrator  [openafs,heimdal,too many hats]  [email protected]
electrical and computer engineering, carnegie mellon university      KF8NH
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.10 (Darwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAky0RykACgkQIn7hlCsL25UEDwCgrx0fhk1b/MGjWdwixp9vIZW9
zyYAoJyV5p0GG4Ye/wXhvW0UjYZyMuNW
=1Ces
-----END PGP SIGNATURE-----


------------------------------

Message: 5
Date: Tue, 12 Oct 2010 14:22:08 +0200
From: Daniel Fischer <[email protected]>
Subject: Re: [Haskell-beginners] Reading Multiple Files and Iterate
        Function        Application
To: [email protected]
Cc: Lorenzo Isella <[email protected]>
Message-ID: <[email protected]>
Content-Type: text/plain;  charset="iso-8859-1"

On Tuesday 12 October 2010 11:02:47, Lorenzo Isella wrote:
> Thanks Thomas. Yep, I do need some extra reading unfortunately.
> One question: if I was to apply a function on many files file1,
> file2...using e.g. Python, this would be my pipeline
> read file1
> do stuff on file 1
>
> read file2
> do stuff on file 2
>
> ......
>
> Now, due to the laziness of haskell, can I here resort to this approach
>
> read file1, file2... into a single list
>
> map (do-my-stuff) on list
>
> As far as I understand, this should not result e.g. into a huge RAM
> consumptions since files are read and processed only when needed (hence
> one at the time).
> Am I on the right track?

Yes, but there are dangers on that way.
With readFile, the contents are read lazily upon demand, but the file is 
opened immediately for reading. So

contentsList <- mapM readFile fileList

or

allContents <- fmap concat $ mapM readFile fileList

can make you run out of file handles if fileList is long enough.

Also, the file handles aren't closed until the entire contents of the file 
has been read (there are a few situations where the handle is closed 
earlier) and they're not guaranteed to be immediately closed when the end 
of the file has been reached, they could linger for a GC or two.
That means you can also run out of file handles when you process the files 
sequentially (if you have a bad consumption pattern).

The memory usage depends on your consumption pattern, independent of 
whether theSting[s] you process come[s] from file readings or from a non-IO 
generator.
If you keep references to the beginning of the list, you get a leak, if you 
consume the list sequentially, it runs in small space.

> Cheers
>
> Lorenzo



------------------------------

Message: 6
Date: Tue, 12 Oct 2010 12:38:31 -0400
From: David McBride <[email protected]>
Subject: [Haskell-beginners]    Word8, Word32, ByteString, Int, etc.
        conversions.
To: [email protected]
Message-ID:
        <[email protected]>
Content-Type: text/plain; charset="iso-8859-1"

I'm writing a pcap sniffer, and I have IP addresses from two different
libraries that are equivalent, but typed differently.

Data.IP.IPv4 Data.ByteString.Internal.ByteString

and

Network.Info.IPv4 = Network.Info.IPv4 !GHC.Word.Word32

Both are probably identical, but I cannot for the life of me figure out how
to get from one to the other.  Bytestring has the ability to take arrays of
Word8, how do I split Word32 into Word8's?

I have also had this problem in the past with Word8's, Octets and Chars.
Beyond this specific problem, what is a good strategy for figuring out how
to do these conversions in the future?  Sometimes I find a function
somewhere, but I have never found a general way to deal with it.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 
http://www.haskell.org/pipermail/beginners/attachments/20101012/d0efdf36/attachment-0001.html

------------------------------

_______________________________________________
Beginners mailing list
[email protected]
http://www.haskell.org/mailman/listinfo/beginners


End of Beginners Digest, Vol 28, Issue 22
*****************************************

Beginners Digest, Vol 28, Issue 22

Reply via email to