Bulat Ziganshin ha scritto:
Hello Manlio,

Monday, March 2, 2009, 6:30:51 PM, you wrote:

The process-data-1 program parse the entire dataset using about 1.4 GB
of memory (3x increment).

This is strange.
The memory required is proportional to the number of ratings.
It may be IntMap the culprit, or the garbage collector than does not release the memory to the operating system (or, worse, does not deallocate all used temporary memory).

ghc has 2 garbage collectors.

> [...]

I already tried to enable the compacting collection (-c RTS option).

Here are some numbers
(when parsing only 5000 movie files, with process-data-1):

1) With default collection algorithm, I have:

    real        1m4.599s
    user        0m49.131s
    sys         0m1.492s

    409 MB

2) With -c option:

    real  1m45.197s
    user  0m59.248s
    sys   0m1.640s

    418 MB


So, nothing changed.

>
> moreover, you may set up"growing factor". with a g.f. of
1.5, for example, memory will be collected once heap will become 1.5x
larger than real memory usage after last GC. this effectively
guarantees that memory overhead will never be over this factor


Thanks.
This seems to be effective (but it also reduce performances).

3) With -F1 option

    real  9m59.789s
    user  9m41.844s
    sys   0m5.608s

    264 MB

I have to parse the whole data set, to check if memory usage is good.


look at GHC manual, RTS switches section. and last - GHC never returns
memory to the OS, there should be ticket on this


The same problem with Python.


By the way: I have written the first version of the program to parse Netflix training data set in D.
I also used ncpu * 1.5 threads, to parse files concurrently.

However execution was *really* slow, due to garbage collection.
I have also tried to disable garbage collection, and to manually run a garbage cycle from time to time (every 200 file parsed), but the performance were the same.

Running the Haskell version with -F1 *seems* (I have to check) to be as slow as the D version.

So, it seems that for this class of problems it is better to do manual memory management (and it is not even hard). When I find some time I will try to write a C++ version (I also have a Python version, but it is very memory inefficient).


At least I hope that I can serialize the parsed data in a binary file, and then read it again, with optimized memory usage.



Thanks  Manlio Perillo

_______________________________________________
Haskell-Cafe mailing list
[email protected]
http://www.haskell.org/mailman/listinfo/haskell-cafe

Reply via email to