On Tuesday, 23 April 2013 00:00:33 UTC+8, Volker Braun wrote:
The first question is: are you actually running out of RAM? The
garbage collector seems to have triggered full collections at the
4 GB mark, and memory fragmentation might have left you with 900 MB of
address space that is mainly empty. Also, do you really need all 4
million graphs in memory simultaneously? Use @parallel to iterate
over them if that's all you need.
No, I am not running out of RAM with the small sample, but I will
certainly run out of RAM if everything scales as it appears to be doing.
The real issue for me is that the file of 4.2 million graphs occupies
about 0.6 GB on disk, so when I was roughly working out what to do,
it never occurred to me that there would be the slightest problem in
storing it in memory (my machine has 16 GB of RAM). Ultimately I need to
create matroids from the graphs and keep only pairwise non-isomorphic
ones. Given those sizes, it seemed that it would be easy to just store
everything in memory and work from there. I could write a more
complicated and clever routine that works in batches or uses some
additional theory, etc., but I prefer to do that only when necessary.
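If batching does become necessary, one workable pattern (a sketch in plain Python, not Sage-specific; the `invariant` and `isomorphic` callables are placeholders the caller supplies, e.g. a degree sequence and Sage's isomorphism test) is to bucket objects by a cheap isomorphism invariant so the expensive pairwise tests only run within each bucket:

```python
from collections import defaultdict

def unique_up_to_iso(objects, invariant, isomorphic):
    """Keep one representative per isomorphism class.

    `invariant` maps an object to a hashable value preserved by
    isomorphism; `isomorphic` decides isomorphism exactly.  The
    expensive pairwise tests only run inside each invariant bucket,
    and `objects` can be any iterable, so it streams from disk.
    """
    buckets = defaultdict(list)
    for obj in objects:
        reps = buckets[invariant(obj)]
        if not any(isomorphic(obj, rep) for rep in reps):
            reps.append(obj)
    return [rep for reps in buckets.values() for rep in reps]

# Toy usage: "objects" are lists, two lists count as isomorphic when
# they are equal as multisets, and the sorted tuple is the invariant.
reps = unique_up_to_iso(
    [[1, 2], [2, 1], [1, 3]],
    invariant=lambda o: tuple(sorted(o)),
    isomorphic=lambda a, b: sorted(a) == sorted(b),
)
print(len(reps))  # two classes survive
```

Because `objects` is consumed lazily, only the surviving representatives stay in memory, which is exactly the batch-friendly shape needed here.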
So when this happened to my tiny sample file, I assumed that I must be
doing something spectacularly stupid, for example accidentally calling a
method that keeps producing new objects (rather than mutating an
existing object), hence the posting.
I did some more experiments: I took a file with about 1 million lines
and deleted all references to Graph, so that it was just a big
collection of tuples (wrapped up into one big list literal rather than
built by repeatedly calling "append"):
gs = [
[(0,1,0),(0,1,1),(0,1,2),(0,1,3),(0,1,4),(0,1,5),(0,1,6),(0,2,7),(0,2,8),(0,2,9),(0,1,10),(0,1,11),(1,2,12),(1,2,13),(1,2,14),(3,4,15)],
[(0,1,0),(0,1,1),(0,1,2),(0,1,3),(0,1,4),(0,1,5),(0,1,6),(0,2,7),(0,2,8),(0,2,9),(0,1,10),(0,1,11),(1,2,12),(1,2,13),(1,3,14),(2,4,15)],
... a million more lines
This file occupies 157 MB on disk.
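Part of any blow-up here is the per-object overhead of CPython containers: every small tuple and every int carries tens of bytes of header, so a 157 MB text file can easily expand several-fold once parsed. A quick order-of-magnitude estimate (plain CPython, no Sage needed; the row below is one line of the test file):

```python
import sys

# One line of the test file: a list of 16 (int, int, int) tuples.
row = [(0, 1, 0), (0, 1, 1), (0, 1, 2), (0, 1, 3),
       (0, 1, 4), (0, 1, 5), (0, 1, 6), (0, 2, 7),
       (0, 2, 8), (0, 2, 9), (0, 1, 10), (0, 1, 11),
       (1, 2, 12), (1, 2, 13), (1, 2, 14), (3, 4, 15)]

def deep_size(obj):
    """Rough recursive size: container header plus element sizes.

    Ignores CPython's small-int caching, so shared ints are counted
    more than once -- good enough for an order-of-magnitude estimate.
    """
    size = sys.getsizeof(obj)
    if isinstance(obj, (list, tuple)):
        size += sum(deep_size(item) for item in obj)
    return size

per_row = deep_size(row)
print(f"~{per_row} bytes per row, "
      f"~{per_row * 10**6 / 2**30:.1f} GiB for a million rows")
```

On a 64-bit build each 3-tuple alone is around 64 bytes before its elements, which is already several times the size of its text representation on disk.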
I made a variant of this file suitable for input into another computer
algebra system, in this case Magma
gs := [
[[0,1,0],[0,1,1],[0,1,2],[0,1,3],[0,1,4],[0,1,5],[0,1,6],[0,2,7],[0,2,8],[0,2,9],[0,1,10],[0,1,11],[1,2,12],[1,2,13],[1,2,14],[3,4,15]],
[[0,1,0],[0,1,1],[0,1,2],[0,1,3],[0,1,4],[0,1,5],[0,1,6],[0,2,7],[0,2,8],[0,2,9],[0,1,10],[0,1,11],[1,2,12],[1,2,13],[1,3,14],[2,4,15]],
... a million more lines
Then I tried
load "tst.magma" (in Magma)
%runfile tst.sage (in Sage)
to see the difference...
With Magma, it took 45 seconds to read in the file, and the memory usage
(as reported by ps) grew seemingly monotonically from about 10 MB to
about 4.8 GB over that period.
With Sage, I had to kill the job after 12 minutes because the process
had blown out to 12 GB of real memory and 36 GB of virtual memory, and
the computer was barely responsive.
This is making it hard for me to work with large data sets, but perhaps
Sage is simply the wrong tool for this job?
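One mitigation I could try, sketched here in plain Python under the assumption that keeping one edge list per line in the file is acceptable, is to stream the file and parse each line with `ast.literal_eval`, so peak memory holds one edge list at a time instead of the whole list literal:

```python
import ast
import tempfile

def stream_edge_lists(path):
    """Yield one edge list per line; never materialise the whole file.

    Assumes one bracketed edge list per line; the 'gs = [' header and
    the closing ']' are skipped, and trailing commas are stripped.
    """
    with open(path) as f:
        for line in f:
            line = line.strip().rstrip(",")
            if not line.startswith("["):
                continue  # header or footer line
            yield ast.literal_eval(line)

# Tiny demonstration with a two-row file in the same layout as tst.sage.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("gs = [\n")
    tmp.write("[(0,1,0),(0,1,1),(3,4,15)],\n")
    tmp.write("[(0,1,0),(1,3,14),(2,4,15)],\n")
    tmp.write("]\n")
    path = tmp.name

for edges in stream_edge_lists(path):
    print(len(edges), "edges")
```

Each yielded edge list could then be turned into a Graph (and on to a matroid) and discarded, so only the de-duplicated survivors accumulate in memory.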