Hi Johan,

unfortunately it's a bit trickier. MATCH operations (which also include
MERGE) pull in all rows eagerly when they are followed by CREATE operations.
This happens to protect against situations where you match against the same
data that you create, which would otherwise lead to an endless loop of
created data.
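
For illustration, here is a minimal sketch of the pattern that triggers this
eagerness (the file name and labels are made up): because the MATCH reads the
same kind of :Person nodes that the CREATE writes, Cypher materializes all
rows up front instead of streaming them.

// hypothetical: MATCH reads :Person while CREATE writes :Person,
// so all rows are pulled in eagerly before any write happens
LOAD CSV FROM 'file:///people.csv' AS line
MATCH (p:Person {name:line[0]})
CREATE (p)-[:KNOWS]->(:Person {name:line[1]});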

In general I think for 12M rows you might have more success with the
batch-importer: http://github.com/jexp/batch-import
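
If you go that route, the invocation would look something like this (a rough
sketch from memory, so please double-check the script name and argument order
against the project README):

$ ./import.sh test.db nodes.csv rels.csv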

Since during the import you mostly write, you can disable the high-level
object cache by adding this setting to your `conf/neo4j.properties`:

cache_type=none

You can start the neo4j-shell either against a server which uses that
config file, or standalone by passing the file via -config (as shown below).

Ok, I wanted to try it out myself, so I generated a 12M-row CSV file using
the simple test-data generator from my batch-importer, which should be
similar to your file:

$ wc -l rels.csv
12594008 rels.csv

$ head -5 rels.csv
Start Ende Type Property Counter:long
239882 139576 TWO Property 1
456809 629052 ONE Property 2
611043 767051 TEN Property 3
181428 1012101 NINE Property 4

I used neo4j-shell with a heap size of 12G and a young-generation size of
2G. I edited bin/neo4j-shell and set:

EXTRA_JVM_ARGUMENTS="-Xmx12G -Xms12G -Xmn2G"

Then I upped the memory-mapped I/O limits in conf/neo4j.properties and
disabled the object cache:

cache_type=none
# Default values for the low-level graph engine
neostore.nodestore.db.mapped_memory=250M
neostore.relationshipstore.db.mapped_memory=5G
neostore.propertystore.db.mapped_memory=1G
neostore.propertystore.db.strings.mapped_memory=500M

I started the neo4j-shell using that config file:

$ neo4j/bin/neo4j-shell -config ~/neo4j/conf/neo4j.properties -path test.db

First I imported all nodes from column 0, using DISTINCT and CREATE instead
of MERGE:

USING PERIODIC COMMIT 50000
LOAD CSV FROM 'file:///home/michael/import/batch-import/rels.csv' AS line
FIELDTERMINATOR '\t'
WITH line
SKIP 1
WITH DISTINCT line[0] AS name
CREATE (:Node {name:name});

Took 40s to create almost 1.2M nodes.

Nodes created: 1199966
40484 ms

Running the same statement with a constraint and MERGE did not finish for
me.

Only then did I add the constraint, which took 6s:

CREATE CONSTRAINT ON (n:Node) ASSERT n.name IS UNIQUE;
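
As a quick sanity check, the shell's schema command should now list the new
uniqueness constraint:

schema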

Then I used DISTINCT and MERGE to create the nodes from column 1, which took 56s:

USING PERIODIC COMMIT 50000
LOAD CSV FROM 'file:///home/michael/import/batch-import/rels.csv' AS line
FIELDTERMINATOR '\t'
WITH line
SKIP 1
WITH DISTINCT line[1] AS name
MERGE (:Node {name:name});

Then I chunked up the 12M-line file into 1M-line segments to import the
relationships; each of the 12 runs took about 45s. I set the offset as a
shell parameter and re-ran the statement with it (a scripted version follows
the manual steps below):

export skip=0

LOAD CSV FROM 'file:///home/michael/import/batch-import/rels.csv' AS line
FIELDTERMINATOR '\t'
WITH line
SKIP {skip} LIMIT 1000000
MATCH (node1:Node {name:line[0]})
MATCH (node2:Node {name:line[1]})
CREATE (node1)-[:REL {count:toInt(line[4])}]->(node2);

export skip=1000000

... repeat, increasing skip by 1M each time, until no more rels are created ...
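
Instead of repeating this by hand, you could drive the chunks from a small
shell loop; a rough sketch assuming neo4j-shell's -c option and the paths
used above:

# hedged sketch: run the 1M-row chunks in a loop instead of manually
# (chunk count derived from the ~12.6M-line file above)
for skip in $(seq 0 1000000 12000000); do
  neo4j/bin/neo4j-shell -config ~/neo4j/conf/neo4j.properties -path test.db \
    -c "LOAD CSV FROM 'file:///home/michael/import/batch-import/rels.csv' AS line
        FIELDTERMINATOR '\t'
        WITH line SKIP $skip LIMIT 1000000
        MATCH (node1:Node {name:line[0]})
        MATCH (node2:Node {name:line[1]})
        CREATE (node1)-[:REL {count:toInt(line[4])}]->(node2);"
done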

This generated a database directory of 1.5GB:

$ du -sh test.db/
1.5G test.db/

It contains 1.2M nodes and 12.5M rels:

match (n) return count(*);
+----------+
| count(*) |
+----------+
| 1200000  |
+----------+
1 row
623 ms

start r=rel(*) return count(*);
+----------+
| count(*) |
+----------+
| 12594007 |
+----------+
1 row
6187 ms
