And here the problem arrives: the same input .csv file described above,
nodes already imported, *now establishing relationships*. Second pass over
the same file. The query is fairly simple:
//
// I *do* know that Neo4j is very sensitive to batch size when importing
// relationships, so 50-100 is more than enough
//
USING PERIODIC COMMIT 50
LOAD CSV FROM "file:///home/stesin/Documents/step1/3_generic/50_main.csv" AS row
MATCH
  (src:LeftLabel { p1: row[0] }),
  (dst:RightLabel { p2: row[2], coll1: SPLIT(row[3], ':'), future_labels_coll: SPLIT(row[4], ':') })
USING INDEX src:LeftLabel(p1)
USING INDEX dst:RightLabel(p2)
MERGE
  (src)-[r:IS_RELATED_TO { coll3: SPLIT(row[1], ':') }]->(dst)
ON CREATE SET
  ...just SETting a bunch of properties r.some_payload here...
;
The profiler assured me that both *src* and *dst* nodes would be accessed
via fast lookups in the corresponding indices. OK so far.
I freed almost all of my 16 GB of RAM, configured a 4 GB heap for the Neo4j
JVM, increased the MMIO buffers 2.5x, and made sure there was no
auto-indexing.
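For reference, the knobs involved live in the two config files of a
Neo4j 2.1-era install; the sketch below is illustrative only -- the exact
values are assumptions, not copied from my actual config:

```
# conf/neo4j-wrapper.conf -- JVM heap (assumed 4 GB, matching the text above)
wrapper.java.initmemory=4096
wrapper.java.maxmemory=4096

# conf/neo4j.properties -- memory-mapped I/O buffer sizes (illustrative numbers)
neostore.nodestore.db.mapped_memory=500M
neostore.relationshipstore.db.mapped_memory=2G
neostore.propertystore.db.mapped_memory=500M

# conf/neo4j.properties -- make sure auto-indexing is off
node_auto_indexing=false
relationship_auto_indexing=false
```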
Then I opened http://localhost:7474/webadmin/# and fed the script to
neo4j-shell.
The first import attempt got stuck when webadmin reported some 380,000+
relationships in the db. No diagnostics, complete silence in the logs, but
the neo4j server just stopped responding both to webadmin and to new shell
connections. After I restarted the server (it was shut down brutally after
failing to close gracefully), webadmin told me there were only about
16k+ relationships in the db.
I tried once more with the same result; it just went south much earlier, at
about 190k+ relationships.
This is the very same behavior we observed with early 2.1 releases! So what
I did was apply an old proven trick:
bash$ split -a 3 -l 10000 50_main.csv 50_main.
I got a bunch of 10000-line files named from 50_main.aaa to 50_main.ajo (a
total of 249 files), then wrote a bash script:
#!/bin/bash
for file_name in 50_main.a*
do
    echo -n "${file_name} ... "
    qry="USING PERIODIC COMMIT 50"
    qry="$qry LOAD CSV FROM \"file:///home/stesin/Documents/step1/3_generic/${file_name}\" AS row"
    qry="$qry MATCH"
    ...
    # ... line by line I built the whole query...
    neo4j-shell -c "$qry" || exit 1
    echo -n "sleeping... "
    sleep 5
    echo "done."
done
This worked: each 10000-line file took between 1600 and 2600 ms (say ~8
seconds per file including the sleep 5), and the whole import completed
OK in about half an hour.
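The split step itself is easy to check on synthetic data (a hedged sketch;
demo.csv and the scratch directory are made up for illustration):

```shell
#!/bin/bash
# Build a synthetic 25,000-line CSV and split it the same way as 50_main.csv.
cd "$(mktemp -d)"
seq 1 25000 > demo.csv

# -a 3: three-character alphabetic suffixes; -l 10000: 10000 lines per chunk.
split -a 3 -l 10000 demo.csv demo.

ls demo.a??        # the three chunk files: demo.aaa demo.aab demo.aac
wc -l demo.aaa     # full chunks hold exactly 10000 lines
wc -l demo.aac     # the last chunk holds the 5000-line remainder
```

With three-character suffixes the names run aaa, aab, ... so 249 chunks end
exactly at .ajo, matching the file names above.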
I think this may be some kind of TCP/IP (HTTP) session timeout issue: when
some operation (transaction) takes too long, one side closes its end of the
session while the other side (the server?) continues to think the session
is still up.
Otherwise, why does the very same data, cut into 10000-line pieces, import
OK, while the complete file does not?
WBR,
Andrii
--
You received this message because you are subscribed to the Google Groups
"Neo4j" group.
To unsubscribe from this group and stop receiving emails from it, send an email
to [email protected].
For more options, visit https://groups.google.com/d/optout.