Hello,
I am following Rik's advice and using the batch importer from
https://github.com/jexp/batch-import/tree/20
It looks really promising!
However, I still have issues when using my own custom CSV files with the batch importer:
Exception in thread "main" org.neo4j.graphdb.NotFoundException: id=39
at org.neo4j.unsafe.batchinsert.BatchInserterImpl.getNodeRecord(BatchInserterImpl.java:1215)
at org.neo4j.unsafe.batchinsert.BatchInserterImpl.createRelationship(BatchInserterImpl.java:777)
at org.neo4j.batchimport.Importer.importRelationships(Importer.java:154)
at org.neo4j.batchimport.Importer.doImport(Importer.java:232)
at org.neo4j.batchimport.Importer.main(Importer.java:83)
I think I am getting closer to this mega-import (I really hope so :P)!
Could you please help me figure out what the problem may be?
My hypotheses:
1. I thought it might be because a node referenced as the start/end of a
relationship cannot be found.
So I checked my nodes.csv and rels.csv and made trivial files with two nodes
and one relationship, but I still got the error.
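For reference, here is the kind of trivial pair of files I tried (tab-separated; I am guessing here that rels.csv has to reference nodes by their 0-based creation order, as mentioned below in this thread, rather than by my own id column):

```
nodes.csv:
name:string
mark
julie

rels.csv:
start	end	type
0	1	KNOWS
```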
2. In the batch importer documentation it says that you
- have to know the max # of rels per node, and of properties per node and
relationship.
Where and how should this be specified? In nodes.csv or rels.csv?
Should the number of relationships be specified in the 'rels' column of
nodes.csv, as in the test.db example?
But it is not written in the documentation example on git. I'm confused!
3. The documentation paragraph about schema indexes is not clear to me: does it
mean I can take the node.csv and rels.csv files used for test.db and
modify the headers and the batch.properties file according to my own custom
structure?
What does the counter:int property refer to?
Here is what I've done:
1. headers in nodes.csv and rels.csv
Nodes.csv headers:
id:int mynamelabel:label name:string:mynodeindex
Rels.csv headers:
id:int id:int type proximity counter:int
2. indexes
I want to use my own indexes:
node.id values (specified as int) are unique, but not in progressive order.
Is that an issue?
E.g. my node list looks like:
node.id property
25 mark
39 julie
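In case the non-progressive ids are the problem and relationships must reference nodes by their 0-based creation order (my assumption, based on the "node-id start from 0" note later in this thread), here is a sketch of the workaround I have in mind: remap my custom ids to row offsets before writing rels.csv.

```python
# Sketch: remap non-sequential custom node ids (e.g. 25, 39) to the
# 0-based row offsets the importer would assign in creation order.
nodes = [(25, "mark"), (39, "julie")]   # (custom_id, name), as in nodes.csv
rels = [(25, 39, "KNOWS")]              # (start, end, type), using custom ids

# custom id -> row offset in nodes.csv
offset = {custom_id: row for row, (custom_id, _) in enumerate(nodes)}

remapped_rels = [(offset[s], offset[e], t) for s, e, t in rels]
print(remapped_rels)  # [(0, 1, 'KNOWS')]
```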
What is the difference between an exact index and a fulltext index?
3. My batch.properties
dump_configuration=false
cache_type=none
use_memory_mapped_buffers=true
neostore.propertystore.db.index.keys.mapped_memory=5M
neostore.propertystore.db.index.mapped_memory=5M
# 14 bytes per node
neostore.nodestore.db.mapped_memory=200M
# 33 bytes per relationship
neostore.relationshipstore.db.mapped_memory=4G
# 38 bytes per property
neostore.propertystore.db.mapped_memory=200M
neostore.propertystore.db.strings.mapped_memory=500M
batch_array_separator=,
#batch_import.csv.quotes=true
#batch_import.csv.delim=,
batch_import.keep_db=true
#
batch_import.node_index.mynodeindex=exact
batch_import.node_index.id=exact
batch_import.node_index.node_auto_index=exact
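As a sanity check on these mapped_memory sizes (my own back-of-the-envelope arithmetic, using the per-record byte counts from the comments above and the 4M nodes / 100M relationships figures from earlier in this thread):

```python
# Rough sizing of the memory-mapped stores, using the per-record byte
# counts from the batch.properties comments above.
nodes = 4_000_000
rels = 100_000_000

node_store_mb = nodes * 14 / 1024**2   # ~53 MB, well under the 200M setting
rel_store_gb = rels * 33 / 1024**3     # ~3.1 GB, fits in the 4G setting

print(round(node_store_mb), "MB,", round(rel_store_gb, 1), "GB")
```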
P.s.
Once the db is loaded in Neo4j, are the constraints on properties and the
indexes already present?
I was trying to match a node in test.db, but could not find it with a simple
query:
MATCH (a {label:254782})-[r]-(b) RETURN r LIMIT 25
and the query takes a very long time to compute, which makes me suspect that
the indexes were not created properly.
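For what it's worth, this is the shape of query I would have expected to use an index (the label and property names here are my guesses, not something I verified against test.db):

```cypher
// Anchoring the MATCH on a label plus an indexed property, so the
// planner can use the index instead of scanning all nodes.
MATCH (a:MYNODES { id: 254782 })-[r]-(b)
RETURN r LIMIT 25
```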
Really thank you for your help!
On Tuesday, August 12, 2014 at 19:44:04 UTC+2, gg4u wrote:
>
> Hi Rik!
>
> ...in minutes?
>
> I'd like to understand how I could get closer to that result, though I
> will also try that library.
>
> It is kind of strange to me, because both when using the LOAD CSV
> functionality from the shell and when doing a transaction each time, I seem
> to run into a memory heap problem.
>
> Why should the batch import from the shell be so much slower than the
> batch-import script?
>
> Also, I see the importer is flexible enough, but my custom file (an
> adjacency list, to avoid redundancy) is more than 1GB; if I expand it into a
> csv full of redundant node-rel-neighbor1, node-rel-neighbor2 rows, it will
> be much, much bigger, and I am worried about whether it can be handled.
>
> A question:
> in rels.csv (https://github.com/jexp/batch-import/tree/20)
> I read that node ids start from 0.
>
> Are they temporary ids, or mandatory?
> E.g. what if I wanted to upload another subgraph into the same db with
> the batch importer (clearly without overriding the existing nodes)?
>
>
> On Tuesday, August 12, 2014 at 18:46:00 UTC+2, Rik Van Bruggen wrote:
>>
>> I think you should use the batch importer for this size of a graph. You
>> will be done in minutes, not hours.
>>
>> https://github.com/jexp/batch-import/tree/20
>>
>> Rik
>>
>> On Tuesday, August 12, 2014 5:13:39 PM UTC+1, gg4u wrote:
>>>
>>> Hello,
>>>
>>> here I am trying to upload a massive network:
>>> 4M nodes, 100M correlations.
>>>
>>> Having problems with memory and performance, I'd like to know if I am
>>> doing it right:
>>>
>>> 1.
>>> Before loading the correlations, I wanted to load the nodes.
>>>
>>> 2. Set up neo4j-wrapper and neo4j.properties as written in
>>> http://www.neo4j.org/graphgist?d788e117129c3730a042
>>>
>>> with the JVM heap set to 4096Mb.
>>>
>>> With this setting, the bulk load of 4M nodes failed.
>>>
>>> 3. Raised the min-heap and max-heap to 6144Mb and
>>> ran a test with 100K nodes.
>>>
>>> I got:
>>> Nodes created: 98991
>>> Properties set: 197982
>>> Labels added: 98991
>>> 3438685 ms
>>>
>>> Almost an hour to upload 100K nodes with two properties?
>>> I thought it would be much faster.
>>>
>>> Am I doing something wrong?
>>> This is the importer code I used:
>>>
>>> CREATE CONSTRAINT ON (n:MYNODES) ASSERT n.id IS UNIQUE;
>>> CREATE INDEX ON :MYNODES(name);
>>>
>>> USING PERIODIC COMMIT 1000
>>> LOAD CSV WITH HEADERS FROM 'file:///blablabla.csv' AS line
>>> FIELDTERMINATOR '\t'
>>> WITH line, toInt(line.topicId) AS id, line.name AS name LIMIT 100000
>>> MERGE (n:MYNODES { id: id, name: name });
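>>>
>>> A variant I might try (my own guess, untested): MERGE on the uniquely
>>> constrained id only and set name on create, so the lookup can use the
>>> unique constraint on :MYNODES(id) instead of matching on two properties:

```cypher
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM 'file:///blablabla.csv' AS line
FIELDTERMINATOR '\t'
WITH line, toInt(line.topicId) AS id, line.name AS name LIMIT 100000
// MERGE on the constrained property only; setting name afterwards lets
// the unique constraint on :MYNODES(id) drive the node lookup.
MERGE (n:MYNODES { id: id })
ON CREATE SET n.name = name;
```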
>>>
>>>
>>>
--
You received this message because you are subscribed to the Google Groups
"Neo4j" group.
To unsubscribe from this group and stop receiving emails from it, send an email
to [email protected].
For more options, visit https://groups.google.com/d/optout.