Here are the statistics from loading the data:
Nodes
[INPUT----------|NODE--------------------------------------------------------------|PROP|WRITER] 87M
Done in 15m 11s 91ms
Calculate dense nodes
[INPUT--------------|PREPARE(2)=============================================================|CA] 114M
Done in 18m 18s 880ms
Relationships
[INPUT--------------|PREPARE(2)=========================================================|REL] 114M 14M
Done in 18m 46s 226ms
Node first rel
[LINKER----------------------------------------------------------------------------------------] 84M
Done in 1m 1s 629ms
Relationship back link
[LINKER----------------------------------------------------------------------------------------] 113M
Done in 2m 9s 3ms
Node counts
[NODE COUNTS-----------------------------------------------------------------------------------] 75M
Done in 12s 906ms
Relationship counts
[RELATIONSHIP COUNTS---------------------------------------------------------------------------] 113M
Done in 38s 374ms
IMPORT DONE in 56m 25s 25ms

On Monday, December 15, 2014 8:53:34 PM UTC-8, mohsen wrote:

> With Michael's help, I could finally load the data using the batch-import command in Neo4j 2.2.0-M02. It took 56 minutes in total. The issue preventing Neo4j from loading the CSV files was having " in some values, which was interpreted as a quotation character to be included in the field value, and this messed up everything from that point forward.

On Friday, December 12, 2014 2:33:14 PM UTC-8, mohsen wrote:

> Michael, I sent you a separate email with credentials to access the CSV files. Thanks.

On Friday, December 12, 2014 12:26:51 PM UTC-8, mohsen wrote:

> Thanks for sharing the new version.
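The quoting problem described above (a stray " inside a field value being treated as the start of a quoted field) can be worked around by rewriting the CSV with properly doubled quotes before handing it to the import tool. A minimal sketch, with hypothetical file names, assuming the delimiter itself never occurs inside a field value (which is why the raw line can be split naively):

```python
def sanitize_line(line, delimiter=","):
    """Quote every field and double any embedded " characters (RFC 4180).

    Assumes the delimiter never appears inside a field value; if it can,
    a real CSV parser with a configurable quote character is needed instead.
    """
    fields = line.rstrip("\n").split(delimiter)
    return delimiter.join('"' + f.replace('"', '""') + '"' for f in fields)

def sanitize_file(src_path, dst_path):
    """Rewrite src_path line by line into dst_path with safe quoting."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            dst.write(sanitize_line(line) + "\n")
```

For example, `sanitize_line('a,b"c,d')` yields `'"a","b""c","d"'`, which a downstream parser reads back as the three values `a`, `b"c`, `d`.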
> Here is my memory info before running batch-import:
>
>     Mem:  18404972k total,   549848k used, 17855124k free,   12524k buffers
>     Swap:  4063224k total,        0k used,  4063224k free,  211284k cached
>
> I assigned 11G for the heap: export JAVA_OPTS="$JAVA_OPTS -Xmx11G"
>
> I started the batch-import at 11:13am; it is now 12:20pm and it seems to be stuck. Here is the log:
>
>     Nodes
>     [INPUT-------------------|NODE-------------------------------------------------|PROP|WRITER: W:] 86M
>     Done in 15m 21s 150ms
>     Calculate dense nodes
>     [INPUT---------|PREPARE(2)====================================================================|] 0
>
> And this is my memory info right now:
>
>     top - 12:22:43 up 1:34, 3 users, load average: 0.00, 0.00, 0.00
>     Tasks: 134 total, 1 running, 133 sleeping, 0 stopped, 0 zombie
>     Cpu(s): 0.3%us, 0.5%sy, 0.0%ni, 99.2%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
>     Mem:  18404972k total, 18244612k used,   160360k free,     6132k buffers
>     Swap:  4063224k total,        0k used,  4063224k free, 14089236k cached
>     PID  USER  PR  NI  VIRT   RES   SHR  S  %CPU  %MEM  TIME+     COMMAND
>     4496 root  20  0   7598m  3.4g  15m  S  3.3   19.4  20:35.88  java
>
> It has been stuck in Calculate dense nodes for more than 40 minutes. Should I wait, or should I kill the process?

On Friday, December 12, 2014 3:13:15 AM UTC-8, Michael Hunger wrote:

> Right, that's the problem with an RDF model that only uses relationships to represent properties: you won't get the performance that you would get with a real property-graph model.
>
> I'll share the version separately.
>
> Cheers, Michael

On Fri, Dec 12, 2014 at 12:07 PM, mohsen <[email protected]> wrote:

> I'd appreciate it if you could get me the newer version; I am already using 2.2.0-M01.
>
> I want to run some graph queries over my RDF.
> First, I loaded my data into the Virtuoso triple store (it took 2-3 hours), but could not get results for my SPARQL queries in a reasonable time. That is why I decided to load my data into Neo4j instead, to be able to run my queries.
>
> I am importing RDF into Neo4j only for a specific research problem. I need to extract some patterns from the RDF data, and I have to write queries that require some sort of graph traversal. I don't want to do reasoning over my RDF data. The graph structure is simple: nodes only have a Label (Uri or Literal) and a Value, and relationships don't have any properties.

On Friday, December 12, 2014 2:41:36 AM UTC-8, Michael Hunger wrote:

> Your ids are UUIDs, right? So 36 chars each: 36 chars -> 72 bytes, and Neo ids are longs w/ 8 bytes, so 80 bytes per entry, times 90M. You should allocate about 6G heap.
>
> Btw., importing RDF 1:1 into Neo4j is not a good idea in the first place. You should model a clean property-graph model and import INTO that model.
>
> As for the batch-import: it's a bug that has been fixed after the milestone; I'll try to get you a newer version to try.
>
> Cheers, Michael

On Fri, Dec 12, 2014 at 11:26 AM, mohsen <[email protected]> wrote:

> Thanks, Michael, for following up on my problem. In the Groovy script, the output was still on nodes. It is not feasible to use an enum for the relationship types: the types are URIs of ontology predicates coming from the CSV file, and there are many of them. However, I think the problem is that this script requires more than 10GB of heap, because it needs to store the nodes in memory (in a map) to use them later for creating relationships. So I guess even reducing the mmio mapping size won't solve the problem; I will try it tomorrow, though.
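Michael's back-of-the-envelope heap estimate above can be reproduced: a 36-character UUID costs about 72 bytes as a Java string payload (2 bytes per char), plus an 8-byte long for the Neo4j internal id, giving roughly 80 bytes per map entry. A rough sketch of the arithmetic (the per-entry constant deliberately ignores Java object headers and map overhead, so the real figure is somewhat higher):

```python
def id_map_heap_gib(num_nodes, key_chars=36, bytes_per_char=2, long_bytes=8):
    """Approximate heap (in GiB) to hold a node-key -> internal-id map.

    key_chars * bytes_per_char approximates the key's string payload,
    long_bytes the Neo4j internal id; object headers are ignored.
    """
    per_entry = key_chars * bytes_per_char + long_bytes  # 36*2 + 8 = 80 bytes
    return num_nodes * per_entry / 2**30

print(round(id_map_heap_gib(90_000_000), 1))  # 6.7 -> matches "about 6G heap"
```

For 90M nodes this gives about 6.7 GiB of raw payload, consistent with the "allocate about 6G heap" advice (plus headroom for object overhead and GC).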
> Regarding the batch-import command, do you have any idea why I am getting that error?

On Friday, December 12, 2014 1:40:56 AM UTC-8, Michael Hunger wrote:

> It would have been good if you had taken a thread dump from the Groovy script. But if you look at the memory:
>
>     off-heap = 2+2+1+1 => 6
>     heap     = 10
>
> That leaves nothing for the OS, and the heap probably GCs heavily as well. So you have to reduce the mmio mapping size.
>
> Was the output still on nodes or already on rels?
>
> Perhaps also replace DynamicRelationshipType.withName(line.Type) with an enum. You can also extend trace to output the number of nodes and rels.
>
> Would you be able to share your CSV files?
>
> Michael

On Fri, Dec 12, 2014 at 10:08 AM, mohsen <[email protected]> wrote:

> I could not load the data using Groovy either. I increased the Groovy heap size to 10G before running the script (using JAVA_OPTS). My machine has 16G of RAM. It halts after loading 41M rows from nodes.csv:
>
>     ....
>     41200000 rows 38431 ms
>     41300000 rows 50988 ms
>     41400000 rows 63747 ms
>     41500000 rows 112758 ms
>     41600000 rows 326497 ms
>
> After logging 41,600,000 rows, nothing happened. I waited 2 hours and there was no progress. The process was still using CPU, but there was no free memory at that point, which I suspect is the reason. I have attached my Groovy script, where you can find the memory configuration. I guess something goes wrong with memory, since it stopped when all my system's memory was used.
> I then switched back to the batch-import tool with stacktraces enabled. I think the error I got last time was due to too small a heap, because I did not get it this time (after allocating a 10GB heap). Anyway, I have exactly 86983375 nodes and it could load the nodes this time, but I got another error:
>
>     Nodes
>     [INPUT-------------|ENCODER-----------------------------------------|WRITER] 86M
>     Calculate dense nodes
>     Import error: InputRelationship:
>        properties: []
>        startNode: file:///Users/mohsen/Desktop/Music%20RDF/echonest/analyze-example.rdf#signal
>        endNode: 82A4CB6E-7250-1634-DBB8-0297C5259BB1
>        type: http://purl.org/ontology/echonest/beatVariance
>     specified start node that hasn't been imported
>     java.lang.RuntimeException: InputRelationship:
>        properties: []
>        startNode: file:///Users/mohsen/Desktop/Music%20RDF/echonest/analyze-example.rdf#signal
>        endNode: 82A4CB6E-7250-1634-DBB8-0297C5259BB1
>        type: http://purl.org/ontology/echonest/beatVariance
>     specified start node that hasn't been imported
>         at org.neo4j.unsafe.impl.batchimport.staging.StageExecution.stillExecuting(StageExecution.java:54)
>         at org.neo4j.unsafe.impl.batchimport.staging.PollingExecutionMonitor.anyStillExecuting(PollingExecutionMonitor.java:71)
>         at org.neo4j.unsafe.impl.batchimport.staging.PollingExecutionMonitor.finishAwareSleep(PollingExecutionMonitor.java:94)
>         at org.neo4j.unsafe.impl.batchimport.staging.PollingExecutionMonitor.monitor(PollingExecutionMonitor.java:62)
>         at org.neo4j.unsafe.impl.batchimport.ParallelBatchImporter.executeStages(ParallelBatchImporter.java:221)
>         at org.neo4j.unsafe.impl.batchimport.ParallelBatchImporter.doImport(ParallelBatchImporter.java:139)
>         at org.neo4j.tooling.ImportTool.main(ImportTool.java:212)
>     Caused by: org.neo4j.unsafe.impl.batchimport.input.InputException: InputRelationship:
>        properties: []
>        startNode: file:///Users/mohsen/Desktop/Music%20RDF/echonest/analyze-example.rdf#signal
>        endNode: 82A4CB6E-7250-1634-DBB8-0297C5259BB1
>        type: http://purl.org/ontology/echonest/beatVariance
>     specified start node that hasn't been imported
>         at org.neo4j.unsafe.impl.batchimport.CalculateDenseNodesStep.ensureNodeFound(CalculateDenseNodesStep.java:95)
>         at org.neo4j.unsafe.impl.batchimport.CalculateDenseNodesStep.process(CalculateDenseNodesStep.java:61)
>         at org.neo4j.unsafe.impl.batchimport.CalculateDenseNodesStep.process(CalculateDenseNodesStep.java:38)
>         at org.neo4j.unsafe.impl.batchimport.staging.ExecutorServiceStep$2.run(ExecutorServiceStep.java:81)
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:745)
>         at org.neo4j.helpers.NamedThreadFactory$2.run(NamedThreadFactory.java:99)
>
> It seems that it cannot find the start and end nodes of a relationship. However, both nodes exist in nodes.csv (I did a grep to be sure), so I don't know what is going wrong. Do you have any idea? Could it be related to the id of the start node, "file:///Users/mohsen/Desktop/Music%20RDF/echonest/analyze-example.rdf#signal"?

On Thursday, December 11, 2014 10:02:05 PM UTC-8, Michael Hunger wrote:

> The Groovy one should work fine too. I wanted to augment the post with one that has @CompileStatic so that it's faster.
>
> I'd also be interested in the --stacktraces output of the batch-import tool of Neo4j 2.2; perhaps you can let it run overnight or in the background.
>
> Cheers, Michael

On Fri, Dec 12, 2014 at 3:34 AM, mohsen <[email protected]> wrote:

> I guess the core code for both batch-import and LOAD CSV is the same, so why do you think running it from Cypher (rather than through batch-import) would help? I am trying Groovy and the batch-inserter <https://gist.github.com/jexp/0617412dcdd644fd520b#file-import_kaggle-groovy> now; I will post how it goes.

On Thursday, December 11, 2014 5:44:36 AM UTC-8, Andrii Stesin wrote:

> I'd suggest you take a look at the last 5-7 posts in this recent thread <https://groups.google.com/forum/#!topic/neo4j/jSFtnD5OHxg>.
> Basically, you don't need any "batch import" command; I'd suggest you just use the plain LOAD CSV functionality from Cypher, and you will fill your database step by step.
>
> WBR,
> Andrii

--
You received this message because you are subscribed to the Google Groups "Neo4j" group. To unsubscribe from this group and stop receiving emails from it, send an email to [email protected]. For more options, visit https://groups.google.com/d/optout.
