Here are the statistics from loading the data:
Nodes
[INPUT----------|NODE--------------------------------------------------------------|PROP|WRITER] 87M
Done in 15m 11s 91ms
Calculate dense nodes
[INPUT--------------|PREPARE(2)=============================================================|CA] 114M
Done in 18m 18s 880ms
Relationships
[INPUT--------------|PREPARE(2)=========================================================|REL] 114M 14M
Done in 18m 46s 226ms
Node first rel
[LINKER----------------------------------------------------------------------------------------] 84M
Done in 1m 1s 629ms
Relationship back link
[LINKER----------------------------------------------------------------------------------------] 113M
Done in 2m 9s 3ms
Node counts
[NODE COUNTS-----------------------------------------------------------------------------------] 75M
Done in 12s 906ms
Relationship counts
[RELATIONSHIP COUNTS---------------------------------------------------------------------------] 113M
Done in 38s 374ms
IMPORT DONE in 56m 25s 25ms

On Monday, December 15, 2014 8:53:34 PM UTC-8, mohsen wrote:

> With Michael's help, I could finally load the data using the batch-import command in Neo4j 2.2.0-M02. It took 56 minutes in total. The issue preventing Neo4j from loading the CSV files was having " in some values, which was interpreted as a quotation character to be included in the field value, and this messed up everything from that point forward.

On Friday, December 12, 2014 2:33:14 PM UTC-8, mohsen wrote:

> Michael, I sent you a separate email with credentials to access the CSV files. Thanks.

On Friday, December 12, 2014 12:26:51 PM UTC-8, mohsen wrote:

> Thanks for sharing the new version.
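The quoting problem described above (a stray " inside a field value being treated as the start of a quoted field) can be worked around by rewriting the CSV with properly doubled quotes before handing it to the import tool. A minimal sketch, with hypothetical file names, assuming the delimiter itself never occurs inside a field value (which is why the raw line can be split naively):

```python
def sanitize_line(line, delimiter=","):
    """Quote every field and double any embedded " characters (RFC 4180).

    Assumes the delimiter never appears inside a field value; if it can,
    a real CSV parser with a configurable quote character is needed instead.
    """
    fields = line.rstrip("\n").split(delimiter)
    return delimiter.join('"' + f.replace('"', '""') + '"' for f in fields)

def sanitize_file(src_path, dst_path):
    """Rewrite src_path line by line into dst_path with safe quoting."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            dst.write(sanitize_line(line) + "\n")
```

For example, `sanitize_line('a,b"c,d')` yields `'"a","b""c","d"'`, which a downstream parser reads back as the three values `a`, `b"c`, `d`.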
> Here is my memory info before running batch-import:
>
>     Mem:  18404972k total,   549848k used, 17855124k free,   12524k buffers
>     Swap:  4063224k total,        0k used,  4063224k free,  211284k cached
>
> I assigned 11G for the heap: export JAVA_OPTS="$JAVA_OPTS -Xmx11G"
>
> I started the batch-import at 11:13am; it is now 12:20pm and it seems to be stuck. Here is the log:
>
>     Nodes
>     [INPUT-------------------|NODE-------------------------------------------------|PROP|WRITER: W:] 86M
>     Done in 15m 21s 150ms
>     Calculate dense nodes
>     [INPUT---------|PREPARE(2)====================================================================|] 0
>
> And this is my memory info right now:
>
>     top - 12:22:43 up 1:34, 3 users, load average: 0.00, 0.00, 0.00
>     Tasks: 134 total, 1 running, 133 sleeping, 0 stopped, 0 zombie
>     Cpu(s): 0.3%us, 0.5%sy, 0.0%ni, 99.2%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
>     Mem:  18404972k total, 18244612k used,   160360k free,     6132k buffers
>     Swap:  4063224k total,        0k used,  4063224k free, 14089236k cached
>     PID  USER  PR  NI  VIRT   RES   SHR  S  %CPU  %MEM  TIME+     COMMAND
>     4496 root  20  0   7598m  3.4g  15m  S  3.3   19.4  20:35.88  java
>
> It has been stuck in Calculate dense nodes for more than 40 minutes. Should I wait, or should I kill the process?

On Friday, December 12, 2014 3:13:15 AM UTC-8, Michael Hunger wrote:

> Right, that's the problem with an RDF model that only uses relationships to represent properties: you won't get the performance that you would get with a real property-graph model.
>
> I'll share the version separately.
>
> Cheers, Michael

On Fri, Dec 12, 2014 at 12:07 PM, mohsen <[email protected]> wrote:

> I'd appreciate it if you could get me the newer version; I am already using 2.2.0-M01.
>
> I want to run some graph queries over my RDF.
> First, I loaded my data into the Virtuoso triple store (it took 2-3 hours), but could not get results for my SPARQL queries in a reasonable time. That is why I decided to load my data into Neo4j instead, to be able to run my queries.
>
> I am importing RDF into Neo4j only for a specific research problem. I need to extract some patterns from the RDF data, and I have to write queries that require some sort of graph traversal. I don't want to do reasoning over my RDF data. The graph structure is simple: nodes only have a Label (Uri or Literal) and a Value, and relationships don't have any properties.

On Friday, December 12, 2014 2:41:36 AM UTC-8, Michael Hunger wrote:

> Your ids are UUIDs, right? So 36 chars each: 36 chars -> 72 bytes, and Neo ids are longs w/ 8 bytes, so 80 bytes per entry, times 90M. You should allocate about 6G heap.
>
> Btw., importing RDF 1:1 into Neo4j is not a good idea in the first place. You should model a clean property-graph model and import INTO that model.
>
> As for the batch-import: it's a bug that has been fixed after the milestone; I'll try to get you a newer version to try.
>
> Cheers, Michael

On Fri, Dec 12, 2014 at 11:26 AM, mohsen <[email protected]> wrote:

> Thanks, Michael, for following up on my problem. In the Groovy script, the output was still on nodes. It is not feasible to use an enum for the relationship types: the types are URIs of ontology predicates coming from the CSV file, and there are many of them. However, I think the problem is that this script requires more than 10GB of heap, because it needs to store the nodes in memory (in a map) to use them later for creating relationships. So I guess even reducing the mmio mapping size won't solve the problem; I will try it tomorrow, though.
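Michael's back-of-the-envelope heap estimate above can be reproduced: a 36-character UUID costs about 72 bytes as a Java string payload (2 bytes per char), plus an 8-byte long for the Neo4j internal id, giving roughly 80 bytes per map entry. A rough sketch of the arithmetic (the per-entry constant deliberately ignores Java object headers and map overhead, so the real figure is somewhat higher):

```python
def id_map_heap_gib(num_nodes, key_chars=36, bytes_per_char=2, long_bytes=8):
    """Approximate heap (in GiB) to hold a node-key -> internal-id map.

    key_chars * bytes_per_char approximates the key's string payload,
    long_bytes the Neo4j internal id; object headers are ignored.
    """
    per_entry = key_chars * bytes_per_char + long_bytes  # 36*2 + 8 = 80 bytes
    return num_nodes * per_entry / 2**30

print(round(id_map_heap_gib(90_000_000), 1))  # 6.7 -> matches "about 6G heap"
```

For 90M nodes this gives about 6.7 GiB of raw payload, consistent with the "allocate about 6G heap" advice (plus headroom for object overhead and GC).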
> Regarding the batch-import command, do you have any idea why I am getting that error?

On Friday, December 12, 2014 1:40:56 AM UTC-8, Michael Hunger wrote:

> It would have been good if you had taken a thread dump from the Groovy script. But if you look at the memory:
>
>     off-heap = 2+2+1+1 => 6
>     heap     = 10
>
> That leaves nothing for the OS, and the heap probably GCs heavily as well. So you have to reduce the mmio mapping size.
>
> Was the output still on nodes or already on rels?
>
> Perhaps also replace DynamicRelationshipType.withName(line.Type) with an enum. You can also extend trace to output the number of nodes and rels.
>
> Would you be able to share your CSV files?
>
> Michael

On Fri, Dec 12, 2014 at 10:08 AM, mohsen <[email protected]> wrote:

> I could not load the data using Groovy either. I increased the Groovy heap size to 10G before running the script (using JAVA_OPTS). My machine has 16G of RAM. It halts after loading 41M rows from nodes.csv:
>
>     ....
>     41200000 rows 38431 ms
>     41300000 rows 50988 ms
>     41400000 rows 63747 ms
>     41500000 rows 112758 ms
>     41600000 rows 326497 ms
>
> After logging 41,600,000 rows, nothing happened. I waited 2 hours and there was no progress. The process was still using CPU, but there was no free memory at that point, which I suspect is the reason. I have attached my Groovy script, where you can find the memory configuration. I guess something goes wrong with memory, since it stopped when all my system's memory was used.
> I then switched back to the batch-import tool with stacktraces enabled. I think the error I got last time was due to too small a heap, because I did not get it this time (after allocating a 10GB heap). Anyway, I have exactly 86983375 nodes and it could load the nodes this time, but I got another error:
>
>     Nodes
>     [INPUT-------------|ENCODER-----------------------------------------|WRITER] 86M
>     Calculate dense nodes
>     Import error: InputRelationship:
>        properties: []
>        startNode: file:///Users/mohsen/Desktop/Music%20RDF/echonest/analyze-example.rdf#signal
>        endNode: 82A4CB6E-7250-1634-DBB8-0297C5259BB1
>        type: http://purl.org/ontology/echonest/beatVariance
>     specified start node that hasn't been imported
>     java.lang.RuntimeException: InputRelationship:
>        properties: []
>        startNode: file:///Users/mohsen/Desktop/Music%20RDF/echonest/analyze-example.rdf#signal
>        endNode: 82A4CB6E-7250-1634-DBB8-0297C5259BB1
>        type: http://purl.org/ontology/echonest/beatVariance
>     specified start node that hasn't been imported
>         at org.neo4j.unsafe.impl.batchimport.staging.StageExecution.stillExecuting(StageExecution.java:54)
>         at org.neo4j.unsafe.impl.batchimport.staging.PollingExecutionMonitor.anyStillExecuting(PollingExecutionMonitor.java:71)
>         at org.neo4j.unsafe.impl.batchimport.staging.PollingExecutionMonitor.finishAwareSleep(PollingExecutionMonitor.java:94)
>         at org.neo4j.unsafe.impl.batchimport.staging.PollingExecutionMonitor.monitor(PollingExecutionMonitor.java:62)
>         at org.neo4j.unsafe.impl.batchimport.ParallelBatchImporter.executeStages(ParallelBatchImporter.java:221)
>         at org.neo4j.unsafe.impl.batchimport.ParallelBatchImporter.doImport(ParallelBatchImporter.java:139)
>         at org.neo4j.tooling.ImportTool.main(ImportTool.java:212)
>     Caused by: org.neo4j.unsafe.impl.batchimport.input.InputException: InputRelationship:
>        properties: []
>        startNode: file:///Users/mohsen/Desktop/Music%20RDF/echonest/analyze-example.rdf#signal
>        endNode: 82A4CB6E-7250-1634-DBB8-0297C5259BB1
>        type: http://purl.org/ontology/echonest/beatVariance
>     specified start node that hasn't been imported
>         at org.neo4j.unsafe.impl.batchimport.CalculateDenseNodesStep.ensureNodeFound(CalculateDenseNodesStep.java:95)
>         at org.neo4j.unsafe.impl.batchimport.CalculateDenseNodesStep.process(CalculateDenseNodesStep.java:61)
>         at org.neo4j.unsafe.impl.batchimport.CalculateDenseNodesStep.process(CalculateDenseNodesStep.java:38)
>         at org.neo4j.unsafe.impl.batchimport.staging.ExecutorServiceStep$2.run(ExecutorServiceStep.java:81)
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:745)
>         at org.neo4j.helpers.NamedThreadFactory$2.run(NamedThreadFactory.java:99)
>
> It seems that it cannot find the start and end nodes of a relationship. However, both nodes exist in nodes.csv (I did a grep to be sure), so I don't know what is going wrong. Do you have any idea? Could it be related to the id of the start node, "file:///Users/mohsen/Desktop/Music%20RDF/echonest/analyze-example.rdf#signal"?

On Thursday, December 11, 2014 10:02:05 PM UTC-8, Michael Hunger wrote:

> The Groovy one should work fine too. I wanted to augment the post with one that has @CompileStatic so that it's faster.
>
> I'd also be interested in the --stacktraces output of the batch-import tool of Neo4j 2.2; perhaps you can let it run overnight or in the background.
>
> Cheers, Michael

On Fri, Dec 12, 2014 at 3:34 AM, mohsen <[email protected]> wrote:

> I guess the core code for both batch-import and LOAD CSV is the same, so why do you think running it from Cypher (rather than through batch-import) would help? I am trying Groovy and the batch-inserter <https://gist.github.com/jexp/0617412dcdd644fd520b#file-import_kaggle-groovy> now; I will post how it goes.

On Thursday, December 11, 2014 5:44:36 AM UTC-8, Andrii Stesin wrote:

> I'd suggest you take a look at the last 5-7 posts in this recent thread <https://groups.google.com/forum/#!topic/neo4j/jSFtnD5OHxg>.
> Basically, you don't need any "batch import" command; I'd suggest you just use the plain LOAD CSV functionality from Cypher, and you will fill your database step by step.
>
> WBR,
> Andrii

--
You received this message because you are subscribed to the Google Groups "Neo4j" group. To unsubscribe from this group and stop receiving emails from it, send an email to [email protected]. For more options, visit https://groups.google.com/d/optout.
