Re: [Neo4j] LOAD CSV takes over an hour

Aram Chung Wed, 05 Mar 2014 07:50:07 -0800

Wow this is great! I'll definitely try what you did. Please expect 
questions along the way.


And a write-up is coming; I was thinking I'd do that as soon as I get some 
relationships in, but now I should probably make a post about LOAD CSV. 
I'll post a link when I do.

Thanks!
Aram


On Wednesday, March 5, 2014 6:48:34 AM UTC-5, Michael Hunger wrote:
>
> Oh and btw. I would LOVE to see a blog post from you about what you're 
> working on! 
>
> Thanks so much 
>
> Michael 
>
> On 05.03.2014 at 12:00, Michael Hunger <[email protected]> wrote: 
>
> > I just tested your file on MacOS with these settings 
> > and got 6:30 for the 2m rows 
> > 
> > EXTRA_JVM_ARGUMENTS="-Xmx6G -Xms6G -Xmn1G" 
> > 
> > On Windows you have to add the memory from the mmio settings in neo4j.properties to the heap. 
> > 
> > cat conf/neo4j.properties 
> > # Default values for the low-level graph engine 
> > neostore.nodestore.db.mapped_memory=200M 
> > neostore.relationshipstore.db.mapped_memory=1G 
> > neostore.propertystore.db.mapped_memory=500M 
> > neostore.propertystore.db.strings.mapped_memory=250M 
> > neostore.propertystore.db.arrays.mapped_memory=0M 
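A rough sketch of the Windows rule mentioned above (heap must also cover the memory-mapped I/O settings), using the values from this neo4j.properties listing. The helper names, the 6G base heap taken from the EXTRA_JVM_ARGUMENTS line, and the reading of 1G as 1024M are assumptions for illustration, not Neo4j behaviour confirmed in this thread.

```python
import re

# Mapped-memory settings copied from the neo4j.properties listing above.
PROPERTIES = """
neostore.nodestore.db.mapped_memory=200M
neostore.relationshipstore.db.mapped_memory=1G
neostore.propertystore.db.mapped_memory=500M
neostore.propertystore.db.strings.mapped_memory=250M
neostore.propertystore.db.arrays.mapped_memory=0M
"""

# Assumption: binary units (1G = 1024M).
UNIT_MB = {"K": 1 / 1024, "M": 1, "G": 1024}

SETTING = re.compile(r"\s*[\w.]+\.mapped_memory\s*=\s*(\d+)([KMG])\s*$", re.IGNORECASE)


def mapped_memory_mb(properties_text):
    """Sum every *.mapped_memory value in the text, in megabytes."""
    total = 0.0
    for line in properties_text.splitlines():
        match = SETTING.match(line)
        if match:
            total += int(match.group(1)) * UNIT_MB[match.group(2).upper()]
    return total


def windows_heap_mb(base_heap_mb, properties_text):
    """Hypothetical helper: base heap plus all mapped memory, in megabytes."""
    return base_heap_mb + mapped_memory_mb(properties_text)
```

Under these assumptions the mapped memory adds a little under 2G on top of the base heap, so a Windows box would need roughly 8G of heap to mirror this setup.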
> > 
> > USING PERIODIC COMMIT 10000 
> >> LOAD CSV 
> >>  FROM "file:///Users/mh/Downloads/Active_Corporations___Beginning_1800_no_head.csv" 
> >>  AS company 
> >> CREATE (:DataActiveCorporations 
> >>    { 
> >>        DOS_ID:company[0], 
> >>        Current_Entity_Name:company[1], 
> >>        Initial_DOS_Filing_Date:company[2], 
> > ...... 
> >>        Registered_Agent_Zip:company[23], 
> >> 
> >>        Location_Name:company[24], 
> >>        Location_Address_1:company[25], 
> >>        Location_Address_2:company[26], 
> >>        Location_City:company[27], 
> >>        Location_State:company[28], 
> >>        Location_Zip:company[29] 
> >>    } 
> >> ); 
> > 
> > +-------------------+ 
> > | No data returned. | 
> > +-------------------+ 
> > Nodes created: 1964486 
> > Properties set: 58934580 
> > Labels added: 1964486 
> > 391059 ms 
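To put the two timings in this thread side by side, here is a small back-of-the-envelope calculation. All figures come from the two result blocks (this run and the original run quoted further down); the function and variable names are only illustrative. The runs differ in machine, OS, heap size and commit size, so the ratio is indicative only.

```python
def rows_per_second(rows, elapsed_ms):
    """Average import throughput for a run."""
    return rows / (elapsed_ms / 1000.0)


ROWS = 1964486            # nodes created in both runs
PROPERTIES_SET = 58934580 # properties set in both runs
ORIGINAL_MS = 4550855     # original run: PERIODIC COMMIT 500, Windows 7
TUNED_MS = 391059         # this run: PERIODIC COMMIT 10000, MacOS, 6G heap

original_rate = rows_per_second(ROWS, ORIGINAL_MS)
tuned_rate = rows_per_second(ROWS, TUNED_MS)
speedup = ORIGINAL_MS / TUNED_MS
properties_per_node = PROPERTIES_SET / ROWS

print("original: %.0f rows/s, %.1f min" % (original_rate, ORIGINAL_MS / 60000.0))
print("tuned:    %.0f rows/s, %.1f min" % (tuned_rate, TUNED_MS / 60000.0))
print("speedup:  %.1fx, %d properties per node" % (speedup, properties_per_node))
```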
> > 
> > 
> > 
> > On 05.03.2014 at 08:34, Michael Hunger <[email protected]> wrote: 
> > 
> >> Oh, and if you use neo4j-shell without a server, you have to set the heap in bin\Neo4jShell.bat via EXTRA_JVM_ARGUMENTS="-Xmx4G -Xms4G -Xmn1G" 
> >> 
> >> and call 
> >> 
> >> bin\Neo4jShell -conf conf\neo4j.properties -path data\graph.db 
> >> 
> >> On 05.03.2014 at 08:29, Michael Hunger <[email protected]> wrote: 
> >> 
> >>> Yep, 
> >>> 
> >>> it would also be interesting to know how you ran this. With neo4j-shell? Against a running server? 
> >>> Did you configure any RAM or memory mapping settings in neo4j.properties? 
> >>> 
> >>> Check out this blog post for some hints on memory config: http://blog.bruggen.com/2014/02/some-neo4j-import-tweaks-what-and-where.html?view=sidebar 
> >>> Note that on Windows the heap settings include the mmio settings, unlike on other OSes. 
> >>> 
> >>> Michael 
> >>> 
> >>> On 04.03.2014 at 17:22, Mark Needham <[email protected]> wrote: 
> >>> 
> >>>> Hi Aram, 
> >>>> 
> >>>> * Do you have any other information on the spec of the machine you're running this on, e.g. how much RAM? 
> >>>> * Have you tried upping the PERIODIC COMMIT value? Perhaps try it out with a smaller subset of the data to measure the impact, e.g. with values of 1,000 and 10,000. 
> >>>> * I think it would be interesting to pull out some other things as nodes as well, which might lead to more interesting queries. E.g. CEO, Location, Registered Agent, DOS Process, Jurisdiction could all be nodes that link back to a DOS. 
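For the "smaller subset" suggestion in the list above, one possible way to cut a sample file is sketched here. The function name, the file names in the usage comment and the UTF-8 default are assumptions; the csv module is used instead of plain line slicing so that quoted fields containing line breaks stay intact.

```python
import csv
from itertools import islice


def write_sample(source_path, sample_path, max_rows, encoding="utf-8"):
    """Copy at most max_rows CSV records to a new file; return the count written."""
    count = 0
    with open(source_path, newline="", encoding=encoding) as src, \
         open(sample_path, "w", newline="", encoding=encoding) as dst:
        writer = csv.writer(dst)
        for row in islice(csv.reader(src), max_rows):
            writer.writerow(row)
            count += 1
    return count


# Hypothetical usage: take the first 100,000 rows, then time LOAD CSV against
# the sample with different PERIODIC COMMIT values.
# write_sample("corporations.csv", "corporations_sample.csv", 100000)
```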
> >>>> 
> >>>> Let me know if any of that doesn't make sense. 
> >>>> Mark 
> >>>> 
> >>>> 
> >>>> On 4 March 2014 15:54, Aram Chung <[email protected]> wrote: 
> >>>> Hi, 
> >>>> 
> >>>> I was asked to post this here by Mark Needham (@markhneedham) who thought my query took longer than it should. 
> >>>> 
> >>>> I'm trying to see how graph databases could be used in investigative journalism: I was loading in New York State's "Active Corporations: Beginning 1800" data from https://data.ny.gov/Economic-Development/Active-Corporations-Beginning-1800/n9v6-gdp6 as a 1964486-row csv (and deleted all U+F8FF characters, because I was getting "[null] is not a supported property value"). The Cypher query I used was 
> >>>> 
> >>>> USING PERIODIC COMMIT 500 
> >>>> LOAD CSV 
> >>>>  FROM "file://path/to/csv/Active_Corporations___Beginning_1800__without_header__wonky_characters_fixed.csv" 
> >>>>  AS company 
> >>>> CREATE (:DataActiveCorporations 
> >>>>         { 
> >>>>                 DOS_ID:company[0], 
> >>>>                 Current_Entity_Name:company[1], 
> >>>>                 Initial_DOS_Filing_Date:company[2], 
> >>>>                 County:company[3], 
> >>>>                 Jurisdiction:company[4], 
> >>>>                 Entity_Type:company[5], 
> >>>> 
> >>>>                 DOS_Process_Name:company[6], 
> >>>>                 DOS_Process_Address_1:company[7], 
> >>>>                 DOS_Process_Address_2:company[8], 
> >>>>                 DOS_Process_City:company[9], 
> >>>>                 DOS_Process_State:company[10], 
> >>>>                 DOS_Process_Zip:company[11], 
> >>>> 
> >>>>                 CEO_Name:company[12], 
> >>>>                 CEO_Address_1:company[13], 
> >>>>                 CEO_Address_2:company[14], 
> >>>>                 CEO_City:company[15], 
> >>>>                 CEO_State:company[16], 
> >>>>                 CEO_Zip:company[17], 
> >>>> 
> >>>>                 Registered_Agent_Name:company[18], 
> >>>>                 Registered_Agent_Address_1:company[19], 
> >>>>                 Registered_Agent_Address_2:company[20], 
> >>>>                 Registered_Agent_City:company[21], 
> >>>>                 Registered_Agent_State:company[22], 
> >>>>                 Registered_Agent_Zip:company[23], 
> >>>> 
> >>>>                 Location_Name:company[24], 
> >>>>                 Location_Address_1:company[25], 
> >>>>                 Location_Address_2:company[26], 
> >>>>                 Location_City:company[27], 
> >>>>                 Location_State:company[28], 
> >>>>                 Location_Zip:company[29] 
> >>>>         } 
> >>>> ); 
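The clean-up step mentioned before the query (deleting all U+F8FF characters) is not shown in the thread. A minimal sketch of one way to do it follows; the function name, file names and the UTF-8 encoding are assumptions, and the original poster may well have used a different tool.

```python
PRIVATE_USE_CHAR = "\uf8ff"  # the U+F8FF character named in the post


def strip_private_use_char(source_path, cleaned_path, encoding="utf-8"):
    """Write a copy of the file without U+F8FF; return how many were removed."""
    removed = 0
    with open(source_path, newline="", encoding=encoding) as src, \
         open(cleaned_path, "w", newline="", encoding=encoding) as dst:
        for line in src:
            removed += line.count(PRIVATE_USE_CHAR)
            dst.write(line.replace(PRIVATE_USE_CHAR, ""))
    return removed


# Hypothetical usage:
# strip_private_use_char("corporations_raw.csv", "corporations_fixed.csv")
```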
> >>>> 
> >>>> Each row is one node so it's as close to the raw data as possible. The idea is loosely that these nodes will be linked with new nodes representing people and addresses verified by reporters. 
> >>>> 
> >>>> This is what I got: 
> >>>> 
> >>>> +-------------------+ 
> >>>> | No data returned. | 
> >>>> +-------------------+ 
> >>>> Nodes created: 1964486 
> >>>> Properties set: 58934580 
> >>>> Labels added: 1964486 
> >>>> 4550855 ms 
> >>>> 
> >>>> Some context information: 
> >>>> Neo4j Milestone Release 2.1.0-M01 
> >>>> Windows 7 
> >>>> java version "1.7.0_03" 
> >>>> 
> >>>> Best, 
> >>>> Aram 
> >>>> 
> >>>> -- 
> >>>> You received this message because you are subscribed to the Google Groups "Neo4j" group. 
> >>>> To unsubscribe from this group and stop receiving emails from it, send an email to [email protected]. 
> >>>> For more options, visit https://groups.google.com/groups/opt_out. 
> >>>> 
> >>>> 
> >>> 
> >> 
> > 
>
>

