Re: [Neo4j] LOAD CSV takes over an hour

Michael Hunger Wed, 05 Mar 2014 03:00:22 -0800

I just tested your file on Mac OS with the settings below
and got 6:30 (minutes) for the 2M rows.


EXTRA_JVM_ARGUMENTS="-Xmx6G -Xms6G -Xmn1G"

On Windows you have to add the memory from the mmio (memory-mapped I/O)
settings in neo4j.properties to the heap.

cat conf/neo4j.properties 
# Default values for the low-level graph engine
neostore.nodestore.db.mapped_memory=200M
neostore.relationshipstore.db.mapped_memory=1G
neostore.propertystore.db.mapped_memory=500M
neostore.propertystore.db.strings.mapped_memory=250M
neostore.propertystore.db.arrays.mapped_memory=0M
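
To make the Windows note concrete, here is a minimal sketch that adds up the mapped_memory values from the listing above. That sum is the amount the heap would have to grow by on Windows. The helper name mapped_memory_mb and the assumption that 1G means 1024M are mine, not something the Neo4j docs are quoted on here.

```python
import re

# Mapped-memory settings copied from the conf/neo4j.properties listing above.
PROPERTIES = """\
# Default values for the low-level graph engine
neostore.nodestore.db.mapped_memory=200M
neostore.relationshipstore.db.mapped_memory=1G
neostore.propertystore.db.mapped_memory=500M
neostore.propertystore.db.strings.mapped_memory=250M
neostore.propertystore.db.arrays.mapped_memory=0M
"""

# Assumption: binary units, i.e. 1G = 1024M.
UNIT_MB = {"K": 1 / 1024, "M": 1, "G": 1024}

SETTING = re.compile(r"^\S+\.mapped_memory\s*=\s*(\d+)([KMG])$", re.IGNORECASE)


def mapped_memory_mb(text):
    """Sum every *.mapped_memory setting in the given properties text, in MB."""
    total = 0.0
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        match = SETTING.match(line)
        if match:
            total += int(match.group(1)) * UNIT_MB[match.group(2).upper()]
    return total


total_mb = mapped_memory_mb(PROPERTIES)
print(f"memory-mapped total: {total_mb:.0f} MB")
```

With these values the total comes to 1974 MB, so on Windows the -Xmx figure would need roughly 2G on top of whatever heap the import itself requires.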

USING PERIODIC COMMIT 10000
> LOAD CSV
>   FROM 
> "file:///Users/mh/Downloads/Active_Corporations___Beginning_1800_no_head.csv"
>   AS company
> CREATE (:DataActiveCorporations
>     {
>         DOS_ID:company[0],
>         Current_Entity_Name:company[1],
>         Initial_DOS_Filing_Date:company[2],
......
>         Registered_Agent_Zip:company[23],
> 
>         Location_Name:company[24],
>         Location_Address_1:company[25],
>         Location_Address_2:company[26],
>         Location_City:company[27],
>         Location_State:company[28],
>         Location_Zip:company[29]
>     }
> );

+-------------------+
| No data returned. |
+-------------------+
Nodes created: 1964486
Properties set: 58934580
Labels added: 1964486
391059 ms
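
For comparison, a small sketch that converts the two reported timings (391059 ms here, 4550855 ms in the original post quoted below) into minutes and rows per second. The function name summarize is hypothetical. The two runs differ in machine, OS, heap settings and PERIODIC COMMIT size, so the ratio is only a rough indication, not a controlled benchmark.

```python
ROWS = 1964486      # "Nodes created" reported by both runs
FAST_MS = 391059    # this run: PERIODIC COMMIT 10000, 6G heap, Mac OS
SLOW_MS = 4550855   # original run: PERIODIC COMMIT 500, Windows 7


def summarize(ms, rows=ROWS):
    """Return (minutes, seconds, rows_per_second) for an elapsed time in ms."""
    seconds = ms / 1000
    minutes, secs = divmod(round(seconds), 60)
    return minutes, secs, rows / seconds


for label, ms in (("this run", FAST_MS), ("original run", SLOW_MS)):
    minutes, secs, rate = summarize(ms)
    print(f"{label}: {minutes}:{secs:02d} min, about {rate:.0f} rows/s")

speedup = SLOW_MS / FAST_MS
print(f"ratio: about {speedup:.1f}x")
```

That works out to roughly 6:31 against 75:51, so the original load was more than ten times slower.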



On 05.03.2014 at 08:34, Michael Hunger 
<[email protected]> wrote:

> Oh, and if you use neo4j-shell without a server, you have to set the heap in 
> bin\Neo4jShell.bat via EXTRA_JVM_ARGUMENTS="-Xmx4G -Xms4G -Xmn1G"
> 
> and call 
> 
> bin\Neo4jShell -conf conf\neo4j.properties -path data\graph.db
> 
> On 05.03.2014 at 08:29, Michael Hunger 
> <[email protected]> wrote:
> 
>> Yep,
>> 
>> It would also be interesting to know how you ran this. With neo4j-shell? 
>> Against a running server?
>> Did you configure any RAM or memory-mapping settings in neo4j.properties?
>> 
>> Check out this blog post for some hints on memory config: 
>> http://blog.bruggen.com/2014/02/some-neo4j-import-tweaks-what-and-where.html?view=sidebar
>> Note that on Windows the heap settings include the mmio settings, unlike 
>> on other OSes.
>> 
>> Michael
>> 
>> On 04.03.2014 at 17:22, Mark Needham <[email protected]> wrote:
>> 
>>> Hi Aram,
>>> 
>>> * Do you have any other information on the spec of the machine you're 
>>> running this on, e.g. how much RAM?
>>> * Have you tried raising the PERIODIC COMMIT value? Perhaps try it out 
>>> with a smaller subset of the data to measure the impact, e.g. with 
>>> values of 1,000 / 10,000. 
>>> * I think it would be interesting to pull out some other things as nodes as 
>>> well - might lead to more interesting queries e.g. CEO, Location, 
>>> Registered Agent, DOS Process, Jurisdiction could all be nodes that link 
>>> back to a DOS. 
>>> 
>>> Let me know if any of that doesn't make sense.
>>> Mark
>>> 
>>> 
>>> On 4 March 2014 15:54, Aram Chung <[email protected]> wrote:
>>> Hi,
>>> 
>>> I was asked to post this here by Mark Needham (@markhneedham) who thought 
>>> my query took longer than it should.
>>> 
>>> I'm trying to see how graph databases could be used in investigative 
>>> journalism: I was loading in New York State's Active Corporations: 
>>> Beginning 1800 data from 
>>> https://data.ny.gov/Economic-Development/Active-Corporations-Beginning-1800/n9v6-gdp6
>>>  as a 1964486-row csv (and deleted all U+F8FF characters, because I was 
>>> getting "[null] is not a supported property value"). The Cypher query I 
>>> used was 
>>> 
>>> USING PERIODIC COMMIT 500
>>> LOAD CSV
>>>   FROM 
>>> "file://path/to/csv/Active_Corporations___Beginning_1800__without_header__wonky_characters_fixed.csv"
>>>   AS company
>>> CREATE (:DataActiveCorporations
>>>     {
>>>             DOS_ID:company[0],
>>>             Current_Entity_Name:company[1],
>>>             Initial_DOS_Filing_Date:company[2],
>>>             County:company[3],
>>>             Jurisdiction:company[4],
>>>             Entity_Type:company[5],
>>> 
>>>             DOS_Process_Name:company[6],
>>>             DOS_Process_Address_1:company[7],
>>>             DOS_Process_Address_2:company[8],
>>>             DOS_Process_City:company[9],
>>>             DOS_Process_State:company[10],
>>>             DOS_Process_Zip:company[11],
>>> 
>>>             CEO_Name:company[12],
>>>             CEO_Address_1:company[13],
>>>             CEO_Address_2:company[14],
>>>             CEO_City:company[15],
>>>             CEO_State:company[16],
>>>             CEO_Zip:company[17],
>>> 
>>>             Registered_Agent_Name:company[18],
>>>             Registered_Agent_Address_1:company[19],
>>>             Registered_Agent_Address_2:company[20],
>>>             Registered_Agent_City:company[21],
>>>             Registered_Agent_State:company[22],
>>>             Registered_Agent_Zip:company[23],
>>> 
>>>             Location_Name:company[24],
>>>             Location_Address_1:company[25],
>>>             Location_Address_2:company[26],
>>>             Location_City:company[27],
>>>             Location_State:company[28],
>>>             Location_Zip:company[29]
>>>     }
>>> );
>>> 
>>> Each row is one node so it's as close to the raw data as possible. The idea 
>>> is loosely that these nodes will be linked with new nodes representing 
>>> people and addresses verified by reporters.
>>> 
>>> This is what I got:
>>> 
>>> +-------------------+
>>> | No data returned. |
>>> +-------------------+
>>> Nodes created: 1964486
>>> Properties set: 58934580
>>> Labels added: 1964486
>>> 4550855 ms
>>> 
>>> Some context information: 
>>> Neo4j Milestone Release 2.1.0-M01
>>> Windows 7
>>> java version "1.7.0_03"
>>> 
>>> Best,
>>> Aram
>>> 
>>> -- 
>>> You received this message because you are subscribed to the Google Groups 
>>> "Neo4j" group.
>>> To unsubscribe from this group and stop receiving emails from it, send an 
>>> email to [email protected].
>>> For more options, visit https://groups.google.com/groups/opt_out.
>>> 
>> 
> 
