Hello good people,
I need help!
Since my last post I've been trying to get a slightly altered LOAD CSV
command to run, without much success. (I haven't been successful writing up
a blog post either, though a summary is up at Aramology.com, first tile on
the menu. Any corrections on the content are welcome.)
This (below) is how I want to structure the database, so that all the
Business nodes that contain the same name point to the same Name node. (I
also want to prevent it from creating Name nodes when the name fields are
blank. Is this possible?) The database freezes up halfway through the
command (it becomes unresponsive and the launcher window goes black).
USING PERIODIC COMMIT 10000
LOAD CSV
FROM "path/to/Active_Corporations___Beginning_1800.csv"
AS company
CREATE (n:Business
{
DOS_ID:company[0],
Current_Entity_Name:company[1],
Initial_DOS_Filing_Date:company[2],
County:company[3],
Jurisdiction:company[4],
Entity_Type:company[5],
DOS_Process_Name:company[6],
DOS_Process_Address_1:company[7],
DOS_Process_Address_2:company[8],
DOS_Process_City:company[9],
DOS_Process_State:company[10],
DOS_Process_Zip:company[11],
CEO_Name:company[12],
CEO_Address_1:company[13],
CEO_Address_2:company[14],
CEO_City:company[15],
CEO_State:company[16],
CEO_Zip:company[17],
Registered_Agent_Name:company[18],
Registered_Agent_Address_1:company[19],
Registered_Agent_Address_2:company[20],
Registered_Agent_City:company[21],
Registered_Agent_State:company[22],
Registered_Agent_Zip:company[23],
Location_Name:company[24],
Location_Address_1:company[25],
Location_Address_2:company[26],
Location_City:company[27],
Location_State:company[28],
Location_Zip:company[29]
}
)
MERGE (n0:Name
{
name:company[1]
}
)
CREATE (n)-[:CURRENT_ENTITY_NAME]->(n0)
MERGE (n1:Name
{
name:company[6]
}
)
CREATE (n)-[:DOS_PROCESS_NAME]->(n1)
MERGE (n2:Name
{
name:company[12]
}
)
CREATE (n)-[:CEO_NAME]->(n2)
MERGE (n3:Name
{
name:company[18]
}
)
CREATE (n)-[:REGISTERED_AGENT_NAME]->(n3)
MERGE (n4:Name
{
name:company[24]
}
)
CREATE (n)-[:LOCATION_NAME]->(n4)
;
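One variation I've been sketching for the blank-name problem (this is guesswork on my part, shown for the Current_Entity_Name field only, and I haven't tested it at scale):

```cypher
// Hypothetical variation: create the Business node unconditionally, but only
// MERGE a Name node and relationship when the field is non-blank.
// The same WITH/WHERE pattern would have to be repeated for the other four
// name columns. If blank fields come through as null rather than "", the
// test would need to be: company[1] IS NOT NULL
USING PERIODIC COMMIT 10000
LOAD CSV
FROM "path/to/Active_Corporations___Beginning_1800.csv"
AS company
CREATE (n:Business {DOS_ID: company[0], Current_Entity_Name: company[1]})
WITH n, company
WHERE company[1] <> ""
MERGE (n0:Name {name: company[1]})
CREATE (n)-[:CURRENT_ENTITY_NAME]->(n0);
```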
*1.* I tried to get around this by first using CREATE instead of MERGE,
then copying over the relationships from duplicate Name nodes to a single
survivor and deleting the duplicates. This works fine for a 10,000-row
version. For a while I was even convinced it ran in linear time, but of
course it doesn't, because I have to sort the Name nodes along the way to
find the duplicates (is ORDER BY O(n log n)?). The full 1,964,486-row
version gets derailed at this point.
I did some counting outside of Neo4j, and the full 1,964,486-row version
should have
5,067,050 relationships and
2,816,857 non-blank names once the duplicates are deleted.
Is there an efficient way to get 5,067,050 relationships to
connect 1,964,486 Business nodes to the correct Name node out of 2,816,857?
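For context, the shape I imagine for this (though I haven't managed to make it work yet, so treat it as a guess) is a two-pass load: create the Business nodes first, then a second LOAD CSV pass that only builds Name nodes and relationships, with indexes in place so MERGE can do an index lookup instead of a label scan:

```cypher
// Hypothetical second pass, assuming the Business nodes already exist and
// that indexes were created beforehand with:
//   CREATE INDEX ON :Name(name);
//   CREATE INDEX ON :Business(DOS_ID);
// Shown for Current_Entity_Name only; the other four name columns would
// need the same MERGE/CREATE pair.
USING PERIODIC COMMIT 10000
LOAD CSV
FROM "path/to/Active_Corporations___Beginning_1800.csv"
AS company
MATCH (n:Business {DOS_ID: company[0]})
MERGE (n0:Name {name: company[1]})
CREATE (n)-[:CURRENT_ENTITY_NAME]->(n0);
```

My hope is that with an index each MERGE becomes a cheap lookup per row, which is where the 5,067,050 relationships might become tractable.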
*2.* I also recently tried using another computer, this time a Mac, and I
need some help editing the memory settings.
First I edited conf/neo4j.properties to include
use_memory_mapped_buffers=true
neostore.nodestore.db.mapped_memory=200M
neostore.relationshipstore.db.mapped_memory=1G
neostore.propertystore.db.mapped_memory=500M
neostore.propertystore.db.strings.mapped_memory=250M
neostore.propertystore.db.arrays.mapped_memory=0M
Then I edited bin/neo4j-shell to say
EXTRA_JVM_ARGUMENTS="-Xmx6G -Xms6G -Xmn1G -XX:+UseConcMarkSweepGC -server"
Did I get everything? Even with the original LOAD CSV command just creating
Business nodes I wasn't able to bring it down to the 6:30 you had, which
makes me think I missed something. On my Windows machine I can't get 6:30
either (it's usually a little under 30 minutes), but I think that's because
I can't do -Xmx6G -Xms6G -Xmn1G, so I settled for -Xmx4G -Xms4G -Xmn1G
instead.
I'm very keen to get this working, as I'm getting some amazing query
results even from the 10,000-row version, a great improvement over what I
could do with traditional relational databases. Once I can get the full
dataset in there, Newsday <http://www.newsday.com/> is interested in using
it for an ongoing investigative news story. I would very much like to know
if it can be done.
The next dataset I need to load in and connect with the current
1,964,486-row one is a whopping 8,765,456-row csv.
Thanks,
Aram
P.S. This might be on your to-do list already, but will a future version of
Neo4j support date types? I know there are ways around it, but so much of
journalism relies on correct date information that I think the lack of
native dates would be the biggest limit on Neo4j's journalistic
applications. It's not a very pressing matter right now.
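(The workaround I mean is storing dates as sortable strings. Assuming the filing dates were normalized to yyyy-MM-dd on the way in, which this dataset's raw format is not, something like this should give correct chronological range queries, since lexical order matches date order for that format:)

```cypher
// Hypothetical range query, assuming Initial_DOS_Filing_Date was stored as
// a zero-padded "yyyy-MM-dd" string, so string comparison == date comparison.
MATCH (n:Business)
WHERE n.Initial_DOS_Filing_Date >= "1990-01-01"
  AND n.Initial_DOS_Filing_Date <  "2000-01-01"
RETURN count(n);
```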
On Wednesday, March 5, 2014 10:39:52 AM UTC-5, Aram Chung wrote:
>
> Wow this is great! I'll definitely try what you did. Please expect
> questions along the way.
>
> And a write-up is coming; I was thinking I'd do that as soon as I get some
> relationships in, but now I should probably make a post about LOAD CSV.
> I'll post a link when I do.
>
> Thanks!
> Aram
>
>
> On Wednesday, March 5, 2014 6:48:34 AM UTC-5, Michael Hunger wrote:
>>
>> Oh and btw. I would LOVE to see a blog post from you about what you're
>> working on!
>>
>> Thanks so much
>>
>> Michael
>>
>> Am 05.03.2014 um 12:00 schrieb Michael Hunger <
>> [email protected]>:
>>
>> > I just tested your file on MacOS with these settings
>> > and got 6:30 for the 2m rows
>> >
>> > EXTRA_JVM_ARGUMENTS="-Xmx6G -Xms6G -Xmn1G"
>> >
>> > on windows you have to add the memory from the mmio settings in
>> neo4j.properties to the heap
>> >
>> > cat conf/neo4j.properties
>> > # Default values for the low-level graph engine
>> > neostore.nodestore.db.mapped_memory=200M
>> > neostore.relationshipstore.db.mapped_memory=1G
>> > neostore.propertystore.db.mapped_memory=500M
>> > neostore.propertystore.db.strings.mapped_memory=250M
>> > neostore.propertystore.db.arrays.mapped_memory=0M
>> >
>> > USING PERIODIC COMMIT 10000
>> >> LOAD CSV
>> >> FROM
>> "file:///Users/mh/Downloads/Active_Corporations___Beginning_1800_no_head.csv"
>> >> AS company
>> >> CREATE (:DataActiveCorporations
>> >> {
>> >> DOS_ID:company[0],
>> >> Current_Entity_Name:company[1],
>> >> Initial_DOS_Filing_Date:company[2],
>> > ......
>> >> Registered_Agent_Zip:company[23],
>> >>
>> >> Location_Name:company[24],
>> >> Location_Address_1:company[25],
>> >> Location_Address_2:company[26],
>> >> Location_City:company[27],
>> >> Location_State:company[28],
>> >> Location_Zip:company[29]
>> >> }
>> >> );
>> >
>> > +-------------------+
>> > | No data returned. |
>> > +-------------------+
>> > Nodes created: 1964486
>> > Properties set: 58934580
>> > Labels added: 1964486
>> > 391059 ms
>> >
>> >
>> >
>> > Am 05.03.2014 um 08:34 schrieb Michael Hunger <
>> [email protected]>:
>> >
>> >> Oh and if you use neo4j-shell without server you have to set the heap
>> in bin\Neo4jShell.bat in EXTRA_JVM_ARGUMENTS="-Xmx4G -Xms4G -Xmn1G"
>> >>
>> >> and call
>> >>
>> >> bin\Neo4jShell -conf conf\neo4j.properties -path data\graph.db
>> >>
>> >> Am 05.03.2014 um 08:29 schrieb Michael Hunger <
>> [email protected]>:
>> >>
>> >>> Yep,
>> >>>
>> >>> it would be also interesting how you ran this? With neo4j-shell?
>> Against a running server?
>> >>> Did you configure any RAM or memory mapping setting in
>> neo4j.properties?
>> >>>
>> >>> Check out this blog post for some hints on memory config:
>> http://blog.bruggen.com/2014/02/some-neo4j-import-tweaks-what-and-where.html?view=sidebar
>>
>> >>> Note that on windows the heap settings include the mmio settings
>> unlike other OS'es.
>> >>>
>> >>> Michael
>> >>>
>> >>> Am 04.03.2014 um 17:22 schrieb Mark Needham <[email protected]>:
>> >>>
>> >>>> Hi Aram,
>> >>>>
>> >>>> * Do you have any other information of the spec of the machine
>> you're running this on? e.g. how much RAM etc
>> >>>> * Have you tried upping the value to PERIODIC COMMIT? Perhaps try it
>> out with a smaller subset of the data to measure the impact - try it with
>> values of 1,000 / 10,000 perhaps.
>> >>>> * I think it would be interesting to pull out some other things as
>> nodes as well - might lead to more interesting queries e.g. CEO, Location,
>> Registered Agent, DOS Process, Jurisdiction could all be nodes that link
>> back to a DOS.
>> >>>>
>> >>>> Let me know if any of that doesn't make sense.
>> >>>> Mark
>> >>>>
>> >>>>
>> >>>> On 4 March 2014 15:54, Aram Chung <[email protected]> wrote:
>> >>>> Hi,
>> >>>>
>> >>>> I was asked to post this here by Mark Needham (@markhneedham) who
>> thought my query took longer than it should.
>> >>>>
>> >>>> I'm trying to see how graph databases could be used in investigative
>> journalism: I was loading in New York State's Active Corporations:
>> Beginning 1800 data from
https://data.ny.gov/Economic-Development/Active-Corporations-Beginning-1800/n9v6-gdp6
as
>> a 1964486-row csv (and deleted all U+F8FF characters, because I was
>> getting "[null] is not a supported property value"). The Cypher query I
>> used was
>> >>>>
>> >>>> USING PERIODIC COMMIT 500
>> >>>> LOAD CSV
>> >>>> FROM
>> "file://path/to/csv/Active_Corporations___Beginning_1800__without_header__wonky_characters_fixed.csv"
>> >>>> AS company
>> >>>> CREATE (:DataActiveCorporations
>> >>>> {
>> >>>> DOS_ID:company[0],
>> >>>> Current_Entity_Name:company[1],
>> >>>> Initial_DOS_Filing_Date:company[2],
>> >>>> County:company[3],
>> >>>> Jurisdiction:company[4],
>> >>>> Entity_Type:company[5],
>> >>>>
>> >>>> DOS_Process_Name:company[6],
>> >>>> DOS_Process_Address_1:company[7],
>> >>>> DOS_Process_Address_2:company[8],
>> >>>> DOS_Process_City:company[9],
>> >>>> DOS_Process_State:company[10],
>> >>>> DOS_Process_Zip:company[11],
>> >>>>
>> >>>> CEO_Name:company[12],
>> >>>> CEO_Address_1:company[13],
>> >>>> CEO_Address_2:company[14],
>> >>>> CEO_City:company[15],
>> >>>> CEO_State:company[16],
>> >>>> CEO_Zip:company[17],
>> >>>>
>> >>>> Registered_Agent_Name:company[18],
>> >>>> Registered_Agent_Address_1:company[19],
>> >>>> Registered_Agent_Address_2:company[20],
>> >>>> Registered_Agent_City:company[21],
>> >>>> Registered_Agent_State:company[22],
>> >>>> Registered_Agent_Zip:company[23],
>> >>>>
>> >>>> Location_Name:company[24],
>> >>>> Location_Address_1:company[25],
>> >>>> Location_Address_2:company[26],
>> >>>> Location_City:company[27],
>> >>>> Location_State:company[28],
>> >>>> Location_Zip:company[29]
>> >>>> }
>> >>>> );
>> >>>>
>> >>>> Each row is one node so it's as close to the raw data as possible.
>> The idea is loosely that these nodes will be linked with new nodes
>> representing people and addresses verified by reporters.
>> >>>>
>> >>>> This is what I got:
>> >>>>
>> >>>> +-------------------+
>> >>>> | No data returned. |
>> >>>> +-------------------+
>> >>>> Nodes created: 1964486
>> >>>> Properties set: 58934580
>> >>>> Labels added: 1964486
>> >>>> 4550855 ms
>> >>>>
>> >>>> Some context information:
>> >>>> Neo4j Milestone Release 2.1.0-M01
>> >>>> Windows 7
>> >>>> java version "1.7.0_03"
>> >>>>
>> >>>> Best,
>> >>>> Aram
>> >>>>
>> >>>> --
>> >>>> You received this message because you are subscribed to the Google
>> Groups "Neo4j" group.
>> >>>> To unsubscribe from this group and stop receiving emails from it,
>> send an email to [email protected].
>> >>>> For more options, visit https://groups.google.com/groups/opt_out.
>> >>>>
>> >>>>
>> >>>
>> >>
>> >
>>
>>