OK Fellas,
As you might imagine, the last effort I made didn't work either, even though I
cut down the relationships a lot. Same story: Java maxing out at 4 GB and not
doing anything for 12+ hours.
OK, so here is my understanding of the approaches and my likely course of
action. Some aspects you guys cite I'm not familiar with, particularly
indexes and constraints. I've never used them before, so I'm going to look for
examples that give me an idea of how to do them.
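From what I've found in the Neo4j docs so far, the syntax seems to be roughly like
this (using one of my own labels and properties as a guess, e.g. keeping MEDCODE
unique and indexing NAME; correct me if I've got the idea wrong):

// my guess at a uniqueness constraint on one of my labels/properties
CREATE CONSTRAINT ON (n:`CLT SOURCE`) ASSERT n.MEDCODE IS UNIQUE;
// and a plain index, for properties I only need to look up quickly
CREATE INDEX ON :`CLT SOURCE`(NAME);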
Approaches:
1. Approach 1:
a. “Creating only small subgraphs with a single statement separated by
semicolons. Eg up to 100 nodes and rels”
b. surround blocks of statements with begin and commit commands
c. I am assuming that this approach involves Cypher statements uploaded via
the neo4j shell
I am assuming that the format you are referring to is similar to what was
used in the movies db. There, a few nodes were created, then the
relationships that used them were created, and so on. Since I used the
“movies” DB as my model, I did not use the “begin” and “commit” commands in
my previous code. They seemed to work fine and I didn’t know I needed them.
I will look up how to use them. However, this means making sure the nodes
and relationships are in the proper order. That’ll take a little work.
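If that's right, then my shell file would end up looking roughly like this, with a
small subgraph (up to ~100 nodes and rels) created in one statement ended by a
semicolon, and blocks wrapped in begin/commit. The second node here is one I made
up just to show the shape of it:

begin
// one small subgraph per statement; the target node below is a made-up example
CREATE (a:`CLT SOURCE` {NAME:'Acetoacetate (ketone body)', MEDCODE:10010})
CREATE (b:`CLT SOURCE` {NAME:'Example target test', MEDCODE:99999})
CREATE (a)-[:MEASUREMENT_OF {CAT:'TEST.NAME'}]->(b);
commit

Does that look like what you had in mind?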
2. Approach 2
a. “…create nodes and later match them by label & property to connect them”
b. surround blocks of statements with begin and commit commands
c. I am assuming that this approach involves Cypher statements uploaded via
the neo4j shell
I am not sure exactly what you mean here in terms of “…match them by label
& property to connect them”
3. CSV approach
a. “Dump the base into 2 .csv files:”
b. CSV1: “Describe nodes (enumerate them via some my_node_id integer
attribute), columns: my_node_id,label,node_prop_01,node_prop_ZZ”
c. CSV2: “Describe relations, columns: source_my_node_id,
dest_my_node_id,rel_type,rel_prop_01,...,rel_prop_NN”
d. Indexes / constraints: before starting the import, have appropriate indexes
/ constraints in place (I take a stab at these below)
e. via LOAD CSV, import CSV1, then CSV2.
f. Import no more than 10,000-30,000 lines in a single LOAD CSV statement
This seems to be a very well-elaborated method and the easiest for me to
do. I have files such that I can create these without too much trouble. I
figure I’ll split the nodes into three files of 20k rows each. I can do the
same with the rels. I have not used indexes or constraints yet in the DBs
that I already created, and as I said above, I’ll have to see how to use
them.
I am assuming that column headers that fit my data are consistent with what
you explained below (i.e., I can put my own meaningful text into label1 -
label10 and node_prop_01 - 05):

my_node_id, label1, label2, label3, label4, label5, label6, label7, label8,
label9, label10, node_prop_01, node_prop_02, node_prop_03, node_prop_04,
node_prop_ZZ
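And here is my rough guess at the import itself, assuming I key everything on
my_node_id, the files are called nodes.csv and rels.csv (placeholder names and
paths of mine), and I use a single literal label (LabItem, also my placeholder),
since I don't think LOAD CSV can take the label from a column. Same question for
rel_type: I'm guessing I'd run one LOAD CSV per relationship type, with
MEASUREMENT_OF shown here. Please correct me if I've mangled any of this:

// uniqueness constraint on my_node_id, created before the import starts
CREATE CONSTRAINT ON (n:LabItem) ASSERT n.my_node_id IS UNIQUE;

// nodes first (LabItem and the file path are my placeholders)
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:/path/to/nodes.csv" AS row
CREATE (n:LabItem {my_node_id: toInt(row.my_node_id), node_prop_01: row.node_prop_01});

// then the relationships, matching both ends by my_node_id
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:/path/to/rels.csv" AS row
MATCH (a:LabItem {my_node_id: toInt(row.source_my_node_id)})
MATCH (b:LabItem {my_node_id: toInt(row.dest_my_node_id)})
CREATE (a)-[:MEASUREMENT_OF {rel_prop_01: row.rel_prop_01}]->(b);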
Thanks again Fellas!!
Jose
On Wednesday, November 19, 2014 8:04:44 AM UTC-5, Michael Hunger wrote:
>
> José,
>
> Let's continue the discussion on the google group
>
> With larger, I meant the amount of data, not the size of statements
>
> As I also point out in various places we recommend creating only small
> subgraphs with a single statement separated by semicolons.
> Eg up to 100 nodes and rels
>
> Gigantic statements just make the parser explode
>
> I recommend splitting them up into statements creating subgraphs
> Or create nodes and later match them by label & property to connect them
> Make sure to have appropriate indexes / constraints
>
> You should also surround blocks of statements with begin and commit
> commands
>
> Sent from my iPhone
>
> On Nov 19, 2014, at 4:18 AM, José F. Morales Ph.D. <[email protected]> wrote:
>
> Hey Michael and Kenny
>
> Thank you guys a bunch for the help.
>
> Let me give you a little background. I am charged with making a prototype of
> a tool (“LabCards”) that we hope to use in the hospital and beyond at some
> point. In preparation for making the main prototype, I made two prior
> Neo4j databases that worked exactly as I wanted them to. The first
> database was built with NIH data and had 183 nodes and around 7500
> relationships. The second database was the Pre-prototype and it had 1080
> nodes and around 2000 relationships. I created these in the form of Cypher
> statements and either pasted them in the Neo4j browser or used the neo4j
> shell and loaded them as text files. Before doing that I checked the Cypher
> code with Sublime Text 2, which highlights the code. Both databases loaded
> fine in both methods and did what I wanted them to do.
>
> As you might imagine, the prototype is an expansion of the mini-prototype.
> It has almost the same data model and I built it as a series of cypher
> statements as well. My first version of the prototype had ~60k nodes and
> 160k relationships.
>
> I should say that a feature of this model is that all the source and
> target nodes have relationships that point to each other. No node points
> to itself as far as I know. This file was 41 MB of Cypher code that I tried
> to load via the neo4j shell.
>
> In fact, I was following your advice on loading big data files... “Use the
> Neo4j-Shell for larger Imports” (
> http://jexp.de/blog/2014/06/load-csv-into-neo4j-quickly-and-successfully/).
> This first time out, Java maxed out its allocated memory at 4 GB (two Java
> processes) and did not complete loading in 24 hours. I killed it.
>
> I then contacted Kenny, and he generously gave me some advice regarding
> the properties file (below) and again the same deal (4 Gb Memory 2x) with
> Java and no success in about 24 hours. I killed that one too.
>
> Given my loading problems, I have subsequently eliminated a bunch of
> relationships (100k) so that the file is now 21 MB. A lot of these were
> duplicates that I didn’t pick up before, and I am trying it again. So far,
> 15 min into it, it's a similar situation. The difference is that Java is
> using 1.7 and 0.5 GB of memory.
>
> Here is the Cypher for a typical node…
>
> CREATE (CLT_1:`CLT SOURCE`:BIOMEDICAL:TEST_NAME:`Laboratory Procedure`:lbpr:`Procedures`:PROC:T059:`B1.3.1.1`:TZ
>   {NAME:'Acetoacetate (ketone body)', SYNONYM:'', Sample:'SERUM, URINE', MEDCODE:10010, CUI:'NA'})
>
> Here is the Cypher for a typical relationship...
>
> CREATE (CLT_1)-[:MEASUREMENT_OF {Phylum:'TZ', CAT:'TEST.NAME', Ui_Rl:'T157', RESULT:'', Type:'',
>   Semantic_Distance_Score:'NA', Path_Length:'NA', Path_Steps:'NA'}]->(CLT_TARGET_3617),
>
> I will let you know how this one turns out. I hope this is helpful.
>
> Many, many thanks fellas!!!
>
> Jose
>
> On Nov 18, 2014, at 8:33 PM, Michael Hunger <[email protected]> wrote:
>
> Hi José,
>
> can you provide perhaps more detail about your dataset (e.g. a sample of the
> CSV, size, etc.; perhaps an output of csvstat (from csvkit) would be helpful),
> and your Cypher queries to load it?
>
> Have you seen my other blog post, which explains two big caveats that
> people run into when trying this?
> jexp.de/blog/2014/10/load-cvs-with-success/
>
> Cheers, Michael
>
> On Tue, Nov 18, 2014 at 8:43 PM, Kenny Bastani <[email protected]> wrote:
>
>> Hey Jose,
>>
>> There is definitely an answer. Let me put you in touch with the data
>> import master: Michael Hunger.
>>
>> Michael, I think the answers here will be pretty straightforward for
>> you. You met Jose at GraphConnect NY last year, so I'll spare any
>> introductions. The memory map configurations I provided need to be
>> calculated and customized for the data import volume.
>>
>> Thanks,
>>
>> Kenny
>>
>> Sent from my iPhone
>>
>> On Nov 18, 2014, at 11:37 AM, José F. Morales Ph.D. <[email protected]> wrote:
>>
>> Kenny,
>>
>> In 3 hours it’ll have been trying to load for 24 hours, so this is not
>> working. I’m catching shit from my crew too, so I’ve got to fix this, like,
>> soon.
>>
>> I haven’t done this before, but can I break up the data and load it in
>> pieces?
>>
>> Jose
>>
>> On Nov 17, 2014, at 3:35 PM, Kenny Bastani <[email protected]> wrote:
>>
>> Hey Jose,
>>
>> Try turning off the object cache. Add this line to the neo4j.properties
>> configuration file:
>>
>> cache_type=none
>>
>> Then retry your import. Also, enable memory mapped files by adding these
>> lines to the neo4j.properties file:
>>
>> neostore.nodestore.db.mapped_memory=2048M
>> neostore.relationshipstore.db.mapped_memory=4096M
>> neostore.propertystore.db.mapped_memory=200M
>> neostore.propertystore.db.strings.mapped_memory=500M
>> neostore.propertystore.db.arrays.mapped_memory=500M
>>
>> Thanks,
>>
>> Kenny
>>
>> ------------------------------
>> *From:* José F. Morales Ph.D. <[email protected]>
>> *Sent:* Monday, November 17, 2014 12:32 PM
>> *To:* Kenny Bastani
>> *Subject:* latest
>>
>> Hey Kenny,
>>
>> Here’s the deal. As I think I said, I loaded the 41 MB file of Cypher
>> code via the neo4j shell. Before I tried the LabCards file, I tried the
>> movies file and a UMLS database I made (8k relationships). They worked
>> fine.
>>
>> The LabCards file is taking a LONG time to load, since I started at
>> about 9:30 - 10 PM last night and it’s 3 PM now.
>>
>> I’ve wondered if it’s hung up; the Activity Monitor’s memory usage is
>> constant at two rows of Java at 4 GB, with the kernel at 1 GB. The CPU panel
>> changes a lot, so it looks like it’s doing its thing.
>>
>> So is this how things are expected to go? Do you think the loading is
>> gonna take a day or two?
>>
>> Jose
>>
>>
>> |//.\\||//.\\|||//.\\||//.\\|||//.\\||//.\\||
>> José F. Morales Ph.D.
>> Instructor
>> Cell Biology and Pathology
>> Columbia University Medical Center
>> [email protected]
>> 212-452-3351
>>
>>
>> |//.\\||//.\\|||//.\\||//.\\|||//.\\||//.\\||
>> José F. Morales Ph.D.
>> Instructor
>> Cell Biology and Pathology
>> Columbia University Medical Center
>> [email protected]
>> 212-452-3351
>>
>>
>
> |//.\\||//.\\|||//.\\||//.\\|||//.\\||//.\\||
> José F. Morales Ph.D.
> Instructor
> Cell Biology and Pathology
> Columbia University Medical Center
> [email protected]
> 212-452-3351
>
>