Make sure to have a look at my blog posts:
1. elaborating the individual cypher commands:
if you create nodes and then later match them to connect:
// create indexes
create index on :Movie(title);
create index on :Person(name);
// or alternatively unique constraints
create constraint on (m:Movie) assert m.title is unique;
create constraint on (p:Person) assert p.name is unique;
begin
create (:Movie {title:"The Matrix", ...});
create (:Person {name:"Keanu Reeves", ...});
....
// match + create rel
match (m:Movie {title:"The Matrix"}), (p:Person {name:"Keanu Reeves"})
create (p)-[:ACTED_IN {role:"Neo"}]->(m);
...
commit
2. load csv
http://jexp.de/blog/2014/10/load-cvs-with-success/
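To make that CSV route concrete, here is a minimal sketch (untested against your data: the file paths, the MyNode label, the property names, and the 10,000-row batch size are placeholders; also note that plain Cypher cannot take the relationship type from a CSV column, so you need one statement per relationship type):

```cypher
// before importing CSV2, index the lookup property:
create index on :MyNode(my_node_id);

// CSV1 -- nodes: my_node_id,label,node_prop_01,...,node_prop_ZZ
using periodic commit 10000
load csv with headers from "file:/path/to/nodes.csv" as row
create (:MyNode {my_node_id: toInt(row.my_node_id),
                 node_prop_01: row.node_prop_01});

// CSV2 -- rels: source_my_node_id,dest_my_node_id,rel_type,rel_prop_01,...
// one such statement per relationship type
using periodic commit 10000
load csv with headers from "file:/path/to/rels.csv" as row
match (src:MyNode {my_node_id: toInt(row.source_my_node_id)}),
      (dst:MyNode {my_node_id: toInt(row.dest_my_node_id)})
create (src)-[:SOME_REL_TYPE {rel_prop_01: row.rel_prop_01}]->(dst);
```

The `using periodic commit` clause keeps transactions small, which is what avoids the memory blow-ups discussed below.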
On Wed, Nov 19, 2014 at 8:36 PM, José F. Morales <[email protected]> wrote:
> OK Fellas,
>
> As you might imagine, the last effort I made didn’t work either, even though
> I cut down the relationships a lot. Same story: Java maxing out at 4 GB and not
> doing anything for 12+ hours.
>
> OK, so here is my understanding of the Approaches and my likely course of
> action. Some aspects you guys cite I’m not familiar with…particularly
> indexes and constraints. I’ve never used them before. I’m going to look for
> examples that can give me an idea of how to do them.
>
> Approaches:
>
> 1. Approach 1:
> a. “Creating only small subgraphs with a single statement separated by
> semicolons. Eg up to 100 nodes and rels”
> b. surround blocks of statements with begin and commit commands
> c. I am assuming that this approach involves cypher statements uploaded
> via the neo4j shell
>
> I am assuming that the format you are referring to is similar to what was
> used in the movies db. There, a few nodes were created, then the
> relationships that used them were created and so on. Since I used the
> “movies” DB as my model, I did not use the “begin” and “commit” commands in
> my previous code. They seemed to work fine and I didn’t know I needed
> them. I will look up how to use them. However, this means making sure the
> nodes and relationships are in the proper order. That’ll take a little
> work.
>
> 2. Approach 2
> a. “…create nodes and later match them by label & property to connect them”
> b. surround blocks of statements with begin and commit commands
> c. I am assuming that this approach involves cypher statements uploaded
> via the neo4j shell
>
> I am not sure exactly what you mean here in terms of “…match them by label
> & property to connect them”
>
> 3. CSV approach
> a. “Dump the base into 2 .csv files:”
> b. CSV1: “Describe nodes (enumerate them via some my_node_id integer
> attribute), columns: my_node_id,label,node_prop_01,node_prop_ZZ”
> c. CSV2: “Describe relations, columns: source_my_node_id,
> dest_my_node_id,rel_type,rel_prop_01,...,rel_prop_NN”
> d. Indexes / constraints: before starting the import, have appropriate
> indexes / constraints in place
> e. via LOAD CSV, import CSV1, then CSV2.
> f. Import no more than 10,000-30,000 lines in a single LOAD CSV statement
>
> This seems to be a very well elaborated method and the easiest for me to
> do. I have files such that I can create these without too much problem. I
> figure I’ll split the nodes into three files 20k rows each. I can do the
> same with the Rels. I have not used indexes or constraints yet in the db’s
> that I already created and as I said above, I’ll have to see how to use
> them.
>
> I am assuming the column headers that fit my data are consistent with
> what you explained below (i.e., I can put my own meaningful text into
> label1 - label10 and node_prop_01 - 05):
>
> my_node_id, label1, label2, label3, label4, label5, label6, label7,
> label8, label9, label10, node_prop_01, node_prop_02, node_prop_03,
> node_prop_04, node_prop_ZZ
>
> Thanks again Fellas!!
>
> Jose
>
>
> On Wednesday, November 19, 2014 8:04:44 AM UTC-5, Michael Hunger wrote:
>>
>> José,
>>
>> Let's continue the discussion on the google group
>>
>> By “larger” I meant the amount of data, not the size of the statements.
>>
>> As I also point out in various places, we recommend creating only small
>> subgraphs with a single statement, separated by semicolons,
>> e.g. up to 100 nodes and rels.
>>
>> Gigantic statements just make the parser explode.
>>
>> I recommend splitting them up into statements creating subgraphs,
>> or create nodes and later match them by label & property to connect them.
>> Make sure to have appropriate indexes / constraints.
>>
>> You should also surround blocks of statements with begin and commit
>> commands
>>
>> Sent from my iPhone
>>
>> On 19.11.2014 at 04:18, José F. Morales Ph.D. <[email protected]> wrote:
>>
>> Hey Michael and Kenny
>>
>> Thank you guys a bunch for the help.
>>
>> Let me give you a little background. I am charged to make a prototype of
>> a tool (“LabCards”) that we hope to use in the hospital and beyond at some
>> point. In preparation for making the main prototype, I made two prior
>> Neo4j databases that worked exactly as I wanted them to. The first
>> database was built with NIH data and had 183 nodes and around 7500
>> relationships. The second database was the Pre-prototype and it had 1080
>> nodes and around 2000 relationships. I created these in the form of cypher
>> statements and either pasted them in the Neo4j browser or used the neo4j
>> shell and loaded them as text files. Before doing that I checked the cypher
>> code with Sublime Text 2 that highlights the code. Both databases loaded
>> fine in both methods and did what I wanted them to do.
>>
>> As you might imagine, the prototype is an expansion of the
>> mini-prototype. It has almost the same data model and I built it as a
>> series of cypher statements as well. My first version of the prototype had
>> ~60k nodes and 160k relationships.
>>
>> I should say that a feature of this model is that all the source and
>> target nodes have relationships that point to each other. No node points
>> to itself as far as I know. This file was 41 Mb of cypher code that I tried
>> to load via the neo4j shell.
>>
>> In fact, I was following your advice on loading big data files... “Use
>> the Neo4j-Shell for larger Imports”
>> (http://jexp.de/blog/2014/06/load-csv-into-neo4j-quickly-and-successfully/).
>> This first time out, Java maxed out its allocated memory at 4 GB (two
>> processes) and did not complete loading in 24 hours. I killed it.
>>
>> I then contacted Kenny, and he generously gave me some advice regarding
>> the properties file (below) and again the same deal (4 Gb Memory 2x) with
>> Java and no success in about 24 hours. I killed that one too.
>>
>> Given my loading problems, I have subsequently eliminated a bunch of
>> relationships (100k), so the file is now 21 Mb. A lot of these were
>> duplicates that I didn’t pick up before, and I am trying it again. So far,
>> 15 min into it, similar situation. The difference is that Java is using 1.7
>> and 0.5 GB of memory.
>>
>> Here is the cypher for a typical node…
>>
>> CREATE (CLT_1:`CLT SOURCE`:BIOMEDICAL:TEST_NAME:`Laboratory
>> Procedure`:lbpr:`Procedures`:PROC:T059:`B1.3.1.1`:TZ {NAME:'Acetoacetate
>> (ketone body)', SYNONYM:'', Sample:'SERUM, URINE', MEDCODE:10010, CUI:'NA'})
>>
>> Here is the cypher for a typical relationship...
>>
>> CREATE (CLT_1)-[:MEASUREMENT_OF {Phylum:'TZ', CAT:'TEST.NAME',
>> Ui_Rl:'T157', RESULT:'', Type:'', Semantic_Distance_Score:'NA',
>> Path_Length:'NA', Path_Steps:'NA'}]->(CLT_TARGET_3617),
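Following the create-then-match advice at the top of the thread, a node/relationship pair like the one quoted above could be split into separate small statements, roughly like this (a sketch only: the index, the choice of MEDCODE as the lookup key, and the target MEDCODE value 99999 are assumptions for illustration):

```cypher
// assumed: MEDCODE uniquely identifies each test node
create index on :TEST_NAME(MEDCODE);

begin
create (:`CLT SOURCE`:TEST_NAME {NAME:'Acetoacetate (ketone body)', MEDCODE:10010});
// ... more small create statements, up to ~100 nodes per transaction ...
commit

begin
// later: connect nodes by matching on the indexed property instead of
// referring back to identifiers from one giant statement
match (src:TEST_NAME {MEDCODE:10010}), (dst:TEST_NAME {MEDCODE:99999})
create (src)-[:MEASUREMENT_OF {CAT:'TEST.NAME', Ui_Rl:'T157'}]->(dst);
commit
```

This removes the need to keep every node identifier (e.g. CLT_TARGET_3617) alive inside a single 41 Mb statement, which is what makes the parser blow up.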
>>
>> I will let you know how this one turns out. I hope this is helpful.
>>
>> Many, many thanks fellas!!!
>>
>> Jose
>>
>> On Nov 18, 2014, at 8:33 PM, Michael Hunger <[email protected]>
>> wrote:
>>
>> Hi José,
>>
>> Can you perhaps provide more detail about your dataset (e.g. a sample of
>> the CSV, its size, etc.; an output of csvstat (from csvkit) would be
>> helpful), and the Cypher queries you use to load it?
>>
>> Have you seen my other blog post, which explains two big caveats that
>> people run into when trying this?
>> jexp.de/blog/2014/10/load-cvs-with-success/
>>
>> Cheers, Michael
>>
>> On Tue, Nov 18, 2014 at 8:43 PM, Kenny Bastani <[email protected]>
>> wrote:
>>
>>> Hey Jose,
>>>
>>> There is definitely an answer. Let me put you in touch with the data
>>> import master: Michael Hunger.
>>>
>>> Michael, I think the answers here will be pretty straightforward for
>>> you. You met Jose at GraphConnect NY last year, so I'll spare any
>>> introductions. The memory map configurations I provided need to be
>>> calculated and customized for the data import volume.
>>>
>>> Thanks,
>>>
>>> Kenny
>>>
>>> Sent from my iPhone
>>>
>>> On Nov 18, 2014, at 11:37 AM, José F. Morales Ph.D. <[email protected]>
>>> wrote:
>>>
>>> Kenny,
>>>
>>> In 3 hours it’ll be trying to load for 24 hours so this is not
>>> working. I’m catching shit from my crew too, so I got to fix this like
>>> soon.
>>>
>>> I haven’t done this before, but can I break up the data and load it in
>>> pieces?
>>>
>>> Jose
>>>
>>> On Nov 17, 2014, at 3:35 PM, Kenny Bastani <[email protected]> wrote:
>>>
>>> Hey Jose,
>>>
>>> Try turning off the object cache. Add this line to the
>>> neo4j.properties configuration file:
>>>
>>> cache_type=none
>>>
>>> Then retry your import. Also, enable memory mapped files by adding these
>>> lines to the neo4j.properties file:
>>>
>>> neostore.nodestore.db.mapped_memory=2048M
>>> neostore.relationshipstore.db.mapped_memory=4096M
>>> neostore.propertystore.db.mapped_memory=200M
>>> neostore.propertystore.db.strings.mapped_memory=500M
>>> neostore.propertystore.db.arrays.mapped_memory=500M
>>>
>>> Thanks,
>>>
>>> Kenny
>>>
>>> ------------------------------
>>> *From:* José F. Morales Ph.D. <[email protected]>
>>> *Sent:* Monday, November 17, 2014 12:32 PM
>>> *To:* Kenny Bastani
>>> *Subject:* latest
>>>
>>> Hey Kenny,
>>>
>>> Here’s the deal. As I think I said, I loaded the 41 Mb file of cypher
>>> code via the neo4j shell. Before I tried the LabCards file, I tried the
>>> movies file and a UMLS database I made (8k relationships). They worked
>>> fine.
>>>
>>> The LabCards file is taking a LONG time to load since I started at
>>> about 9:30 - 10 PM last night and its 3PM now.
>>>
>>> I’ve wondered if it’s hung up, but the activity monitor’s memory usage
>>> is constant at two rows of Java at 4 GB with the kernel at 1 GB. The CPU
>>> panel changes a lot, so it looks like it’s doing its thing.
>>>
>>> So is this how things are expected to go? Do you think the loading is
>>> gonna take a day or two?
>>>
>>> Jose
>>>
>>>
>>> |//.\\||//.\\|||//.\\||//.\\|||//.\\||//.\\||
>>> José F. Morales Ph.D.
>>> Instructor
>>> Cell Biology and Pathology
>>> Columbia University Medical Center
>>> [email protected]
>>> 212-452-3351
>>>
>>>
>>
>>
>> --
> You received this message because you are subscribed to the Google Groups
> "Neo4j" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> For more options, visit https://groups.google.com/d/optout.
>