[Neo4j] Re: large cypher statements

José F. Morales Sun, 30 Nov 2014 14:34:31 -0800


On Saturday, November 29, 2014 6:35:33 AM UTC-5, Andrii Stesin wrote:
>
> Hi Jose,
>
> On Saturday, November 29, 2014 1:12:52 AM UTC+2, José F. Morales wrote:
>>
>>
>>> 1. On nodes and their labels. First of all, I strongly suggest that you 
>>> separate your nodes into different .csv files by label. So you won't have a 
>>> column *`label`* in your .csv but rather a set of files:
>>>
>>> nodes_LabelA.csv
>>> ...
>>> nodes_LabelZ.csv
>>>
>>> whatever your labels are. (Consider a label to be a kind of synonym for 
>>> `class` in object-oriented programming or `table` in an RDBMS.) That's due 
>>> to the fact that labels in Cypher are somewhat special entities, and you 
>>> probably won't be allowed to parameterize them as variables inside your 
>>> LOAD CSV statement.
>>>
>>>
>> OK, so you have modified your original idea of putting the db into two 
>> files (1 for nodes, 1 for relationships).  Now you say to put all the nodes 
>> into 1 file per label.  The way I have worked with it, I created 1 file for 
>> a class of nodes I'll call CLT_SOURCE and another file for a class of nodes 
>> called CLT_TARGET.
>>
>
> Ok, but how many valid distinct combinations of your 10 node labels may 
> exist? 
>


JFM: 264
 

> I was speaking about a simple case where you have some limited number of 
> possible node labels (or their combinations), say less than 10.
>

JFM: A lot more than that.
 

>
>  You are recommending that with the nodes, I take two steps...
>
>> 1) Combine my CLT_SOURCE and CLT_TARGET nodes, 
>>
>
> Not necessarily "combine", but just give each node a unique (temporary) 
> *my_node_id*; see my "10+M tree" example below.
>  
>
>> 2) then I split that file into files that correspond to the node: 
>> *my_node_id, * 1 label, and then properties P1...Pn.  Since I have 10 
>> Labels/node, I should have 10 files named..... Nodes_LabelA... 
>> Nodes_LabelJ.  Thus...
>>
>
> You may have as many labels per node as you wish, but it is all about how 
> many valid distinct combinations of labels you have. (A single label is 
> itself a combination, obviously.)
>
> If you have a limited number of valid label combinations, it's one 
> story. But if we are talking about the order of 10! possible valid 
> combinations, the story is somewhat more interesting :) Which setup is 
> yours?
>

JFM:  Like I said, there are 264 unique combinations across all my nodes. Some 
are redundant: the full spelling of a term/phrase and an abbreviation.  Some 
are a code for a term/phrase.  Some were created in anticipation of other 
values I would create later.  I am trying to anticipate queries I'll make 
later.
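
A minimal Python sketch of how the number of distinct label combinations could be counted before deciding how to split the node files. The sample rows, the column names (`label_a`, `label_b`, `label_c`) and the label values `LabelX`/`LabelY` are made up for illustration; they are not from the real CLT data.

```python
import csv
import io
from collections import Counter

# Hypothetical sample: one row per node, one column per label slot.
SAMPLE = """my_node_id,label_a,label_b,label_c
1,CLT_SOURCE,LabelX,
2,CLT_TARGET,LabelX,
3,CLT_SOURCE,LabelX,
4,CLT_SOURCE,,LabelY
"""

LABEL_COLUMNS = ["label_a", "label_b", "label_c"]  # assumed column names


def label_combinations(csv_text, label_columns):
    """Count how many nodes carry each distinct combination of labels."""
    counts = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Ignore empty cells; sort so that column order does not matter.
        combo = tuple(sorted(row[c] for c in label_columns if row[c]))
        counts[combo] += 1
    return counts


combos = label_combinations(SAMPLE, LABEL_COLUMNS)
print(len(combos), "distinct label combinations")
for combo, n in combos.most_common():
    print(n, ":".join(combo))
```

Each distinct combination found this way would correspond to one node file (or one LOAD CSV pass) in the per-label approach discussed here.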
 

>  
>
>> File:  CLT_Nodes-LabelA     columns:  *my_node_id,* label A, property 
>> P1..., property P4
>> ...
>> File:  CLT_Nodes-LabelJ     columns:  *my_node_id,* label J, property 
>> P1..., property P4
>>
>>
>> Q1: What are the rules about what can be used for *my_node_id?  *I have 
>> usually seen them as a letter integer combination. Is that the convention? 
>>   Sometimes I've seen a letter being used with a specific class of nodes 
>>  a1..a100 for one class and b1..b100 for another.  I learned the hard way 
>> that you have to give each node a unique ID.  I used CLT_1...CLT_n for my 
>> CLT_SOURCE nodes and CLT_TARGET_1...CLT_TARGET_n for my TARGET nodes. 
>> It worked with the smaller db I made.  Anything wrong using the convention 
>> n1...n100?
>>
>
> I'm not aware of any conventions here; the only thing I know for sure is 
> that *a schema index works much(!) faster on plain integers than on Unicode 
> strings*. That's the only difference which I consider significant. So my 
> personal preference is for *my_node_id* to be a unique integer. Once, 
> when importing 10+ million nodes into a tree of variable height [1..7] 
> where each level of nodes was in a separate file (because each level had its 
> own unique label and unique set of properties), I just chose a scheme for 
> numbering them like
>
>
JFM: Makes sense for speed. I guess it depends upon the size of one's data.
 

> :Skewer:Level1 my_node_id = 10000000 + file1.csv line number
> :Skewer:Level2 my_node_id = 20000000 + file2.csv line number
> ...
> :Skewer:Level7 my_node_id = 70000000 + file7.csv line number
>
> so the relationship file (all relationships were of the same single type) 
> became a simple 2-column .csv like this, with 10+ million lines
>
> 10000017,20000362
> 10000017,20000547
> 10000017,40083215
> 10000018,30000397
> ...
>
> After successfully importing the 7 node files (so the nodes were ready in the 
> db and indexed on their unique *my_node_id* under the label :Skewer), I split 
> relationships.csv into 1000+ files of 10000 lines each and wrote a dumb 
> shell script which loaded them with `neo4j-shell -c` file by file, doing 
> `sleep 60` between files (to give neo4j a minute to complete each batch 
> transaction). Then I started it on Friday evening and got my tree ready on 
> Monday morning :)
>
> If you prefer alphanumerics for my_node_id it's completely up to you :) 
> Anyway, after successful import you may prefer to remove those temporary 
> ids completely from the database, just to conserve space where properties 
> are stored.
>  
>

JFM: OK.  Sounds good.
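
A minimal Python sketch of the two preparation steps described above: building integer ids as "level offset + line number", and cutting a large relationships file into batches of 10000 lines. The function names `make_node_id` and `batches` are hypothetical, and the generated relationship lines are dummy data; the sketch does not call Neo4j at all.

```python
from itertools import islice

ID_BLOCK = 10_000_000  # offset per tree level, as in the numbering scheme above


def make_node_id(level, line_number):
    """Integer id: level offset plus the line number within that level's file."""
    if not 0 < line_number < ID_BLOCK:
        raise ValueError("line number would spill into the next level's id range")
    return level * ID_BLOCK + line_number


def batches(lines, size=10_000):
    """Yield successive lists of at most `size` lines."""
    it = iter(lines)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


# Dummy relationship lines "source_id,target_id", just to exercise the batching.
rel_lines = [f"{make_node_id(1, i)},{make_node_id(2, i)}" for i in range(1, 25_001)]
batch_sizes = [len(b) for b in batches(rel_lines)]
print(make_node_id(1, 17), batch_sizes)
```

Each batch would then be written to its own file and loaded in a separate transaction, which is what the shell script with `sleep 60` did.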

 

> 2. Then consider one additional "technological" label, let's name it 
>>> `:Skewer` because it will "penetrate" all your nodes of every different 
>>> label (class) like a kebab skewer.
>>>
>>> Before you start (or at least before you start importing relationships) 
>>> do
>>>
>>> CREATE CONSTRAINT ON (every_node:Skewer) ASSERT every_node.my_node_id 
>>> IS UNIQUE;
>>>
>>>
>> Q2:  Should I do scenario 1 or 2?
>>
>> Scenario 1:  add two labels to each file?  One from my original nodes and 
>> one as "Skewer"
>>
>> File 1:  CLT_Nodes-LabelA     columns:  *my_node_id,* label A, *Skewer*, 
>> property P1..., property P4
>> ...
>> File 2:  CLT_Nodes-LabelJ     columns:  *my_node_id,* label J, *Skewer*, 
>> property P1..., property P4
>>  
>> OR 
>>
>> Scenario 2:  Include an eleventh file thus....
>>
>> File 11:  CLT_Nodes-LabelK     columns:  *my_node_id,* *Skewer*, 
>> property P1..., property P4 
>>
>> From below, I think you mean Scenario 1.
>>
>
> Yes and you don't need to add a column for :Skewer label into a file, the 
> LOAD CSV statement should assign it.
>  
>

JFM: OK.  Sounds good.  
 

> Q3: “Skewer” is just an integer right?  It corresponds in a way to 
>> my_node_id 
>>
>
> No, it's a label! so in Cypher your node (suppose it has 2 labels :LabelA 
> and :LabelJ ) is described like
>
> MATCH (n:LabelA:LabelJ:Skewer {my_node_id: 123454, p1: 'something',
>        p2: 'something else', p3: 'etc.'})
>
>
JFM: Got that!

JFM: OK, basic question...  MATCH (n:  <---What is "n"? Does it just 
indicate that it's a node of a particular class?  Which letter is used is 
arbitrary, right?  Is there a name for what "n" is? For a while there, I 
thought it was *my_node_id*.
 

>  Here is some sort of cypher….
>
>>  
>> //Creating the nodes
>>
>>  
>>
>> USING PERIODIC COMMIT 1000 
>>
>> LOAD CSV WITH HEADERS FROM "…/././…. CLT_NODES_LabelA.csv" AS csvline 
>>
>> MERGE (my_node_id:Skewer: LabelA {property1: csvline.property1}) 
>>
>> ON CREATE SET  
>>
>> n.Property2 = csvline.Property2,  
>>
>> n.Property3 = csvline.Property3,  
>>
>> n.Property4 = csvline.Property4; 
>>
>>
>> ….
>> LOAD CSV WITH HEADERS FROM "…/././…. CLT_NODES_LabelJ.csv" AS csvline
>>
>>  
>>
>> MERGE (my_node_id:Skewer: LabelJ {property1: csvline.property1}) 
>>
>> ON CREATE SET  
>>
>> n.Property2 = csvline.Property2,  
>>
>> n.Property3 = csvline.Property3,  
>>
>> n.Property4 = csvline.Property4;
>>
>>
>>  
>>
>> Q4: So does repeating the LOAD CSV with each file CLT_NODES_LabelA…J 
>> combine the various labels and their respective values with their 
>> corresponding nodes? 
>>
>
> A label is not a variable; it does not have a value. It's just a label; 
> think of it as a "tag".
> Also *my_node_id* IS a variable so it does have a value.
>
>
JFM: OK, I am not understanding this.  I understood a "Label" as a general 
category for a node.  This was as opposed to a "Property" that was specific 
to a particular node.  As I understood it, a "Label" has different values. 
 So that Label could be "Category" and there could be two categories, for 
example...  CLT_SOURCE and CLT_TARGET .    I thought that makes it like a 
variable.  If not, the label is all the same on a given set of nodes and 
what's the point in that?
 
JFM: OK, I get that *my_node_id *is a variable.  
 

> Looking at your 2 code snippets - in case you hope that the first one will 
> create a node with LabelA and the second one will assign LabelJ to a node 
> which was created earlier, you are wrong. 
>
 

> But... if you remove the labels from MERGE, it will work. Look carefully 
> here, though:
>
> LOAD CSV WITH HEADERS FROM "…/././…. CLT_NODES_LabelA.csv" AS csvline
> // TOINT stores the id as an integer, so later TOINT(...) lookups can match
> MERGE (new_node_A:Skewer {my_node_id: TOINT(csvline.node_unique_number),
>                           property1: csvline.property1})
> // MERGE matches only on :Skewer plus the my_node_id and property1 values!
> // no other labels, no other properties are taken into account
> // AFAIR we do not need `ON CREATE SET` here; do you really care whether it
> // is a new node or one that was created earlier?
> SET
> new_node_A:LabelA,
> new_node_A.Property2 = csvline.Property2,  
> new_node_A.Property3 = csvline.Property3,  
> new_node_A.Property4 = csvline.Property4;
>
> LOAD CSV WITH HEADERS FROM "…/././…. CLT_NODES_LabelJ.csv" AS csvline
> // TOINT stores the id as an integer, so later TOINT(...) lookups can match
> MERGE (new_node_J:Skewer {my_node_id: TOINT(csvline.node_unique_number),
>                           property1: csvline.property1})
> // MERGE matches only on :Skewer plus the my_node_id and property1 values!
> // no other labels, no other properties are taken into account
> // AFAIR we do not need `ON CREATE SET` here; do you really care whether it
> // is a new node or one that was created earlier?
> SET
> new_node_J:LabelJ,
> new_node_J.Property2 = csvline.Property2,  
> new_node_J.Property3 = csvline.Property3,  
> new_node_J.Property4 = csvline.Property4;
>
>
> What you get by doing things this way:
>
>
>    1. When loading the LabelA .csv you will create whichever uniquely 
>    numbered nodes are not already in the database, fill their properties 
>    (or maybe overwrite them?) and label the node (be it a new or an 
>    existing one) with LabelA, no matter what other labels the node may 
>    already have,
>    
>  JFM: OK.  I get it.

>
>    2. When loading the LabelJ .csv you *again* will create whichever 
>    uniquely numbered nodes are not already in the database, *again* either 
>    fill or overwrite properties, and *again* label the node (be it a new or 
>    an existing one) with LabelJ, no matter what other labels the node may 
>    already have,
>    
>  JFM: OK.  I get it.

>
>    3. so if you created some node with the first file and labeled it LabelA, 
>    and the same unique *my_node_id* occurs in both the first and second 
>    files, your node will get 2 labels, LabelA and LabelJ.
>    
> JFM: That's what I want!! 
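
The three points above can be mimicked with a toy Python model. This is only a sketch of the described behaviour, not Neo4j itself: the "database" is a plain dict, the function name `merge_and_label` is hypothetical, and for brevity nodes are matched on `my_node_id` alone, whereas the MERGE in the thread also matches on `property1`.

```python
def merge_and_label(db, rows, label):
    """Toy stand-in for: MERGE on my_node_id, then SET a label and properties."""
    for row in rows:
        # Create the node only if this my_node_id is not in the "database" yet.
        node = db.setdefault(row["my_node_id"], {"labels": {"Skewer"}, "props": {}})
        node["labels"].add(label)           # labels already on the node are kept
        node["props"].update(row["props"])  # same-named properties are overwritten
    return db


db = {}
# First "file": nodes 1 and 2 get LabelA.
merge_and_label(db, [{"my_node_id": 1, "props": {"Property2": "a"}},
                     {"my_node_id": 2, "props": {"Property2": "b"}}], "LabelA")
# Second "file": node 1 appears again and additionally gets LabelJ.
merge_and_label(db, [{"my_node_id": 1, "props": {"Property2": "z"}}], "LabelJ")

print(sorted(db[1]["labels"]), sorted(db[2]["labels"]))
```

Node 1 ends up with both labels because its id occurs in both inputs, while node 2 keeps only the label from the first one.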

>  
>
>> Q5: Since I think of my data in terms of the two classes of nodes in my 
>> Data model …[CLT_SOURCE --> CLT_TARGET ;  CLT_TARGET -->  CLT_SOURCE],  after 
>> loading the nodes, how do I then get two classes of nodes?
>>
>
> Make them 2 labels: CLTSource and CLTTarget respectively.
>  
>

JFM: OK.  Regarding the labels... my csv file has a column called DESC that 
has two values, CLT_SOURCE and CLT_TARGET.  You are saying that my Source csv 
should have a CLT_SOURCE column and my Target csv should have a CLT_TARGET 
column?  My csv files should NOT have the configuration I described?

JFM: My csv file's A thru J columns have these numbers of values: A (2), 
B (1), C (4), D (83), E (83), F (11), G (11), H (83), J (83), K (2).  So I 
should have A LOT of csv files instead of just two for nodes!

 

> Q6: Is there a step missing that explains how the code below got to have a 
>> “source_node” and a “dest_node” that appears to correspond to my CLT_SOURCE 
>> and CLT_TARGET nodes?
>>
>
> // suppose we coded relationships as 2 my_node_id's of nodes
> LOAD CSV FROM "...somewhere..." AS csvline
> MATCH (s:CLTSource:Skewer {my_node_id: TOINT(csvline[0])})
> USING INDEX s:Skewer(my_node_id)
> // csvline must be carried through WITH, or it is out of scope below
> WITH s, csvline
> MATCH (t:CLTTarget:Skewer {my_node_id: TOINT(csvline[1])})
> USING INDEX t:Skewer(my_node_id)
> MERGE (s)-[r:MY_RELATIONSHIP_TYPE]->(t)
> SET
> r.prop1 = 'smth';
>
>
>
JFM: What I am not getting from this is that there is one csv file that has 
the CLTSOURCE and CLTTARGET labels in it. That contradicts what I said above, 
because that would make only 1 csv file.  I assume there is one LOAD CSV 
statement, and that my_node_id: TOINT(csvline[0]) and 
my_node_id: TOINT(csvline[1]) presumably refer to two lines in that file.
 

>
> 4. Now when you are done with nodes and start doing LOAD CSV for 
>>> relationships, you may give the MATCH statement, which looks up your pair 
>>> of nodes, a hint for fast lookup, like
>>>
>>> LOAD CSV ...from somewhere... AS csvline
>>> MATCH (source_node:Skewer {my_node_id: ToInt(csvline[0])}),
>>>       (dest_node:Skewer {my_node_id: ToInt(csvline[1])})
>>> CREATE (source_node)-[r:MY_REL_TYPE {rel_prop_00: csvline[2], ..., 
>>> rel_prop_NN: csvline[ZZ]}]->(dest_node);
>>>
>>>
>> Q6: This LOAD CSV  command (line 1) looks into the separate REL.csv file 
>> you mentioned first right?  
>>
>
> Yep
>  
>
>> Q7: csvline is some sort of temp file that is a series of lines of the 
>> csv file? 
>>
>
> It is a variable: a collection which is filled with the column values of 
> the .csv, line by line. You can use it either as an array, referring to 
> fields by their index (my preferred way), or, if you use `WITH HEADERS` 
> mode, as a keyed map. See 
> http://neo4j.com/docs/2.1.6/cypherdoc-importing-csv-files-with-cypher.html
>
> Q8: Do you imply in line 2 that the REL.csv file has headers that include  
>> source_node, dest_node ?
>>
>
> No I don't use headers so I refer to csvline fields by their index 
> ("collection mode")
>  
>
>> Q9: While I see how Skewer is a label,  how is my_node_id a  property 
>> (line 2) ? 
>>
>
> Because it IS a property of a node, and you build constraint & index *on 
> this exact property* inside the scope of a label :Skewer
>  
>

JFM: OK.
 

> Q10: How does my_node_id relate to either ToInt(csvline[0]) or 
>> ToInt(csvline[1])  (line 2) ?
>>
>
> For .csv with relationships, csvline[0] is a value of *my_node_id *property 
> of the *source* node, csvline[1] is a value of *my_node_id *property of 
> the *target* node, and TOINT() type conversion is used because my 
> personal preference is to use integers for ids.
>  
>
>> Is it that ToInt(csvline[0]) refers to a line of the REL.csv file?  
>>
>> Does csvline[0] refer to a column in REL.csv as do csvline[2] and 
>> csvline[ZZ] (line 3) ?
>>
>
>
JFM: OK, I think I get it.
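
As an analogy only, Python's `csv` module shows the same two access styles that `csvline` has in LOAD CSV: a positional list when there are no headers, and a keyed map when there are. The sample data below is made up, and the column name `node_unique_number` is borrowed from the node-loading snippet earlier in the thread. The last lines show why the explicit integer conversion matters: a value read from a CSV is a string until converted.

```python
import csv
import io

REL_CSV = "10000017,20000362\n10000017,20000547\n"          # no header row
NODE_CSV = "node_unique_number,property1\n10000017,foo\n"   # with header row

# Without headers: each line is a list, fields are addressed by position.
for csvline in csv.reader(io.StringIO(REL_CSV)):
    source_id, target_id = int(csvline[0]), int(csvline[1])

# With headers: each line is a mapping keyed by column name.
for csvline in csv.DictReader(io.StringIO(NODE_CSV)):
    node_id = int(csvline["node_unique_number"])

# A raw field is a string, so it does not compare equal to the integer id.
raw = next(csv.reader(io.StringIO(REL_CSV)))[0]
print(raw == 10000017, int(raw) == 10000017)  # False True
```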
 

> I think you can combine the import of multiple .CSV files in a single LOAD 
> CSV statement, but I have never tried this mode.
>
> WBR,
> Andrii
>  
>

JFM: Thanks!

 

> Adding the *`:Skewer`* label in MATCH will tell Cypher to (implicitly) use 
>>> your index on *my_node_id*, which was created when you created your 
>>> constraint. Or you may try to explicitly give it a hint to use the index, 
>>> with a USING INDEX... clause after MATCH and before CREATE. Btw some 
>>> earlier versions of Neo4j refused to use the index in LOAD CSV for some 
>>> reason; I hope this problem is gone with 2.1.5.
>>>
>>> OK
>>  
>>
>>> 5. While importing, be careful to *explicitly specify type conversions 
>>> for each property which is not a string*. I have seen numerous 
>>> occasions when people missed ToInt(csvline[i]) or ToFloat(csvline[j]), and 
>>> Cypher silently stored their (supposed) numerics as strings. It's Ok, dude, 
>>> you say it :) This leads to confusion afterwards when, say, numerical 
>>> comparisons don't match and so on (though it's easy to correct with a 
>>> single Cypher command, but anyway).
>>>
>>> Think I did that re. type conversion.  Only applies to properties for my 
>> data.
>>   
>> Sorry for so many questions.  I am really interested in figuring this out!
>>
>> Thanks loads,  
>> Jose
>>
>
