On Saturday, November 29, 2014 6:35:33 AM UTC-5, Andrii Stesin wrote:
>
> Hi Jose,
>
> On Saturday, November 29, 2014 1:12:52 AM UTC+2, José F. Morales wrote:
>>
>>
>>> 1. On nodes and their labels. First of all, I strongly suggest that you
>>> separate your nodes into different .csv files by label. So you won't have a
>>> column *`label`* in your .csv but rather a set of files:
>>>
>>> nodes_LabelA.csv
>>> ...
>>> nodes_LabelZ.csv
>>>
>>> whatever your labels are. (Consider a label to be a kind of synonym for
>>> `class` in object-oriented programming or `table` in an RDBMS). That's due to
>>> the fact that labels in Cypher are somewhat specific entities and you probably
>>> won't be allowed to parameterize them as variables inside your LOAD
>>> CSV statement.
>>>
>>>
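
A minimal sketch of the one-file-per-label idea, assuming a hypothetical file location and a header column named my_node_id (neither is given in this thread). The point it illustrates is that the label is written literally in the statement rather than read from a CSV column:

```cypher
// Hypothetical path and header name, for illustration only.
// The label :LabelA is spelled out in the query; it is not taken from the file.
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:///path/to/nodes_LabelA.csv" AS csvline
CREATE (n:LabelA {my_node_id: toInt(csvline.my_node_id)});
```

The same statement would be repeated for nodes_LabelB.csv and so on, changing only the file name and the literal label.
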
>> OK, so you have modified your original idea of putting the db into two
>> files: 1 for nodes, 1 for relationships. Now you say to put the nodes into
>> 1 file per label. The way I have worked with it, I created 1 file for a
>> class of nodes I'll call CLT_SOURCE and another file for a class of nodes
>> called CLT_TARGET.
>>
>
> Ok, but how many valid distinct combinations of your 10 node labels may
> exist?
>
JFM: 264
> I was speaking about a simple case where you have some limited number of
> possible node labels (or their combinations), say less than 10.
>
JFM: Lot more than that.
>
> You are recommending that with the nodes, I take two steps...
>
>> 1) Combine my CLT_SOURCE and CLT_TARGET nodes,
>>
>
> Not necessarily "combine", but just give each node a unique (temporary)
> *my_node_id*; see my "10+M tree" example below.
>
>
>> 2) then I split that file into files that correspond to the node:
>> *my_node_id, * 1 label, and then properties P1...Pn. Since I have 10
>> Labels/node, I should have 10 files named..... Nodes_LabelA...
>> Nodes_LabelJ. Thus...
>>
>
> You may have as many labels per node as you wish, but it is all about how
> many valid distinct combinations of labels you have. (One single label is a
> combination itself, obviously).
>
> If you have some limited quantity of valid label combinations it's one
> story. But if we are talking about something on the order of 10! possible
> valid combinations, the story is somewhat more interesting :) Which setup is
> yours?
>
JFM: Like I said, there are 264 unique combinations in all my nodes. Some
are redundant: the full spelling of a term/phrase and an abbreviation. Some
are a code for a term/phrase. Some were created in anticipation of other
values I would create later. I am trying to anticipate queries I'll make
later.
>
>
>> File: CLT_Nodes-LabelA columns: *my_node_id,* label A, property
>> P1..., property P4
>> ...
>> File: CLT_Nodes-LabelJ columns: *my_node_id,* label J, property
>> P1..., property P4
>>
>>
>> Q1: What are the rules about what can be used for *my_node_id? *I have
>> usually seen them as a letter integer combination. Is that the convention?
>> Sometimes I've seen a letter being used with a specific class of nodes
>> a1..a100 for one class and b1..b100 for another. I learned the hard way
>> that you have to give each node a unique ID. I used CLT_1...CLT_n for my
>> CLT_SOURCE nodes and CLT_TARGET_1...CLT_TARGET_n for my TARGET nodes.
>> It worked with the smaller db I made. Anything wrong using the convention
>> n1...n100?
>>
>
> I'm not aware of any conventions here; the only thing I know for sure is
> that a *schema index works much(!) faster on plain integers than on Unicode
> strings*. That's the only difference which I consider significant. So my
> personal preference is for *my_node_id* to be a unique integer. Once,
> when importing 10+ million nodes into a tree with variable height [1..7]
> where each level of nodes was in a separate file (because of each level's own
> unique label and unique set of properties), I just chose a scheme for
> numbering them like
>
>
JFM: Makes sense for speed. I guess it depends upon the size of one's data.
> :Skewer:Level1 my_node_id = 10000000 + file1.csv line number
> :Skewer:Level2 my_node_id = 20000000 + file2.csv line number
> ...
> :Skewer:Level7 my_node_id = 70000000 + file7.csv line number
>
> so the relationship file (all relationships were of the same single type)
> became a simple 2-column .csv like this, with 10+ million lines
>
> 10000017,20000362
> 10000017,20000547
> 10000017,40083215
> 10000018,30000397
> ...
>
> After successfully importing the 7 node files (so the nodes were ready in the
> db and indexed on their unique *my_node_id* under the label :Skewer) I split
> relationships.csv into 1000+ files with 10000 lines each and wrote a dumb
> shell script which loaded them with `neo4j-shell -c` file by file, doing
> `sleep 60` between files (to give neo4j a minute to complete each batch
> transaction), then started it Friday evening and got my tree ready on Monday
> morning :)
>
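
For comparison, a sketch of a different technique than the manual split described above: letting Cypher commit in batches with USING PERIODIC COMMIT, so the relationship file stays in one piece. The file path and the identifiers parent and child are hypothetical, and MY_REL_TYPE stands in for the real relationship type. Whether this copes with 10+ million lines on a given Neo4j version and heap size is something to test rather than assume:

```cypher
// Hypothetical path; commits after every 10000 rows instead of one huge transaction.
USING PERIODIC COMMIT 10000
LOAD CSV FROM "file:///path/to/relationships.csv" AS csvline
// Look up both end nodes through the unique my_node_id under :Skewer.
MATCH (parent:Skewer {my_node_id: toInt(csvline[0])})
MATCH (child:Skewer {my_node_id: toInt(csvline[1])})
CREATE (parent)-[:MY_REL_TYPE]->(child);
```
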
> If you prefer alphanumerics for my_node_id it's completely up to you :)
> Anyway, after successful import you may prefer to remove those temporary
> ids completely from the database, just to conserve space where properties
> are stored.
>
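
A sketch of what removing the temporary ids could look like once the import is finished, assuming they live in my_node_id under the :Skewer label and that a uniqueness constraint was created on that property. The batch size of 100000 is an arbitrary assumption:

```cypher
// Drop the uniqueness constraint first (its backing index goes with it).
DROP CONSTRAINT ON (every_node:Skewer) ASSERT every_node.my_node_id IS UNIQUE;

// Remove the temporary property in batches; re-run until no properties are changed.
MATCH (n:Skewer)
WHERE has(n.my_node_id)
WITH n LIMIT 100000
REMOVE n.my_node_id;
```

Only do this after all relationships are loaded, since the relationship import looks nodes up by my_node_id.
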
>
JFM: OK. Sounds good.
> 2. Then consider one additional "technological" label, let's name it
>>> `:Skewer` because it will "penetrate" all your nodes of every different
>>> label (class) like a kebab skewer.
>>>
>>> Before you start (or at least before you start importing relationships)
>>> do
>>>
>>> CREATE CONSTRAINT ON (every_node:Skewer) ASSERT every_node.my_node_id
>>> IS UNIQUE;
>>>
>>>
>> Q2: Should I do scenario 1 or 2?
>>
>> Scenario 1: add two labels to each file? One from my original nodes and
>> one as "Skewer"
>>
>> File 1: CLT_Nodes-LabelA columns: *my_node_id,* label A, *Skewer*,
>> property P1..., property P4
>> ...
>> File 2: CLT_Nodes-LabelJ columns: *my_node_id,* label J, *Skewer*,
>> property P1..., property P4
>>
>> OR
>>
>> Scenario 2: Include an eleventh file thus....
>>
>> File 11: CLT_Nodes-LabelK columns: *my_node_id,* *Skewer*,
>> property P1..., property P4
>>
>> From below, I think you mean Scenario 1.
>>
>
> Yes, and you don't need to add a column for the :Skewer label to a file; the
> LOAD CSV statement should assign it.
>
>
JFM: OK. Sounds good.
> Q3: “Skewer” is just an integer right? It corresponds in a way to
>> my_node_id
>>
>
> No, it's a label! so in Cypher your node (suppose it has 2 labels :LabelA
> and :LabelJ ) is described like
>
> MATCH (n:LabelA:LabelJ:Skewer {my_node_id: 123454, p1: 'something',
>        p2: 'something else', p3: 'etc.'})
>
>
JFM: Got that!
JFM: OK, basic question... MATCH (n: <---What is "n"? Does it just
indicate that it's a node of a particular class? Which letter it is is
arbitrary, right? Is there a name for what "n" is? For a while there, I
thought it was *my_node_id*.
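
On the "n" question, a small sketch: in Cypher the name before the colon is an identifier (a variable) that is bound to each matched node for the rest of the query, and its spelling is arbitrary. The label :LabelA and the property my_node_id are reused from this thread; some_node is just an illustrative name:

```cypher
// `n` is an identifier bound to each node that matches the pattern.
MATCH (n:LabelA)
RETURN n.my_node_id;

// The same query; only the identifier's name differs.
MATCH (some_node:LabelA)
RETURN some_node.my_node_id;
```
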
> Here is some sort of cypher….
>
>>
>> //Creating the nodes
>>
>>
>>
>> USING PERIODIC COMMIT 1000
>>
>> LOAD CSV WITH HEADERS FROM "…/././…. CLT_NODES_LabelA.csv" AS csvline
>>
>> MERGE (my_node_id:Skewer: LabelA {property1: csvline.property1})
>>
>> ON CREATE SET
>>
>> n.Property2 = csvline.Property2,
>>
>> n.Property3 = csvline.Property3,
>>
>> n.Property4 = csvline.Property4;
>>
>>
>> ….
>> LOAD CSV WITH HEADERS FROM "…/././…. CLT_NODES_LabelJ.csv" AS csvline
>>
>>
>>
>> MERGE (my_node_id:Skewer: LabelJ {property1: csvline.property1})
>>
>> ON CREATE SET
>>
>> n.Property2 = csvline.Property2,
>>
>> n.Property3 = csvline.Property3,
>>
>> n.Property4 = csvline.Property4;
>>
>>
>>
>>
>> Q4: So does repeating the LOAD CSV with each file CLT_NODES_LabelA…J
>> combine the various labels and their respective values with their
>> corresponding nodes?
>>
>
> A label is not a variable; it does not have a value. It's just a label,
> think "tag".
> Also, *my_node_id* IS a variable, so it does have a value.
>
>
JFM: OK, I am not understanding this. I understood a "Label" as a general
category for a node. This was as opposed to a "Property" that was specific
to a particular node. As I understood it, a "Label" has different values.
So that Label could be "Category" and there could be two categories, for
example... CLT_SOURCE and CLT_TARGET . I thought that makes it like a
variable. If not, the label is all the same on a given set of nodes and
what's the point in that?
JFM: OK, I get that *my_node_id *is a variable.
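
A sketch of the label-versus-property distinction being discussed, using CLT_SOURCE and CLT_TARGET as labels. The property name category and the id values are hypothetical, added only to contrast the two approaches:

```cypher
// Labels are tags: a node either carries :CLT_SOURCE or it does not.
CREATE (a:Skewer:CLT_SOURCE {my_node_id: 1});
CREATE (b:Skewer:CLT_TARGET {my_node_id: 2});

// A property holds a value; 'category' is a made-up property name.
CREATE (c:Skewer {my_node_id: 3, category: 'CLT_SOURCE'});

// Selecting by label versus selecting by property value.
MATCH (n:CLT_SOURCE) RETURN n.my_node_id;
MATCH (n:Skewer) WHERE n.category = 'CLT_SOURCE' RETURN n.my_node_id;
```

So "Category" with two possible values maps either to two separate labels or to one property with two values; the label itself never holds a value.
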
> Looking at your 2 code snippets - in case you hope that the first one will
> create a node with LabelA and the second one will assign LabelJ to a node
> which was created earlier, you are wrong.
>
> But... if you remove the labels from MERGE, it will work, but look closely
> here:
>
> LOAD CSV WITH HEADERS FROM "…/././…. CLT_NODES_LabelA.csv" AS csvline
> // toInt() so the id is stored as an integer and matches later TOINT() lookups
> MERGE (new_node_A:Skewer {my_node_id: toInt(csvline.node_unique_number),
>                           property1: csvline.property1})
> // only :Skewer, my_node_id and property1 are taken into account by MERGE!
> // no other labels, no other properties are taken care of
> // AFAIR we do not need `ON CREATE SET` here, do you really care whether it
> // is a new node or it was created earlier?
> SET
> new_node_A:LabelA,
> new_node_A.Property2 = csvline.Property2,
> new_node_A.Property3 = csvline.Property3,
> new_node_A.Property4 = csvline.Property4;
>
> LOAD CSV WITH HEADERS FROM "…/././…. CLT_NODES_LabelJ.csv" AS csvline
> // toInt() so the id is stored as an integer and matches later TOINT() lookups
> MERGE (new_node_J:Skewer {my_node_id: toInt(csvline.node_unique_number),
>                           property1: csvline.property1})
> // only :Skewer, my_node_id and property1 are taken into account by MERGE!
> // no other labels, no other properties are taken care of
> // AFAIR we do not need `ON CREATE SET` here, do you really care whether it
> // is a new node or it was created earlier?
> SET
> new_node_J:LabelJ,
> new_node_J.Property2 = csvline.Property2,
> new_node_J.Property3 = csvline.Property3,
> new_node_J.Property4 = csvline.Property4;
>
>
> What you get if doing things this way:
>
>
> 1. When doing the LabelA .csv you will create whatever uniquely numbered
> nodes were not already in the database, fill their properties (or maybe
> overwrite them?) and label the node (be it a new or an existing one) with
> LabelA, no matter what other labels the node (possibly) had,
>
> JFM: OK. I get it.
>
> 2. When doing the LabelJ .csv you *again* will create whatever uniquely
> numbered nodes were not already in the database, *again* either fill
> or overwrite properties, and *again* label the node (be it a new or an
> existing one) with LabelJ, no matter what other labels the node
> (possibly) had,
>
> JFM: OK. I get it.
>
> 3. so if you created some node with the first file and labeled it LabelA,
> and the same unique *my_node_id* occurs in both the first and second files,
> your node will get 2 labels, LabelA and LabelJ.
>
> JFM: That's what I want!!
>
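
One way to check the outcome described in the three points above, sketched under the assumption that both files have already been loaded and some ids occur in both:

```cypher
// List a few nodes that ended up carrying both labels, with all their labels.
MATCH (n:Skewer)
WHERE n:LabelA AND n:LabelJ
RETURN n.my_node_id, labels(n)
LIMIT 10;
```

If the query returns no rows, no my_node_id was shared between the two files (or the ids were stored with different types in the two loads).
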
>
>> Q5: Since I think of my data in terms of the two classes of nodes in my
>> Data model …[CLT_SOURCE —> CLT_TARGET ; CLT_TARGET —> CLT_SOURCE], after
>> loading the nodes, how do I then get two classes of nodes?
>>
>
> Make them 2 labels: CLTSource and CLTTarget respectively.
>
>
JFM: OK. Regarding the labels... my csv file has a column called DESC that
has two values, CLT_SOURCE and CLT_TARGET. You are saying that my source csv
should have a CLT_SOURCE column and my target csv should have a CLT_TARGET
column? My csv files should NOT have the configuration I described?
JFM: Since my csv file has in its A thru J columns A (2) values, B (1), C (4),
D (83), E (83), F (11), G (11), H (83), J (83), K (2), I should have A LOT of
csv files instead of just two for nodes!
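
A possible alternative to splitting the file by the DESC column, offered as a sketch rather than something confirmed in this thread: filter the rows inside the LOAD CSV statement and assign the literal label per pass. The path and the my_node_id header are hypothetical; DESC is backtick-quoted because it is also a Cypher keyword:

```cypher
// Pass 1: rows whose DESC column says CLT_SOURCE get the :CLTSource label.
LOAD CSV WITH HEADERS FROM "file:///path/to/nodes.csv" AS csvline
WITH csvline
WHERE csvline.`DESC` = 'CLT_SOURCE'
MERGE (n:Skewer {my_node_id: toInt(csvline.my_node_id)})
SET n:CLTSource;

// Pass 2: same file, other value, other literal label.
LOAD CSV WITH HEADERS FROM "file:///path/to/nodes.csv" AS csvline
WITH csvline
WHERE csvline.`DESC` = 'CLT_TARGET'
MERGE (n:Skewer {my_node_id: toInt(csvline.my_node_id)})
SET n:CLTTarget;
```

This trades extra passes over one file for fewer files; with columns that take 83 distinct values, one pass per value may be no more convenient than separate files.
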
> Q6: Is there a step missing that explains how the code below got to have a
>> “source_node” and a “dest_node” that appears to correspond to my CLT_SOURCE
>> and CLT_TARGET nodes?
>>
>
> // suppose we coded relationships as 2 my_node_id's of nodes
> LOAD CSV FROM "...somewhere..." AS csvline
> MATCH (s:CLTSource:Skewer {my_node_id: TOINT(csvline[0])})
> USING INDEX s:Skewer(my_node_id)
> WITH s, csvline
> MATCH (t:CLTTarget:Skewer {my_node_id: TOINT(csvline[1])})
> USING INDEX t:Skewer(my_node_id)
> MERGE (s)-[r:MY_RELATIONSHIP_TYPE]->(t)
> SET
> r.prop1 = 'smth';
>
>
>
JFM: What I am not getting is this: there appears to be one csv file that has
the CLTSOURCE and CLTTARGET labels in it. That contradicts what I said above
because that would make only 1 csv file. I assume there is one LOAD
CSV statement, and my_node_id: TOINT(csvline[0]) and
my_node_id: TOINT(csvline[1]) presumably refer to two lines in that file.
>
> 4. Now when you are done with nodes and start doing LOAD CSV for
>>> relationships, you may give the MATCH statement, which looks up your pair
>>> of nodes, a hint for fast lookup, like
>>>
>>> LOAD CSV ...from somewhere... AS csvline
>>> MATCH (source_node:Skewer {my_node_id: ToInt(csvline[0])}),
>>>       (dest_node:Skewer {my_node_id: ToInt(csvline[1])})
>>> CREATE (source_node)-[r:MY_REL_TYPE {rel_prop_00: csvline[2], ...,
>>> rel_prop_NN: csvline[ZZ]}]->(dest_node);
>>>
>>>
>> Q6: This LOAD CSV command (line 1) looks into the separate REL.csv file
>> you mentioned first right?
>>
>
> Yep
>
>
>> Q7: csvline is some sort of temp file that is a series of lines of the
>> csv file?
>>
>
> This is a variable: a collection which is filled with the column values of
> the .csv, line by line. You can use it either as an array, referring to fields
> by their index (my preferred way), or, if you use `WITH HEADERS` mode, you can
> use it as a keyed map. See
> http://neo4j.com/docs/2.1.6/cypherdoc-importing-csv-files-with-cypher.html
>
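
The two access modes side by side, as a sketch. The file paths and the header names source_id and target_id are hypothetical; the queries only return values, so they can be used to preview a file before importing anything:

```cypher
// Collection mode (no headers): fields are addressed by position.
LOAD CSV FROM "file:///path/to/rel.csv" AS csvline
RETURN toInt(csvline[0]) AS source_id, toInt(csvline[1]) AS target_id
LIMIT 5;

// Map mode (first line holds the headers): fields are addressed by column name.
LOAD CSV WITH HEADERS FROM "file:///path/to/rel_with_headers.csv" AS csvline
RETURN toInt(csvline.source_id) AS source_id, toInt(csvline.target_id) AS target_id
LIMIT 5;
```

In both modes csvline is rebound for every line of the file, so it refers to one row at a time, not to the whole file.
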
> Q8: Do you imply in line 2 that the REL.csv file has headers that include
>> source_node, dest_node ?
>>
>
> No I don't use headers so I refer to csvline fields by their index
> ("collection mode")
>
>
>> Q9: While I see how Skewer is a label, how is my_node_id a property
>> (line 2) ?
>>
>
> Because it IS a property of a node, and you build constraint & index *on
> this exact property* inside the scope of a label :Skewer
>
>
JFM: OK.
> Q10: How does my_node_id relate to either ToInt(csvline[0]) or
>> ToInt(csvline[1]) (line 2) ?
>>
>
> For the .csv with relationships, csvline[0] is the value of the *my_node_id*
> property of the *source* node, csvline[1] is the value of the *my_node_id*
> property of the *target* node, and the TOINT() type conversion is used
> because my personal preference is to use integers for ids.
>
>
>> Is it that ToInt(csvline[0]) refers to a line of the REL.csv file?
>>
>> Does csvline[0] refer to a column in REL.csv as do csvline[2] and
>> csvline[ZZ] (line 3) ?
>>
>
>
JFM: OK, I think I get it.
> I think you can combine the import of multiple .CSV files in a single LOAD CSV
> statement, but I have never tried this mode.
>
> WBR,
> Andrii
>
>
JFM: Thanks!
> Adding the *`:Skewer`* label in MATCH will tell Cypher to (implicitly) use
>>> your index on *my_node_id* which was created when you created your
>>> constraint. Or you may try to explicitly give it a hint to use the index,
>>> with a USING INDEX... clause after MATCH, before CREATE. Btw some earlier
>>> versions of Neo4j refused to use the index in LOAD CSV for some reason; I hope
>>> this problem is gone with 2.1.5.
>>>
>>> OK
>>
>>
>>> 5. While importing, be careful to *explicitly specify type conversions
>>> for each property which is not a string*. I have seen numerous
>>> occasions when people missed ToInt(csvline[i]) or ToFloat(csvline[j]) - and
>>> Cypher silently stored their (supposed) numerics as strings. It's Ok, dude,
>>> you say it :) This led to confusion afterwards when, say, numerical
>>> comparisons didn't match and so on (though it's easy to correct with a
>>> single Cypher command, but anyway).
>>>
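
A sketch of the kind of single corrective command mentioned above, assuming the affected property is called Property2 on :Skewer nodes (a stand-in name; substitute the real one). How toInt() treats non-numeric strings may depend on the Neo4j version, so trying this on a copy of the data first seems prudent:

```cypher
// Rewrite a numeric property that was accidentally stored as a string.
MATCH (n:Skewer)
WHERE has(n.Property2)
SET n.Property2 = toInt(n.Property2);
```

For decimal values, toFloat() would take the place of toInt().
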
>>> Think I did that re. type conversion. Only applies to properties for my
>> data.
>>
>> Sorry for so many questions. I am really interested in figuring this out!
>>
>> Thanks loads,
>> Jose
>>
>