[Neo4j] Re: large cypher statements

José F . Morales Thu, 04 Dec 2014 15:09:42 -0800

Andrii and Michael,

Sorry for the delay in responding. I was a little under the weather.
ANYHOW, it looks like I figured out how to do the data loading! I was 
trying several approaches, and the one using Michael's shell tools seems to 
have worked! Info from Andrii proved important as well (my_node_ID as 
integer).  Loading the 18k NODES took seconds. When I tested the RELS with 
a tiny data set it worked perfectly.  I am cleaning up the 52k RELS file 
after the first attempt failed because of a missing " ' ".


My only issue is that the RELs loading is slow....

commit after 1000 row(s)  0. 1%: nodes = 0 rels = 1000 properties = 7000 
time 7059450 ms total 7059450 ms

Now I thought that if I created an index (below), it would be faster. 
Apparently not.  

neo4j-sh (?)$ auto-index LC_ID

Enabling auto-indexing of Node properties: [LC_ID]

Do I have this wrong?  Should it have been CREATE INDEX ON :LC_ID?
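
For comparison, a minimal sketch of the schema-index form. In Neo4j 2.x the syntax is CREATE INDEX ON :Label(property), so an index is declared on a label plus a property, not on a property alone. The shell's auto-index command enables the legacy auto-index instead, which a label-based MATCH does not use. The label Skewer below is an assumption borrowed from Andrii's example quoted further down; substitute whatever label the nodes really carry.

```cypher
// Assumption: every node carries the label :Skewer and the id property LC_ID.
CREATE INDEX ON :Skewer(LC_ID);

// Alternative, if LC_ID is unique per node (use one or the other, not both):
// a uniqueness constraint, which is backed by an index of its own.
// CREATE CONSTRAINT ON (n:Skewer) ASSERT n.LC_ID IS UNIQUE;
```

The index only helps when the MATCH in the RELS load names the same label and property, for example (n:Skewer {LC_ID: ...}).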

Jose

On Monday, December 1, 2014 5:09:36 PM UTC-5, Andrii Stesin wrote:
>
> Hi José,
>
> On Monday, December 1, 2014 12:33:58 AM UTC+2, José F. Morales wrote:
>>
>> Ok, but how many valid distinct combinations of your 10 node labels may 
>>> exist? 
>>>
>>
>> JFM: 264
>>
>
> This makes me think that maybe your target data model needs some 
> refactoring. What are the entities (classes), and what can be better 
> considered as attributes? Again, I'm not familiar with LabCard, so in case 
> you give some explanations and a sample dataset which is publicly 
> available, I'd take a close look at it.
>  
>
>> JFM:  Like I said, there are 264 unique combinations in all my nodes. 
>>> Some are redundant, full spelling of a term/phrase and an abbreviation. 
>>>  Some are a code for a term/phrase.  Some were created in anticipation of 
>>> other values I would create later.  I am trying to anticipate queries I'll 
>>> make later.
>>>
>>
> Once again, I foresee a data modelling issue here.
>  
>
>> JFM: Makes sense for speed. I guess it depends upon the size of one's 
>>>> data.
>>>>
>>>
> Sure it does :)
>  
>
>> Q3: “Skewer” is just an integer right?  It corresponds in a way to 
>>>> my_node_id 
>>>>
>>>
>>> No, it's a label! so in Cypher your node (suppose it has 2 labels 
>>> :LabelA and :LabelJ ) is described like
>>>
>>> MATCH (n:LabelA:LabelJ:Skewer {my_node_id: 123454, p1: 'something',
>>>   p2: 'something else', p3: 'etc.'})
>>>
>>>
>> JFM: Got that!
>>
>> JFM: ok basic question...  MATCH (n:  <---What is "n"? Does it just 
>> indicate that it's a node of a particular class?  Which letter it is is 
>> arbitrary, right?  Is there a name for what "n" is? For a while there, I 
>> thought it was *my_node_ID*.
>>
>
> *n* is just the name of a variable. Cypher, like any other programming 
> language, has a notion of a "variable", which has a name and can take 
> different values; here I've chosen *n* arbitrarily as the variable 
> name.
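
A small sketch of the point about variable names. The two queries are meant to be equivalent; only the variable name differs. LabelA, my_node_id and p1 come from the example above, and the id value is purely illustrative.

```cypher
// "n" is bound to each matched node for the rest of this query.
MATCH (n:LabelA {my_node_id: 123454})
RETURN n.p1;

// The same query with a different, equally arbitrary variable name.
MATCH (whatever:LabelA {my_node_id: 123454})
RETURN whatever.p1;
```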
>  
>
>> Q4: So does repeating the LOAD CSV with each file CLT_NODES_LabelA…J 
>>>> combine the various labels and their respective values with their 
>>>> corresponding nodes? 
>>>>
>>>
>>> Label is not a variable, it does not have a value. It's just a label, 
>>> consider "tag".
>>> Also *my_node_id* IS a variable so it does have a value.
>>>
>>
>> JFM: OK, I am not understanding this.  I understood a "Label" as a 
>> general category for a node. 
>>
>
> That's OK, or maybe it's even better to imagine a tag. A node may have 
> multiple tags (labels), and they can be added and/or removed.
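
Adding and removing a label might look like the sketch below. The labels and the my_node_id value are the illustrative ones used earlier in this thread, not real data.

```cypher
// Add a second label to an existing node.
MATCH (n:LabelA {my_node_id: 123454})
SET n:LabelJ;

// Remove it again; the node and its properties stay untouched.
MATCH (n:LabelA {my_node_id: 123454})
REMOVE n:LabelJ;

// List the labels currently on the node.
MATCH (n {my_node_id: 123454})
RETURN labels(n);
```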
>  
>
>> This was as opposed to a "Property" that was specific to a particular 
>> node.  As I understood it, a "Label" has different values.
>>
>
> A label is just a label. It doesn't have any value itself; it just marks 
> (tags) some (sub)set of your nodes and allows you to distinguish between 
> them. Labels may overlap. Consider the automotive domain, and let's look at 
> a data model for it.
>
> Brand seems to be better modelled as a label. Say `Opel`, `Volvo` or 
> `Peugeot`.
> Kind of vehicle is definitely(???) a label. Say `Truck`, `SUV`, `Car`.
> How to model deeper things depends on what you are going to achieve.
> Is body color a label or a property? Which approach is better: either
>
> MATCH (vhcl:Truck:Volvo {body_color: 'red', VIN: 'VE18727673826812634X65' 
> })
>
> or
>
> MATCH (vhcl:Opel:Yellow:SUV {VIN: 'VE18727673826812634X65'})
>
> ? I'm not sure; it depends on the goal. As for me, I'd prefer color to be a 
> property of one exact single car (since you can decide to repaint your 
> yellow car white or some other color, after all).
>
> But VIN is *definitely* a property of one exact single car.
>
> Is a car's license plate a label or a property? Definitely neither, 
> because you can sell your car and the new owner will get another license 
> plate for it, so I'd model this as
>
> MATCH (vhcl:Car:Ford {body_color: 'pink', VIN: 'FGT87356873HU8745'})
>   -[:HAS_LICENSE_PLATE]->(lp:LicensePlate {state: 'AL', str: 'WH4TWR'})
>
>
> but as you see, `LicensePlate` obviously should never be mixed with 
> either `Car` or `Truck`, so they are different labels which do not 
> intersect.
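
A rough sketch of how the license-plate model above could be created and later changed when the car is sold. The first statement reuses the values from the MATCH example; the replacement plate (state 'GA', str 'N3WPL8') is made up for illustration.

```cypher
// Create the car, its plate, and the relationship between them.
CREATE (vhcl:Car:Ford {body_color: 'pink', VIN: 'FGT87356873HU8745'})
CREATE (lp:LicensePlate {state: 'AL', str: 'WH4TWR'})
CREATE (vhcl)-[:HAS_LICENSE_PLATE]->(lp);

// New owner, new plate: drop the old relationship and attach a new plate.
// The car node itself, including its VIN, is left as it is.
MATCH (vhcl:Car {VIN: 'FGT87356873HU8745'})-[r:HAS_LICENSE_PLATE]->(:LicensePlate)
DELETE r
WITH vhcl
CREATE (vhcl)-[:HAS_LICENSE_PLATE]->(:LicensePlate {state: 'GA', str: 'N3WPL8'});
```

Keeping the plate as a separate node is what makes the second statement possible without touching the car's own labels or properties.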
>
> So that Label could be "Category" and there could be two categories, for 
>> example...  CLT_SOURCE and CLT_TARGET .    I thought that makes it like a 
>> variable.  If not, the label is all the same on a given set of nodes and 
>> what's the point in that?
>>  
>> JFM: OK, I get that *my_node_id *is a variable.  
>>
>
> Agh, exactly.
>  
>
>>
>>>    1. When doing LabelA .csv you will create whatever uniquely numbered 
>>>    nodes were not already in the database, fill their properties (or maybe 
>>>    overwrite them?) and label the node (be it a new or an existing one) 
>>>    with LabelA, no matter what other labels the node (possibly) had,
>>>    
>>>  JFM: OK.  I get it.
>>
>>>
>>>    2. When doing LabelJ .csv you *again* will create whatever uniquely 
>>>    numbered nodes were not already in the database, *again* either fill 
>>>    or overwrite properties, and *again* label the node (be it a new or an 
>>>    existing one) with LabelJ, no matter what other labels the node 
>>>    (possibly) had,
>>>    
>>>  JFM: OK.  I get it.
>>
>>>
>>>    3. so if you created some node with the first file and labeled it 
>>>    LabelA, and the same unique *my_node_id* occurs in both the first and 
>>>    second files, your node will get 2 labels, LabelA and LabelJ.
>>>    
>>> JFM: That's what I want!! 
>>
>
> Huh, Ok so far :)
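
The per-label loading described in the list above can be sketched roughly as follows. This is only an illustration under assumptions: the file paths are hypothetical, column 0 is assumed to hold the node id and column 1 a property value, and the files are assumed to have no header row.

```cypher
// First file: create any missing node, then tag it with LabelA.
USING PERIODIC COMMIT 1000
LOAD CSV FROM 'file:///path/to/CLT_NODES_LabelA.csv' AS csvline
MERGE (n:Skewer {my_node_id: toInt(csvline[0])})
SET n:LabelA, n.p1 = csvline[1];

// Second file: the same id is found by MERGE instead of being created again,
// so a node present in both files ends up with both LabelA and LabelJ.
USING PERIODIC COMMIT 1000
LOAD CSV FROM 'file:///path/to/CLT_NODES_LabelJ.csv' AS csvline
MERGE (n:Skewer {my_node_id: toInt(csvline[0])})
SET n:LabelJ, n.p1 = csvline[1];
```

MERGE is used here rather than CREATE so that re-running a file, or loading a second file with overlapping ids, does not duplicate nodes. SET overwrites p1 if it was already there.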
>  
>
>> Q5: Since I think of my data in terms of the two classes of nodes in my 
>>>> data model …[CLT_SOURCE —> CLT_TARGET ;  CLT_TARGET —>  CLT_SOURCE],  
>>>> after loading the nodes, how do I then get two classes of nodes?
>>>>
>>>
>>> Make them 2 labels: CLTSource and CLTTarget respectively.
>>>
>>
>> JFM: OK.  Regarding the labels...my csv file has a column called DESC 
>> that has two values, CLT_SOURCE and CLT_TARGET.  You are saying that my 
>> source csv should have a CLT_SOURCE column and my target csv should have 
>> a CLT_TARGET column?  My csv files should NOT have the configuration I 
>> described?
>>
>
> What does CLT really mean in real life? I failed to parse it :( sorry for 
> that. Once again, in case you describe the LabCard domain and provide me 
> with a dataset, I'd be able to give you some better ideas (this also may 
> become a good tutorial sample case for future Neo4j users).
>  
>
>> JFM: Since my csv file has its A thru J columns  A (2) values, B (1), C 
>> (4) D (83), E (83), F (11) G (11) H (83) J (83), K (2), I should have A LOT 
>> of csv files instead of just two for nodes!
>>
>
> Again, I strongly suspect a data modelling issue here.
>   
>
>> JFM: What I am not getting from this is that there is one csv file that 
>>>> has the CLTSOURCE and CLTTARGET labels in it. That contradicts what I 
>>>> said above because that would make only 1 csv file.  I assume there is 
>>>> one LOAD CSV statement, and that my_node_ID: TOINT(csvline[0]) and 
>>>> my_node_ID: TOINT(csvline[1]) presumably refer to two lines in that file.
>>>>
>>>
> As soon as you have both src and target nodes inside the database, 
> you need a .csv file which describes only relationships: the 1st column 
> contains src node ids, the 2nd column contains dst node ids, and thus 1 
> row of the .csv describes 1 single relationship per (linked) pair of nodes.
>
> For .csv with relationships, csvline[0] is a value of *my_node_id *property 
>>>>> of the *source* node, csvline[1] is a value of *my_node_id *property 
>>>>> of the *target* node, and TOINT() type conversion is used because my 
>>>>> personal preference is to use integers for ids.
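
A hedged sketch of such a relationship load. The file path, the shared label Skewer and the relationship type LINKS_TO are assumptions for illustration only. Each row of the file is one relationship: csvline[0] is the source id and csvline[1] is the target id.

```cypher
// Assumes an index (or uniqueness constraint) already exists on
// :Skewer(my_node_id); without it every MATCH scans all :Skewer nodes,
// which is a likely cause of a very slow relationship load.
USING PERIODIC COMMIT 1000
LOAD CSV FROM 'file:///path/to/REL.csv' AS csvline
MATCH (src:Skewer {my_node_id: toInt(csvline[0])})
MATCH (dst:Skewer {my_node_id: toInt(csvline[1])})
CREATE (src)-[:LINKS_TO]->(dst);
```

So csvline[N] selects a column of the current row, not a line of the file; LOAD CSV itself walks through the lines one at a time.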
>>>>>
>>>>  
>>>
>>>> Is it that ToInt(csvline[0]) refers to a line of the REL.csv file?  
>>>>
>>>> Does csvline[0] refer to a column in REL.csv as do csvline[2] and 
>>>> csvline[ZZ] (line 3) ?
>>>>
>>>
>>>
>> JFM: OK, I think I get it.
>>  
>>
>>> I think you can combine the import of multiple .CSV files in a single LOAD 
>>> CSV statement, but I have never tried this mode.
>>>
>>> WBR,
>>> Andrii
>>>  
>>>
>>
>> JFM: Thanks!
>>
>
> :)
>
> WBR,
> Andrii
>
