[Neo4j] Re: large cypher statements

Andrii Stesin Mon, 01 Dec 2014 14:09:59 -0800

Hi José,

On Monday, December 1, 2014 12:33:58 AM UTC+2, José F. Morales wrote:
>
> Ok, but how many valid distinct combinations of your 10 node labels may 
>> exist? 
>>
>
> JFM: 264
>


This makes me think that maybe your target data model needs some 
refactoring. What are the entities (classes), and what would be better 
treated as attributes? Again, I'm not familiar with LabCard, so if you 
can give some explanation and a publicly available sample dataset, I'd 
take a close look at it.
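
A quick way to see which label combinations actually exist, and how many nodes carry each one, might look like the sketch below. It assumes nothing about your data beyond the nodes already being loaded; the aliases label_set and node_count are just names I picked for illustration.

```cypher
// Group all nodes by their full set of labels and count each group.
MATCH (n)
RETURN labels(n) AS label_set, count(*) AS node_count
ORDER BY node_count DESC;
```

Combinations that occur on only a handful of nodes are often the first candidates to become properties instead of labels.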
 

> JFM:  Like I said, there are 264 unique combinations in all my nodes. Some 
>> are redundant, full spelling of a term/phrase and an abbreviation.  Some 
>> are a code for a term/phrase.  Some were created in anticipation of others 
>> values I would create later.  I am trying to anticipate queries I'll make 
>> later.
>>
>
Once again, I foresee a data modelling issue here.
 

> JFM: Makes sense for speed. I guess it depends upon the size of one's data.
>>>
>>
Sure it does :)
 

> Q3: “Skewer” is just an integer right?  It corresponds in a way to 
>>> my_node_id 
>>>
>>
>> No, it's a label! so in Cypher your node (suppose it has 2 labels :LabelA 
>> and :LabelJ ) is described like
>>
>> MATCH (n:LabelA:LabelJ:Skewer {my_node_id: 123454, p1: 'something',
>>   p2: 'something else', p3: 'etc.'})
>>
>>
> JFM: Got that!
>
> JFM: ok basic question...  MATCH (n:  <---What is "n"? Does it just 
> indicate that its a node of a particular class?  What letter it is is 
> arbitrary right?  Is there a name for what "n" is? For a while there, I 
> thought it was *my_node_ID.  *
>

*n* is just the name of a variable. Cypher, like any other programming 
language, has a notion of a "variable", which has its name and which can 
take different values; here I chose *n* arbitrarily as the variable 
name.
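
A small sketch to illustrate that the name is arbitrary (it reuses LabelA and my_node_id from this thread; some_node is a name invented for the example). Here the variable stands for each matched node in turn, and my_node_id is a property read through it:

```cypher
// Both queries return the same values; only the variable name differs.
MATCH (n:LabelA)
RETURN n.my_node_id;

MATCH (some_node:LabelA)
RETURN some_node.my_node_id;
```

The variable exists only for the duration of one query; nothing called n or some_node is stored in the database.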
 

> Q4: So does repeating the LOAD CSV with each file CLT_NODES_LabelA…J 
>>> combine the various labels and their respective values with their 
>>> corresponding nodes? 
>>>
>>
>> Label is not a variable, it does not have a value. It's just a label, 
>> consider "tag".
>> Also *my_node_id* IS a variable so it does have a value.
>>
>
> JFM: OK, I am not understanding this.  I understood a "Label" as a general 
> category for a node. 
>

That's OK, or maybe even better, think of it as a tag. A node may have 
multiple tags (labels), and they can be added and/or removed.
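
As a rough sketch of adding and removing labels (the id value 123454 and the labels LabelA and LabelJ are the illustrative ones used earlier in this thread, not anything from your real data):

```cypher
// Add a second label to an existing node, looked up by its id property.
MATCH (n:LabelA {my_node_id: 123454})
SET n:LabelJ;

// Remove it again; the node and its properties stay as they were.
MATCH (n:LabelA {my_node_id: 123454})
REMOVE n:LabelJ;

// labels(n) lists whatever labels the node currently carries.
MATCH (n {my_node_id: 123454})
RETURN labels(n);
```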
 

> This was as opposed to a "Property" that was specific to a particular 
> node.  As I understood it, a "Label" has different values.
>

Label is just a label. It doesn't have any value itself; it just marks 
(tags) some (sub)set of your nodes and lets you distinguish between 
them. Labels may overlap. Consider the automotive domain, and let's look 
at a data model for it.

Brand seems to be better modelled as a label. Say `Opel`, `Volvo` or 
`Peugeot`.
Kind of vehicle is definitely(???) a label. Say `Truck`, `SUV`, `Car`.
How to model deeper things depends on what you are trying to achieve.
Is body color a label or a property? Which approach is better: either

MATCH (vhcl:Truck:Volvo {body_color: 'red', VIN: 'VE18727673826812634X65' })

or

MATCH (vhcl:Opel:Yellow:SUV {VIN: 'VE18727673826812634X65'})

? I'm not sure, it depends on the goal. As for me, I'd prefer color to be a 
property of one exact single car (since you can decide to repaint your 
yellow car white or some other color, after all).
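
To make the trade-off concrete, here is a sketch of what "repainting" would look like under each model. The label :White is invented for the example; the rest reuses the two MATCH patterns above.

```cypher
// Color as a property: repainting is a single property update.
MATCH (vhcl:Truck:Volvo {VIN: 'VE18727673826812634X65'})
SET vhcl.body_color = 'white';

// Color as a label: repainting means swapping one label for another.
MATCH (vhcl:Opel:SUV {VIN: 'VE18727673826812634X65'})
REMOVE vhcl:Yellow
SET vhcl:White;
```

With the label approach every possible color becomes one more label, which is one way the number of label combinations can grow quickly.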

But VIN is *definitely* a property of one exact single car.

Is a car's license plate a label or a property? Definitely neither, 
because you can sell your car and the new owner will get another license 
plate for it, so I'd model this as

MATCH (vhcl:Car:Ford {body_color: 'pink', VIN: 'FGT87356873HU8745'})
      -[:HAS_LICENSE_PLATE]->(lp:LicensePlate {state: 'AL', str: 'WH4TWR'})


but as you see, `LicensePlate` obviously should never be mixed with 
either `Car` or `Truck`, so they are different labels which do not 
intersect.
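
A sketch of why the separate node helps when the car is sold: the car node stays untouched and only the relationship changes. The new plate values ('GA', 'N3WPL8') are made up for illustration.

```cypher
// Detach the old plate from the car; both nodes themselves are kept.
MATCH (:Car {VIN: 'FGT87356873HU8745'})-[r:HAS_LICENSE_PLATE]->(:LicensePlate)
DELETE r;

// Attach the new owner's plate, creating the plate node if it is missing.
MATCH (vhcl:Car {VIN: 'FGT87356873HU8745'})
MERGE (lp:LicensePlate {state: 'GA', str: 'N3WPL8'})
MERGE (vhcl)-[:HAS_LICENSE_PLATE]->(lp);
```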

> So that Label could be "Category" and there could be two categories, for 
> example...  CLT_SOURCE and CLT_TARGET .    I thought that makes it like a 
> variable.  If not, the label is all the same on a given set of nodes and 
> what's the point in that?
>  
> JFM: OK, I get that *my_node_id *is a variable.  
>

Agh, exactly.
 

>
>>    1. When doing LabelA .csv you will create whatever uniquely numbered 
>>    nodes were not already in the database, fill their properties (or maybe 
>>    overwrite them?) and label the node (be it new or existing one) with 
>>    LabelA - no matter what other labels the node (possibly) had,
>>    
>>  JFM: OK.  I get it.
>
>>
>>    2. When doing LabelJ .csv you *again* will create whatever uniquely 
>>    numbered nodes were not already in the database, *again* either fill 
>>    or overwrite properties, and *again* label the node (be it new or 
>>    existing one) with LabelJ - no matter what other labels the node 
>>    (possibly) had,
>>    
>>  JFM: OK.  I get it.
>
>>
>>    3. so if you created some node with first file and labeled it LabelA, 
>>    if the same unique *my_node_id* occurs both in first and second files, 
>>    your node will get 2 labels LabelA and LabelJ.
>>    
>> JFM: That's what I want!! 
>

Huh, Ok so far :)
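
For reference, the three steps quoted above could be sketched roughly as follows. The file names, the column layout (id in column 0, one property value in column 1) and the property names p1/p2 are assumptions made for the example. toInteger() is the spelling in recent Neo4j releases of the TOINT() conversion used in this thread.

```cypher
// First file: create the node if its id is new, then tag it LabelA.
LOAD CSV FROM 'file:///CLT_NODES_LabelA.csv' AS csvline
MERGE (n:Skewer {my_node_id: toInteger(csvline[0])})
SET n:LabelA, n.p1 = csvline[1];

// Second file: the same MERGE finds existing nodes by id, so a node
// listed in both files ends up carrying both LabelA and LabelJ.
LOAD CSV FROM 'file:///CLT_NODES_LabelJ.csv' AS csvline
MERGE (n:Skewer {my_node_id: toInteger(csvline[0])})
SET n:LabelJ, n.p2 = csvline[1];
```

Run the two statements one after the other. MERGE on the :Skewer label plus my_node_id is what keeps the second pass from creating duplicates.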
 

> Q5: Since I think of my data in terms of the two classes of nodes in my 
>>> Data model …[CLT_SOURCE —> CLT_TARGET ;  CLT_TARGET —>  CLT_SOURCE],  after 
>>> loading the nodes, how then I get two classes of nodes?
>>>
>>
>> Make them 2 labels: CLTSource and CLTTarget respectively.
>>
>
> JFM: OK.  Regarding the labels...my csv file has a column called DESC that 
> has two values CLT_SOURCE and CLT_TARGET.  You are saying that my Source csv 
> should have a CLT_SOURCE column and my target csv should have a 
> CLT_TARGET column?  My csv files should NOT have the configuration I 
> described?
>

What does CLT really mean in real life? I failed to parse it :( sorry for 
that. Once again, if you describe the LabCard domain and provide me 
with a dataset, I'd be able to give you some better ideas (this may also 
become a good tutorial sample case for future Neo4j users).
 

> JFM: Since my csv file has its A thru J columns  A (2) values, B (1), C 
> (4) D (83), E (83), F (11) G (11) H (83) J (83), K (2), I should have A LOT 
> of csv files instead of just two for nodes!
>

Again, I strongly suspect a data modelling issue here.
  

> JFM: What I am not getting from this is there is one csv file that has the 
>>> CLTSOURCE and CLTTARGET labels in it. That contradicts what I said above 
>>> because that would make only 1 csv file.  I assume this there is one LOAD 
>>> CSV statement and the my_node_ID:TOINT(csvline(0)})  and 
>>>  my_node_ID:TOINT(csvline(1)}) refer presumably to two lines in that file.
>>>
>>
As soon as you have both src and target nodes inside the database, you 
need a .csv file which describes only the relationships: the 1st column 
contains the src node ids and the 2nd column contains the dst node ids, so 
that 1 row of the .csv describes exactly 1 relationship between a (linked) 
pair of nodes.
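
A minimal sketch of such a relationship import, assuming the nodes were loaded with the :Skewer label and my_node_id property as discussed earlier. The relationship type LINKS_TO is invented for the example, and toInteger() is the newer spelling of TOINT().

```cypher
// Each row: column 0 = source node id, column 1 = target node id.
LOAD CSV FROM 'file:///REL.csv' AS csvline
MATCH (src:Skewer {my_node_id: toInteger(csvline[0])})
MATCH (dst:Skewer {my_node_id: toInteger(csvline[1])})
MERGE (src)-[:LINKS_TO]->(dst);
```

Rows whose ids do not match any existing node simply produce no relationship, since both MATCH clauses must succeed.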

>>>> For .csv with relationships, csvline[0] is a value of *my_node_id *property 
>>>> of the *source* node, csvline[1] is a value of *my_node_id *property 
>>>> of the *target* node, and TOINT() type conversion is used because my 
>>>> personal preference is to use integers for ids.
>>>>
>>>  
>>
>>> Is it that TOINT(csvline[0]) refers to a line of the REL.csv file?  
>>>
>>> Does csvline[0] refer to a column in REL.csv as do csvline[2] and 
>>> csvline[ZZ] (line 3) ?
>>>
>>
>>
> JFM: OK, I think I get it.
>  
>
>> I think you can combine import of multiple .CSV files in a single LOAD 
>> CSV statement but I didn't ever try this mode.
>>
>> WBR,
>> Andrii
>>  
>>
>
> JFM: Thanks!
>

:)

WBR,
Andrii
