Re: [Neo4j] Re: large cypher statements

Michael Hunger Thu, 04 Dec 2014 16:00:23 -0800

No, not at all.
auto-index is for legacy indexes.
Do the CREATE INDEX that I described earlier.


in your MATCH you _must_ provide the label then.

MATCH (LEFT_NODE:LABEL1 {LC_ID:{LEFT_NODE}}),
      (RIGHT_NODE:LABEL2 {LC_ID:{RIGHT_NODE}})
...

You should also _never_ use #{} expressions for values, *only* for labels
and rel-types.
Only use Cypher parameters: {CAT}.
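
As a rough sketch of how the two points above fit together (an illustration, not a tested import): create a schema index per label first, then match with labels and pass every value as a Cypher parameter, keeping #{} only for the relationship type. LABEL1 and LABEL2 are placeholders for your real labels, and the column names are taken from the statement quoted below. Whether LC_ID needs a toInt() conversion depends on how it was stored on the nodes, so treat that as something to check. The statement is wrapped here only for readability.

```cypher
// Schema indexes (not legacy auto-indexes): one per label that carries LC_ID.
CREATE INDEX ON :LABEL1(LC_ID);
CREATE INDEX ON :LABEL2(LC_ID);

// Statement to hand to import-cypher.
// {NAME} is a Cypher parameter bound from the CSV column NAME;
// #{REL} is kept only for the relationship type, which cannot be a parameter.
MATCH (LEFT_NODE:LABEL1 {LC_ID: {LEFT_NODE}}),
      (RIGHT_NODE:LABEL2 {LC_ID: {RIGHT_NODE}})
CREATE (LEFT_NODE)-[:#{REL} {
  PHYLUM: {PHYLUM}, CAT: {CAT}, UI_RL: {UI_RL}, RESULT: {RESULT},
  INT_TYPE: {INT_TYPE}, DEG: toInt({DEG}), SDS_TD: toFloat({SDS_TD}),
  Path_L_TD: toInt({Path_L_TD}), Path_S_TD: {Path_S_TD}
}]->(RIGHT_NODE)
```

With the labels in the MATCH, the planner can use the LC_ID indexes instead of scanning all nodes for every CSV row.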

I also saw that you have a ton of relationship properties. Do you think you
need them all?
Perhaps there is also a node/entity actually hiding in your relationships?
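
To illustrate that last question with a purely hypothetical sketch: if the properties describe a thing in its own right, say an interaction with a type and a result, that thing could become a node sitting between the two endpoints. The Interaction label, the HAS_INTERACTION and WITH_TARGET relationship types, and the choice of which properties move onto the node are all invented here for illustration and are not taken from your data model.

```cypher
// Hypothetical remodelling: the relationship's payload becomes an :Interaction node.
// LABEL1/LABEL2 are placeholders; the property names come from the quoted statement.
MATCH (l:LABEL1 {LC_ID: {LEFT_NODE}}),
      (r:LABEL2 {LC_ID: {RIGHT_NODE}})
CREATE (l)-[:HAS_INTERACTION]->(i:Interaction {
         INT_TYPE: {INT_TYPE}, RESULT: {RESULT}, CAT: {CAT}
       })-[:WITH_TARGET]->(r)
```

Whether this is worth it depends on the queries you plan to run; it mainly pays off if you want to index or connect to the interaction itself.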

Michael

On Fri, Dec 5, 2014 at 12:54 AM, José F. Morales <[email protected]> wrote:

> OK Fellas,
>
> What do you think of this?
>
> Did this first...
>
> auto-index LC_ID
>
> Then this...
>
> import-cypher -d , -i SAMPLE/Tz/Total_RELS_2.csv -b 1000  MATCH (LEFT_NODE
> {LC_ID:{LEFT_NODE}}), (RIGHT_NODE {LC_ID:{RIGHT_NODE}}) CREATE LEFT_NODE
> -[:#{REL}
> {PHYLUM:#{PHYLUM},CAT:#{CAT},UI_RL:#{UI_RL},RESULT:#{RESULT},INT_TYPE:#{INT_TYPE},DEG:toINT(#{DEG}),SDS_TD:toFloat(#{SDS_TD}),Path_L_TD:toINT(#{Path_L_TD}),Path_S_TD:#{Path_S_TD}}]->RIGHT_NODE
> return *
>
> On Thursday, December 4, 2014 6:27:53 PM UTC-5, Michael Hunger wrote:
>>
>> Perhaps you should show the statement too? Not just the log output? :)
>>
>> use this: CREATE INDEX ON :{Label}(LC_ID); <- replace with your label(s)
>>
>> On Fri, Dec 5, 2014 at 12:09 AM, José F. Morales <[email protected]>
>> wrote:
>>
>>> Andrii and Michael,
>>>
>>> Sorry for the delay in response. I was a little under the weather.
>>> ANYHOW, it looks like I figured out how to do the data loading! I was
>>> trying several approaches and the one using Michael's shell tools seems to
>>> have worked! There was info from Andrii that proved important as well
>>> (my_node_ID as integer).  The loading of the 18k NODES took seconds. When
>>> I tested the RELS with a tiny data set it worked perfectly.  I am cleaning
>>> up the 52k RELS file after the first attempt failed because of a missing
>>> " ' ".
>>>
>>> My only issue is that the RELs loading is slow....
>>>
>>> commit after 1000 row(s)  0. 1%: nodes = 0 rels = 1000 properties = 7000
>>> time 7059450 ms total 7059450 ms
>>>
>>> Now I thought that if I created an index (below), it would be faster.
>>> Apparently not.
>>>
>>> neo4j-sh (?)$ auto-index LC_ID
>>>
>>> Enabling auto-indexing of Node properties: [LC_ID]
>>>
>>> Do I have this wrong?  Should it have been CREATE INDEX ON :LC_ID?
>>>
>>> Jose
>>>
>>>
>>> On Monday, December 1, 2014 5:09:36 PM UTC-5, Andrii Stesin wrote:
>>>>
>>>> Hi José,
>>>>
>>>> On Monday, December 1, 2014 12:33:58 AM UTC+2, José F. Morales wrote:
>>>>>
>>>>> Ok, but how many valid distinct combinations of your 10 node labels
>>>>>> may exist?
>>>>>>
>>>>>
>>>>> JFM: 264
>>>>>
>>>>
>>>> This makes me think that maybe your target data model needs some
>>>> refactoring. What are the entities (classes), and what can be better
>>>> considered as attributes? Again, I'm not familiar with LabCard, so in case
>>>> you give some explanations and a sample dataset which is publicly
>>>> available, I'd take a close look at it.
>>>>
>>>>
>>>>> JFM:  Like I said, there are 264 unique combinations in all my nodes.
>>>>>> Some are redundant, full spelling of a term/phrase and an abbreviation.
>>>>>> Some are a code for a term/phrase.  Some were created in anticipation of
>>>>>> other values I would create later.  I am trying to anticipate queries
>>>>>> I'll make later.
>>>>>>
>>>>>
>>>> Once again, I foresee a data modelling issue here.
>>>>
>>>>
>>>>> JFM: Makes sense for speed. I guess it depends upon the size of one's
>>>>>>> data.
>>>>>>>
>>>>>>
>>>> Sure it does :)
>>>>
>>>>
>>>>> Q3: “Skewer” is just an integer right?  It corresponds in a way to
>>>>>>> my_node_id
>>>>>>>
>>>>>>
>>>>>> No, it's a label! so in Cypher your node (suppose it has 2 labels
>>>>>> :LabelA and :LabelJ ) is described like
>>>>>>
>>>>>> MATCH (n:LabelA:LabelJ:Skewer {my_node_id: 123454, p1: 'something',
>>>>>> p2: 'something else', p3: 'etc.'})
>>>>>>
>>>>>>
>>>>> JFM: Got that!
>>>>>
>>>>> JFM: ok basic question...  MATCH (n:  <---What is "n"? Does it just
>>>>> indicate that it's a node of a particular class?  Which letter it is is
>>>>> arbitrary, right?  Is there a name for what "n" is? For a while there, I
>>>>> thought it was *my_node_ID*.
>>>>>
>>>>
>>>> *n* is just the name of a variable. Cypher, like any other programming
>>>> language, has a notion of "variable", which has its name and which can take
>>>> different values; here I've chosen *n* arbitrarily as the
>>>> variable name.
>>>>
>>>>
>>>>> Q4: So does repeating the LOAD CSV with each file CLT_NODES_LabelA…J
>>>>>>> combine the various labels and their respective values with their
>>>>>>> corresponding nodes?
>>>>>>>
>>>>>>
>>>>>> Label is not a variable, it does not have a value. It's just a label,
>>>>>> consider "tag".
>>>>>> Also *my_node_id* IS a variable so it does have a value.
>>>>>>
>>>>>
>>>>> JFM: OK, I am not understanding this.  I understood a "Label" as a
>>>>> general category for a node.
>>>>>
>>>>
>>>> That's OK, or maybe even better is to imagine a tag. A node may have
>>>> multiple tags (labels), and they can be added and/or removed.
>>>>
>>>>
>>>>> This was as opposed to a "Property" that was specific to a particular
>>>>> node.  As I understood it, a "Label" has different values.
>>>>>
>>>>
>>>> A label is just a label. It doesn't have any value itself, it just marks
>>>> (tags) some (sub)set of your nodes and allows you to distinguish between
>>>> them. Labels may overlap. Consider the automotive domain, and let's take a
>>>> look at a data model for it.
>>>>
>>>> Brand seems to be better modelled as a label. Say `Opel`, `Volvo` or
>>>> `Peugeot`.
>>>> Kind of vehicle is definitely(???) a label. Say `Truck`, `SUV`, `Car`.
>>>> How to model some deeper things depends on what you are going to
>>>> achieve.
>>>> Is body color a label or property? Which approach is better: either
>>>>
>>>> MATCH (vhcl:Truck:Volvo {body_color: 'red', VIN:
>>>> 'VE18727673826812634X65' })
>>>>
>>>> or
>>>>
>>>> MATCH (vhcl:Opel:Yellow:SUV {VIN: 'VE18727673826812634X65'})
>>>>
>>>> ? I'm not sure, it depends on the goal. As for me, I'd prefer color to
>>>> be a property of one exact single car (after all, you can decide to repaint
>>>> your yellow car white or some other color).
>>>>
>>>> But VIN is *definitely* a property of one exact single car.
>>>>
>>>> Is a car license plate a label or a property? Definitely neither,
>>>> because you can sell your car and the new owner will get another license
>>>> plate for it, so I'd model this as
>>>>
>>>> MATCH (vhcl:Car:Ford {body_color: 'pink', VIN: 'FGT87356873HU8745'})
>>>>       -[:HAS_LICENSE_PLATE]->(lp:LicensePlate {state: 'AL', str: 'WH4TWR'})
>>>>
>>>>
>>>> but as you see, `LicensePlate` obviously should never be mixed with
>>>> either `Car` or `Truck`, so they are different labels which do not
>>>> intersect.
>>>>
>>>> So that Label could be "Category" and there could be two categories,
>>>>> for example...  CLT_SOURCE and CLT_TARGET.  I thought that makes it
>>>>> like a variable.  If not, the label is all the same on a given set of nodes
>>>>> and what's the point in that?
>>>>>
>>>>> JFM: OK, I get that *my_node_id* is a variable.
>>>>>
>>>>
>>>> Agh, exactly.
>>>>
>>>>
>>>>>
>>>>>>    1. When doing LabelA .csv you will create whatever uniquely
>>>>>>    numbered nodes were not already in the database, fill their
>>>>>>    properties (or maybe overwrite them?) and label the node (be it a new
>>>>>>    or an existing one) with LabelA - no matter what other labels the
>>>>>>    node (possibly) had,
>>>>>>
>>>>>>  JFM: OK.  I get it.
>>>>>
>>>>>>
>>>>>>    1. When doing LabelJ .csv you *again* will create whatever
>>>>>>    uniquely numbered nodes were not already in the database, *again*
>>>>>>    either fill or overwrite properties, and *again* label the node (be
>>>>>>    it a new or an existing one) with LabelJ - no matter what other
>>>>>>    labels the node (possibly) had,
>>>>>>
>>>>>>  JFM: OK.  I get it.
>>>>>
>>>>>>
>>>>>>    1. so if you created some node with the first file and labeled it
>>>>>>    LabelA, and the same unique *my_node_id* occurs in both the first
>>>>>>    and second files, your node will get 2 labels: LabelA and LabelJ.
>>>>>>
>>>>>> JFM: That's what I want!!
>>>>>
>>>>
>>>> Huh, Ok so far :)
>>>>
>>>>
>>>>> Q5: Since I think of my data in terms of the two classes of nodes in
>>>>>>> my Data model …[CLT_SOURCE —> CLT_TARGET ;  CLT_TARGET —>  CLT_SOURCE],
>>>>>>> after loading the nodes, how do I then get two classes of nodes?
>>>>>>>
>>>>>>
>>>>>> Make them 2 labels: CLTSource and CLTTarget respectively.
>>>>>>
>>>>>
>>>>> JFM: OK.  Regarding the labels...my csv file has a column called DESC
>>>>> that has two values CLT_SOURCE and CLT_TARGET.  You are saying that
>>>>> my Source csv should have a CLT_SOURCE column and my target csv
>>>>> should have a CLT_TARGET column?  My csv files should NOT have the
>>>>> configuration I described?
>>>>>
>>>>
>>>> What does CLT really mean in real life? I failed to parse it :( sorry
>>>> for that. Once again, in case you describe the LabCard domain and provide
>>>> me with a dataset, I'd be able to give you some better ideas (this also may
>>>> become a good tutorial sample case for future Neo4j users).
>>>>
>>>>
>>>>> JFM: Since my csv file has its A thru J columns  A (2) values, B (1),
>>>>> C (4) D (83), E (83), F (11) G (11) H (83) J (83), K (2), I should have
>>>>> A LOT of csv files instead of just two for nodes!
>>>>>
>>>>
>>>> Again, I strongly suspect a data modelling issue here.
>>>>
>>>>
>>>>> JFM: What I am not getting from this is there is one csv file that has
>>>>>>> the CLTSOURCE and CLTTARGET labels in it. That contradicts what I said
>>>>>>> above because that would make only 1 csv file.  I assume there is one
>>>>>>> LOAD CSV statement and the my_node_ID:TOINT(csvline[0]) and
>>>>>>> my_node_ID:TOINT(csvline[1]) refer presumably to two lines in that
>>>>>>> file.
>>>>>>>
>>>>>>
>>>> As soon as you have both src and target nodes inside the database, you
>>>> need a .csv file which describes only relationships: the 1st column
>>>> contains src node ids, the 2nd column contains dst node ids, and thus
>>>> 1 row of the .csv describes 1 single relationship per (linked) pair of
>>>> nodes.
>>>>
>>>> For .csv with relationships, csvline[0] is a value of the *my_node_id* property
>>>>>>>> of the *source* node, csvline[1] is a value of the *my_node_id* property
>>>>>>>> of the *target* node, and TOINT() type conversion is used because
>>>>>>>> my personal preference is to use integers for ids.
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>> Is it that ToInt(csvline[0]) refers to a line of the REL.csv
>>>>>>> file?
>>>>>>>
>>>>>>> Does csvline[0] refer to a column in REL.csv as do csvline[2] and
>>>>>>> csvline[ZZ] (line 3) ?
>>>>>>>
>>>>>>
>>>>>>
>>>>> JFM: OK, I think I get it.
>>>>>
>>>>>
>>>>>> I think you can combine the import of multiple .CSV files in a single
>>>>>> LOAD CSV statement, but I have never tried this mode.
>>>>>>
>>>>>> WBR,
>>>>>> Andrii
>>>>>>
>>>>>>
>>>>>
>>>>> JFM: Thanks!
>>>>>
>>>>
>>>> :)
>>>>
>>>> WBR,
>>>> Andrii
>>>>
>>>  --
>>> You received this message because you are subscribed to the Google
>>> Groups "Neo4j" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to [email protected].
>>> For more options, visit https://groups.google.com/d/optout.
>>>
>>
