If you look at the video it is pretty obvious; she outlines all the major steps and pitfalls.
Except for one: *create nodes and rels separately* if you need more than one MERGE.

GOOD:
    MERGE|MATCH|CREATE node
    MERGE|MATCH|CREATE node
    CREATE rel

GOOD:
    MATCH node
    MATCH node
    MERGE rel

BAD:
    MATCH|CREATE node
    MERGE node
    MERGE rel

BAD:
    MATCH node
    SET node.prop

On Sat, Nov 29, 2014 at 1:53 AM, FANC2 <[email protected]> wrote:

> Both. Using what I did before, the loading either never finished or failed. I'm trying not to follow that example with the figuring it out!! :)

> On Nov 28, 2014, at 7:50 PM, Michael Hunger <[email protected]> wrote:

> What takes so long? The loading? Or figuring it out?

> Michael

> On Sat, Nov 29, 2014 at 1:18 AM, José F. Morales <[email protected]> wrote:

>> Hey Michael,

>> I'll check it out. Trouble is knowing what over-complicating is. Thanks for the heads up!

>> I am trying to figure out inductively how to use LOAD CSV from various examples. Thanks for another one.

>> It's killing me that it's taking so long.

>> Jose

>> On Friday, November 28, 2014 6:16:49 PM UTC-5, Michael Hunger wrote:

>>> José,

>>> if you watch Nicole's webinar many things will become clear: https://vimeo.com/112447027 You don't have to overcomplicate things.

>>> The Skewer(id) thing is not really needed if each of your entities has a label and a primary key of some sort. It is just an optimization so you don't have to think about separate entities.

>>> Cheers, Michael

>>> On Sat, Nov 29, 2014 at 12:12 AM, José F. Morales <[email protected]> wrote:

>>>> Hey Andrii,

>>>> I've been thinking a lot about your recommendations. I have some questions, some of which show how ignorant I am. Apologies for basics if necessary.

>>>> On Thursday, November 20, 2014 6:22:34 AM UTC-5, Andrii Stesin wrote:

>>>>> Before you start.

>>>>> 1. On nodes and their labels. First of all, I strongly suggest you separate your nodes into different .csv files by label.
>>>>> So you won't have a column *`label`* in your .csv, but rather a set of files:

>>>>> nodes_LabelA.csv
>>>>> ...
>>>>> nodes_LabelZ.csv

>>>>> whatever your labels are. (Consider a label to be a kind of synonym for `class` in object-oriented programming or `table` in an RDBMS.) That's due to the fact that labels in Cypher are somewhat specific entities, and you probably won't be allowed to parameterize them into variables inside your LOAD CSV statement.

>>>> OK, so you have modified your original idea of putting the db into two files (one for nodes, one for relationships). Now you say to put all the nodes into one file per label. The way I have worked so far, I created one file for a class of nodes I'll call CLT_SOURCE and another file for a class of nodes called CLT_TARGET. Then I have a file for the relationships. Perhaps foolishly, I originally created one file combining all of this info and tried to paste it into the browser or the shell. Neither worked, even though it did with a smaller amount of data.

>>>> You are recommending that with the nodes, I take two steps...
>>>> 1) Combine my CLT_SOURCE and CLT_TARGET nodes,
>>>> 2) then split that file into files that correspond to the node: *my_node_id*, one label, and then properties P1...Pn. Since I have 10 labels per node, I should have 10 files named Nodes_LabelA...Nodes_LabelJ. Thus...

>>>> File: CLT_Nodes-LabelA columns: *my_node_id*, label A, property P1..., property P4
>>>> ...
>>>> File: CLT_Nodes-LabelJ columns: *my_node_id*, label J, property P1..., property P4

>>>> Q1: What are the rules about what can be used for *my_node_id*? I have usually seen them as a letter-integer combination. Is that the convention? Sometimes I've seen a letter being used with a specific class of nodes: a1..a100 for one class and b1..b100 for another.
>>>> I learned the hard way that you have to give each node a unique ID. I used CLT_1...CLT_n for my CLT_SOURCE nodes and CLT_TARGET_1...CLT_TARGET_n for my TARGET nodes. It worked with the smaller db I made. Anything wrong with using the convention n1...n100?

>>>>> 2. Then consider one additional "technological" label, let's name it `:Skewer`, because it will "penetrate" all your nodes of every different label (class) like a kebab skewer.

>>>>> Before you start (or at least before you start importing relationships) do:

>>>>> CREATE CONSTRAINT ON (every_node:Skewer) ASSERT every_node.my_node_id IS UNIQUE;

>>>> Q2: Should I do scenario 1 or 2?

>>>> Scenario 1: add two labels to each file, one from my original nodes and one as "Skewer":

>>>> File 1: CLT_Nodes-LabelA columns: *my_node_id*, label A, *Skewer*, property P1..., property P4
>>>> ...
>>>> File 2: CLT_Nodes-LabelJ columns: *my_node_id*, label J, *Skewer*, property P1..., property P4

>>>> OR

>>>> Scenario 2: include an eleventh file, thus...

>>>> File 11: CLT_Nodes-LabelK columns: *my_node_id*, *Skewer*, property P1..., property P4

>>>> From below, I think you mean Scenario 1.

>>>> Q3: "Skewer" is just an integer, right? It corresponds in a way to my_node_id.

>>>>> 3. When doing LOAD CSV with nodes, make sure each node gets 2 (two) labels, one of them `:Skewer`. This will create an index on the `my_node_id` attribute (which makes relationship creation some orders of magnitude faster), and as a bonus you'll be sure you don't have occasional duplicate nodes.

>>>> Here is some sort of Cypher...

>>>> //Creating the nodes

>>>> USING PERIODIC COMMIT 1000
>>>> LOAD CSV WITH HEADERS FROM "…/././…. CLT_NODES_LabelA.csv" AS csvline
>>>> MERGE (n:Skewer:LabelA {property1: csvline.property1})
>>>> ON CREATE SET
>>>>   n.Property2 = csvline.Property2,
>>>>   n.Property3 = csvline.Property3,
>>>>   n.Property4 = csvline.Property4;

>>>> ....

>>>> LOAD CSV WITH HEADERS FROM "…/././…. CLT_NODES_LabelJ.csv" AS csvline
>>>> MERGE (n:Skewer:LabelJ {property1: csvline.property1})
>>>> ON CREATE SET
>>>>   n.Property2 = csvline.Property2,
>>>>   n.Property3 = csvline.Property3,
>>>>   n.Property4 = csvline.Property4;

>>>> Q4: So does repeating the LOAD CSV with each file CLT_NODES_LabelA...J combine the various labels and their respective values with their corresponding nodes?

>>>> Q5: Since I think of my data in terms of the two classes of nodes in my data model [CLT_SOURCE -> CLT_TARGET; CLT_TARGET -> CLT_SOURCE], after loading the nodes, how then do I get two classes of nodes?

>>>> Q6: Is there a step missing that explains how the code below got to have a "source_node" and a "dest_node" that appear to correspond to my CLT_SOURCE and CLT_TARGET nodes?

>>>>> 4. Now when you are done with nodes and start doing LOAD CSV for relationships, you may give the MATCH statement, which looks up your pair of nodes, a hint for fast lookup, like:

>>>>> LOAD CSV ...from somewhere... AS csvline
>>>>> MATCH (source_node:Skewer {my_node_id: ToInt(csvline[0])}), (dest_node:Skewer {my_node_id: ToInt(csvline[1])})
>>>>> CREATE (source_node)-[r:MY_REL_TYPE {rel_prop_00: csvline[2], ..., rel_prop_NN: csvline[ZZ]}]->(dest_node);

>>>> Q6: This LOAD CSV command (line 1) looks into the separate REL.csv file you mentioned first, right?

>>>> Q7: csvline is some sort of temp file that is a series of lines of the csv file?
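Regarding Q7: csvline is not a temp file; LOAD CSV binds it to one row per iteration. Without headers it behaves like a collection indexed by column position, and with WITH HEADERS it behaves like a map keyed by header name. A rough sketch (the file path and header names here are hypothetical, not from the thread's actual REL.csv):

```cypher
// Without headers: csvline is a collection; access columns by position.
// ToInt(csvline[0]) converts the first column of the current row to an integer.
LOAD CSV FROM "file:/path/to/REL.csv" AS csvline
RETURN ToInt(csvline[0]) AS source_id, ToInt(csvline[1]) AS dest_id, csvline[2] AS rel_prop
LIMIT 5;

// With headers: csvline is a map; access columns by header name.
LOAD CSV WITH HEADERS FROM "file:/path/to/REL.csv" AS csvline
RETURN csvline.source_my_node_id, csvline.dest_my_node_id
LIMIT 5;
```

RETURN ... LIMIT 5 is just a cheap way to inspect the first few rows before committing to a full import.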
>>>> Q8: Do you imply in line 2 that the REL.csv file has headers that include source_node, dest_node?

>>>> Q9: While I see how Skewer is a label, how is my_node_id a property (line 2)?

>>>> Q10: How does my_node_id relate to either ToInt(csvline[0]) or ToInt(csvline[1]) (line 2)?

>>>> Is it that ToInt(csvline[0]) refers to a line of the REL.csv file?

>>>> Does csvline[0] refer to a column in REL.csv, as do csvline[2] and csvline[ZZ] (line 3)?

>>>>> Adding the *`:Skewer`* label in MATCH will tell Cypher to (implicitly) use your index on *my_node_id*, which was created when you created your constraint. Or you may try to explicitly give it a hint to use the index, with a USING INDEX... clause after MATCH, before CREATE. Btw, some earlier versions of Neo4j refused to use the index in LOAD CSV for some reason; I hope this problem is gone with 2.1.5.

>>>> OK

>>>>> 5. While importing, be careful to *explicitly specify type conversions for each property which is not a string*. I have seen numerous occasions when people missed ToInt(csvline[i]) or ToFloat(csvline[j]), and Cypher silently stored their (supposed) numerics as strings. It's OK, dude, you say it :) This led to confusion afterwards when, say, numerical comparisons don't MATCH and so on (though it's easy to correct with a single Cypher command, but anyway).

>>>> I think I did that re. type conversion. It only applies to properties for my data.

>>>> Sorry for so many questions. I am really interested in figuring this out!

>>>> Thanks loads,
>>>> Jose

>>>>> WBR,
>>>>> Andrii

>>>>> On Wednesday, November 19, 2014 9:36:50 PM UTC+2, José F. Morales wrote:

>>>>>> 3. CSV approach
>>>>>> a. "Dump the base into 2 .csv files:"
>>>>>> b. CSV1: "Describe nodes (enumerate them via some my_node_id integer attribute), columns: my_node_id,label,node_prop_01,node_prop_ZZ"
>>>>>> c. CSV2: "Describe relations, columns: source_my_node_id,dest_my_node_id,rel_type,rel_prop_01,...,rel_prop_NN"
>>>>>> d. Indexes/constraints: before starting the import, have appropriate indexes/constraints in place
>>>>>> e. Via LOAD CSV, import CSV1, then CSV2.
>>>>>> f. Import no more than 10,000-30,000 lines in a single LOAD CSV statement

>>>>>> This seems to be a very well elaborated method and the easiest for me to do. I have files such that I can create these without too much problem. I figure I'll split the nodes into three files of 20k rows each. I can do the same with the rels. I have not used indexes or constraints yet in the db's that I already created, and as I said above, I'll have to see how to use them.

>>>>>> I am assuming column headers that fit with my data are consistent with what you explained below (like, I can put my own meaningful text into label1-10 and node_prop_01-05)....
>>>>>> my_node_id, label1, label2, label3, label4, label5, label6, label7, label8, label9, label10, node_prop_01, node_prop_02, node_prop_03, node_prop_04, node_prop_ZZ

>>>>>> Thanks again, fellas!!

>>>>>> Jose

>>>>>> On Wednesday, November 19, 2014 8:04:44 AM UTC-5, Michael Hunger wrote:

>>>>>>> José,

>>>>>>> Let's continue the discussion on the Google group.

>>>>>>> With "larger" I meant amount of data, not size of statements.

>>>>>>> As I also point out in various places, we recommend creating only small subgraphs with a single statement, separated by semicolons.
>>>>>>> E.g. up to 100 nodes and rels.

>>>>>>> Gigantic statements just make the parser explode.

>>>>>>> I recommend splitting them up into statements creating subgraphs, or creating the nodes and later matching them by label & property to connect them. Make sure to have appropriate indexes/constraints.

>>>>>>> You should also surround blocks of statements with begin and commit commands.

>>>>>>> Sent from my iPhone

>>>>>>> On Nov 19, 2014, at 04:18, José F. Morales Ph.D. <[email protected]> wrote:

>>>>>>> Hey Michael and Kenny,

>>>>>>> Thank you guys a bunch for the help.

>>>>>>> Let me give you a little background. I am charged with making a prototype of a tool ("LabCards") that we hope to use in the hospital and beyond at some point. In preparation for making the main prototype, I made two prior Neo4j databases that worked exactly as I wanted them to. The first database was built with NIH data and had 183 nodes and around 7500 relationships. The second database was the pre-prototype, and it had 1080 nodes and around 2000 relationships. I created these in the form of Cypher statements and either pasted them into the Neo4j browser or used the neo4j shell and loaded them as text files. Before doing that, I checked the Cypher code with Sublime Text 2, which highlights the code. Both databases loaded fine with both methods and did what I wanted them to do.

>>>>>>> As you might imagine, the prototype is an expansion of the mini-prototype. It has almost the same data model, and I built it as a series of Cypher statements as well. My first version of the prototype had ~60k nodes and 160k relationships.

>>>>>>> I should say that a feature of this model is that all the source and target nodes have relationships that point to each other. No node points to itself as far as I know. This file was 41 MB of Cypher code that I tried to load via the neo4j shell.

>>>>>>> In fact, I was following your advice on loading big data files... "Use the Neo4j-Shell for larger Imports" (http://jexp.de/blog/2014/06/load-csv-into-neo4j-quickly-and-successfully/). This first time out, Java maxed out its allocated memory at 4 GB (2x) and did not complete loading in 24 hours. I killed it.

>>>>>>> I then contacted Kenny, and he generously gave me some advice regarding the properties file (below), and again the same deal (4 GB memory, 2x) with Java and no success in about 24 hours. I killed that one too.

>>>>>>> Given my loading problems, I have subsequently eliminated a bunch of relationships (100k) so that the file is now 21 MB. A lot of these were duplicates that I didn't pick up before, and I am trying it again. So far, 15 min into it, similar situation. The difference is that Java is using 1.7 and 0.5 GB of memory.

>>>>>>> Here is the Cypher for a typical node...

>>>>>>> CREATE (CLT_1:`CLT SOURCE`:BIOMEDICAL:TEST_NAME:`Laboratory Procedure`:lbpr:`Procedures`:PROC:T059:`B1.3.1.1`:TZ {NAME:'Acetoacetate (ketone body)', SYNONYM:'', Sample:'SERUM, URINE', MEDCODE:10010, CUI:'NA'})

>>>>>>> Here is the Cypher for a typical relationship...

>>>>>>> CREATE (CLT_1)-[:MEASUREMENT_OF {Phylum:'TZ', CAT:'TEST.NAME', Ui_Rl:'T157', RESULT:'', Type:'', Semantic_Distance_Score:'NA', Path_Length:'NA', Path_Steps:'NA'}]->(CLT_TARGET_3617),

>>>>>>> I will let you know how this one turns out. I hope this is helpful.
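For reference, Michael's earlier advice (small subgraphs per statement, blocks wrapped in begin/commit) applied to statements like the two above might look roughly like this in the neo4j-shell. This is only a sketch: the labels, ids, and properties are abbreviated from the examples in the thread, the batch size is a guess, and it assumes Andrii's `:Skewer` uniqueness constraint on `my_node_id` already exists so the MATCH can use the index:

```cypher
// neo4j-shell: wrap each block of small statements in an explicit transaction
begin
CREATE (:`CLT SOURCE`:Skewer {my_node_id:'CLT_1', NAME:'Acetoacetate (ketone body)', MEDCODE:10010});
CREATE (:CLT_TARGET:Skewer {my_node_id:'CLT_TARGET_3617'});
// ... up to ~100 statements per block ...
commit

// Second pass: connect the nodes by looking them up via the indexed id,
// instead of relying on identifiers from one giant CREATE statement.
begin
MATCH (s:Skewer {my_node_id:'CLT_1'}), (t:Skewer {my_node_id:'CLT_TARGET_3617'})
CREATE (s)-[:MEASUREMENT_OF {Phylum:'TZ', CAT:'TEST.NAME'}]->(t);
commit
```

Splitting the 41 MB file into such blocks keeps each parsed statement tiny, which is exactly what the "gigantic statements just make the parser explode" warning is about.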
>>>>>>> Many, many thanks, fellas!!!

>>>>>>> Jose

>>>>>>> On Nov 18, 2014, at 8:33 PM, Michael Hunger <[email protected]> wrote:

>>>>>>> Hi José,

>>>>>>> Can you provide perhaps more detail about your dataset (e.g. a sample of the csv, size, etc.; perhaps an output of csvstat (from csvkit) would be helpful) and your Cypher queries to load it?

>>>>>>> Have you seen my other blog post, which explains two big caveats that people run into when trying this? jexp.de/blog/2014/10/load-cvs-with-success/

>>>>>>> Cheers, Michael

>>>>>>> On Tue, Nov 18, 2014 at 8:43 PM, Kenny Bastani <[email protected]> wrote:

>>>>>>>> Hey Jose,

>>>>>>>> There is definitely an answer. Let me put you in touch with the data import master: Michael Hunger.

>>>>>>>> Michael, I think the answers here will be pretty straightforward for you. You met Jose at GraphConnect NY last year, so I'll spare any introductions. The memory map configurations I provided need to be calculated and customized for the data import volume.

>>>>>>>> Thanks,

>>>>>>>> Kenny

>>>>>>>> Sent from my iPhone

>>>>>>>> On Nov 18, 2014, at 11:37 AM, José F. Morales Ph.D. <[email protected]> wrote:

>>>>>>>> Kenny,

>>>>>>>> In 3 hours it'll have been trying to load for 24 hours, so this is not working. I'm catching shit from my crew too, so I've got to fix this, like, soon.

>>>>>>>> I haven't done this before, but can I break up the data and load it in pieces?

>>>>>>>> Jose

>>>>>>>> On Nov 17, 2014, at 3:35 PM, Kenny Bastani <[email protected]> wrote:

>>>>>>>> Hey Jose,

>>>>>>>> Try turning off the object cache. Add this line to the neo4j.properties configuration file:

>>>>>>>> cache_type=none

>>>>>>>> Then retry your import. Also, enable memory-mapped files by adding these lines to the neo4j.properties file:

>>>>>>>> neostore.nodestore.db.mapped_memory=2048M
>>>>>>>> neostore.relationshipstore.db.mapped_memory=4096M
>>>>>>>> neostore.propertystore.db.mapped_memory=200M
>>>>>>>> neostore.propertystore.db.strings.mapped_memory=500M
>>>>>>>> neostore.propertystore.db.arrays.mapped_memory=500M

>>>>>>>> Thanks,

>>>>>>>> Kenny

>>>>>>>> ------------------------------
>>>>>>>> *From:* José F. Morales Ph.D. <[email protected]>
>>>>>>>> *Sent:* Monday, November 17, 2014 12:32 PM
>>>>>>>> *To:* Kenny Bastani
>>>>>>>> *Subject:* latest

>>>>>>>> Hey Kenny,

>>>>>>>> Here's the deal. As I think I said, I loaded the 41 MB file of Cypher code via the neo4j shell. Before I tried the LabCards file, I tried the movies file and a UMLS database I made (8k relationships). They worked fine.

>>>>>>>> The LabCards file is taking a LONG time to load, since I started at about 9:30-10 PM last night and it's 3 PM now.

>>>>>>>> I've wondered if it's hung up; the activity monitor's memory usage is constant at two rows of Java at 4 GB, with the kernel at 1 GB. The CPU panel changes a lot, so it looks like it's doing its thing.

>>>>>>>> So is this how things are expected to be? Do you think the loading is gonna take a day or two?

>>>>>>>> Jose

>>>>>>>> |//.\\||//.\\|||//.\\||//.\\|||//.\\||//.\\||
>>>>>>>> José F. Morales Ph.D.
>>>>>>>> Instructor
>>>>>>>> Cell Biology and Pathology
>>>>>>>> Columbia University Medical Center
>>>>>>>> [email protected]
>>>>>>>> 212-452-3351

--
You received this message because you are subscribed to the Google Groups "Neo4j" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
For more options, visit https://groups.google.com/d/optout.
