Hi

I have a medium size dataset with 100,000 rows and I use this command for 
importing data from csv file to graph database

LOAD CSV WITH HEADERS FROM "file:///E:/datasets/Actors_data_all.csv" AS row
MERGE (c:Country {Name:row.Country})
MERGE (a:Actor {Name: row.ActorName, Aliases: row.Aliases, Type: 
row.ActorType})
MERGE (o:Organization {Name: row.AffiliationTo})
MERGE (a)-[:AFFILIATED_TO {Start: row.AffiliationStartDate, End: 
row.AffiliationEndDate}]->(o)
MERGE(c)<-[:IS_FROM]-(a);

My PC has 8GB RAM 

Is it normal to take more than 6 hours to create 42000 nodes and 100,000 
relationships ? If not would you please help me to find out how to fix this 
problem and import data from csv file faster?

Thanks in advance
--
Maryam Gerami
R&D department, Informatics services corporation

On Tuesday, February 10, 2015 at 12:44:55 PM UTC+3:30, Chris Vest wrote:

> That’s a nice big dataset.
>
> I would output the tables to CSV files in sorted order by {id1, id2}. Then 
> I’d preprocess them with Python and generate CSV files that import-tool can 
> eat. I’d probably keep the files compressed on disk at all times – 
> import-tool in 2.2-M03 can eat .csv.zip and .csv.gz files just fine. A big 
> portion of this import is going to IO bound, so compression will speed that 
> up. You have so many CPU cores, anyway.
>
> But before that, because I know that the 6 table files are sorted by {id1, 
> id2}, I can process them in parallel in a merge-sort-like way, where I grab 
> the input line with the least key (assuming ascending order) to process, 
> and if more than one file are currently at that key, then I merge the 
> inputs as you described.
>
> Another trick I’d try pulling, since we are pre-processing the data 
> anyway, is to assign the generated node and relationship records unique 
> integer id’s. This would allow you to tell import-tool to use --id-type 
> ACTUAL, which means it won’t have to spend time maintaining a mapping 
> between internal record ids, and the varchar ids in your dataset. This will 
> speed up the import.
>
> If for every {ABC, XYZ, ?} record, there’s also a {XYZ, ABC, ?} record, 
> then the node ids will be easy to generate, as they can just be a count of 
> how many different id1 values you’ve seen. If that assumption does not 
> hold, however, then you also need to look at all the id2 values to generate 
> node records, which is annoying since they don’t come in sorted order, 
> which in turn means that you need to somehow filter out values you’ve 
> already seen (or otherwise deterministically compute a distinct integer 
> from the value). But all this is only if you want to use --id-type ACTUAL. 
> I don’t know if that’s possible for you. Otherwise the import will just 
> take a bit longer.
>
> One other thing: you can’t put more than 2 billion nodes in a schema 
> index, so using LOAD CSV with MERGE won’t work for a dataset this big. This 
> is a limitation of the version of Lucene that we use. (I don’t know if 
> newer versions of Lucene lift this limit, but we plan on addressing it in 
> the future regardless.)
>
> --
> Chris Vest
> System Engineer, Neo Technology
> [ skype: mr.chrisvest, twitter: chvest ]
>
>
> On 10 Feb 2015, at 04:39, Jesse Liu <[email protected] <javascript:>> 
> wrote:
>
> Hi, All,
>  
> Thanks for all the helps!
>  
> Now I consider Ziolek's opinion: use the latest version of Neo4j and use 
> import tool.
> According to the import tool examples, I've also update my scenario, as 
> described below.
>  
> Actually, I want to build social network circles.
> I've six months' data in the oracle database, stored as six tables named 
> TH_07, TH_08, TH_09, TH_10, TH_11, TH_12 respectly.
> Every table has the same description:
> id1 varchar, id2 varchar, relationship_property int, primary key is {id1, 
> id2}
> P.S. There may be exactly the same {id1, id2} pair between different 
> tables, but with different relationship_property, e.g. there is one and 
> only one record {ABC, XYZ, 10} in TH_07, and one and only record {ABC, XYZ, 
> int} in other tables like TH_09.
> Each tables has about 80~90 million rows!
>  
> By the way, I set up the Neo4j database and oracle on exactly the same 
> machine with 256GB RAM and 64-core CPU.
>  
> I want to build a graph database which each id1 and id2 represent a node, 
> and if there is a record (id1, id2, relation_property) in oracle, create a 
> relationship between id1 and id2 with relation_property.
>  
> The First Question:
> According to http://neo4j.com/docs/2.2.0-M03/import-tool-examples.html 
> <http://www.google.com/url?q=http%3A%2F%2Fneo4j.com%2Fdocs%2F2.2.0-M03%2Fimport-tool-examples.html&sa=D&sntz=1&usg=AFQjCNE5payw_DRMggDm23dY9PZ2cTYOFw>,
>  
> first of all I should export the node from oracle into csv file.
> I need UNIQUE node with id, so I have three choices:
> 1. use DISTINCT in oracle, but I have six tables so it's very hard;
> 2. use MERGE in Cypher, but it's too slow! I cannot stand the low effiency;
> 3. use Python to connect to oracle, and preprocess the data in Python 
> (since I've 256GB RAM it's possible to process such big data)
> Is it possible to import 7.5 billion nodes once from csv file?
>  
> The Second Question:
> How can I update the relationship_property? For example, I've {ABC, XYZ, 
> 10} in table TH_07, and {ABC, XYZ, 20} in table TH_08, so I hope update 
> relationship between {ABC} and {XYZ} is 10+20 = 30 for simplicity.
> 1. Process it also in Python?
> 2. Can I do this in Cypher?
>  
> The Third Question:
> I've tried LOAD CSV in Neo4j -2.1.6-community version.
> The Cypher language is exactly shown below:
> USING PERIODIC COMMIT
> LOAD CSV WITH HEADERS FROM 'FILEPATH' AS ROW
> CREATE (n:User {id: row.id1})
> CREATE (m:User {id:row.id2})
>  
> However, during the processing, I've encountered error such as "Kenel 
> error, please restart or recover" something like that (Sorry I did not 
> record the error)
>  
> The Last Question:
> How can I set the Neo4j Server Configuration? As you know, I've 7.5 
> billion nodes and about 100 billion relationships. After importing the 
> data, I should do such computation, such as Degree Centrality, Betweenness 
> Centrality, Closeness Centrality and something like this.
> How can I use my computer efficiently?
>  
> Thank you!
>  
> Yours, Jesse
>
>  
>  
>  
>  
>
> 在 2015年2月3日星期二 UTC+8下午4:18:39,Jesse Liu写道:
>
>> Hi, All,
>>
>> I'm a beginner of graph database Neo4J.
>> Now I need to import the data from Oracle to Neo4j.
>>
>> First, I'll describe my application scenario.
>>
>> I have just one oracle table with more than 100 million rows.
>> The table desc is:
>> id1 varchar, id2 varchar, relation_properpy int.
>>
>> id1 and id2 are primary key.
>>
>> The oracle server and Neo4J server are set up on the same machine.
>>
>> Now how I can create nodes for each id and one directed relationship 
>> between id1 and id2 for each row?
>>
>> As far as I know, there are three ways to do this:
>> 1. Java Rest JDBC API
>> I've write a code demo and found it's too slow: 100,00 rows per minute.
>> Besides, it's not easy to establish a Java Environment in 
>>
>> 2. Python Embedded.
>> I haven't write test code right now, but I think it's not better than 
>> Java.
>>
>> 3.Batch Insert
>> Export the data from oracle as CSV file;
>> Import the CSV data into Neo4J using Cypher.
>> I believe it's the fastest way to import data. However, I don't know how 
>> to do this. All the demo I've seen on the Internet is about adding nodes 
>> but without adding relationships with specific properties.
>>
>> I wonder is there anybody encounter such scenario? Can you give me some 
>> advises? Or is there any better solution to import data?
>>
>> Thank you very much!
>>
>> Jesse
>> Feb 3rd, 2015
>>
>
> -- 
> You received this message because you are subscribed to the Google Groups 
> "Neo4j" group.
> To unsubscribe from this group and stop receiving emails from it, send an 
> email to [email protected] <javascript:>.
> For more options, visit https://groups.google.com/d/optout.
>
>
>

-- 
You received this message because you are subscribed to the Google Groups 
"Neo4j" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
For more options, visit https://groups.google.com/d/optout.

Reply via email to