That’s a nice big dataset.

I would output the tables to CSV files in sorted order by {id1, id2}. Then I’d 
preprocess them with Python and generate CSV files that import-tool can eat. 
I’d probably keep the files compressed on disk at all times – import-tool in 
2.2-M03 can eat .csv.zip and .csv.gz files just fine. A big portion of this 
import is going to be IO-bound, so compression will speed that up. You have 
plenty of CPU cores to spare, anyway.
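
For example, here is a minimal Python 3 sketch of writing one of the 
generated files straight to gzip (the file name, header, and example row are 
made-up placeholders; the :START_ID/:END_ID/:TYPE header style is the one 
from the 2.2 import-tool docs):

import csv
import gzip

# Write the generated records gzip-compressed right away, so nothing
# ever hits disk uncompressed. import-tool reads .csv.gz directly.
with gzip.open('rels.csv.gz', 'wt', newline='') as out:
    writer = csv.writer(out)
    writer.writerow([':START_ID', ':END_ID', 'weight:int', ':TYPE'])
    writer.writerow(['0', '1', '30', 'TH'])  # hypothetical merged record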

But before that, because I know that the 6 table files are sorted by {id1, 
id2}, I can process them in parallel in a merge-sort-like way: grab the input 
line with the least key (assuming ascending order) to process, and if more 
than one file is currently at that key, merge those inputs as you described.
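
In Python that merge is mostly free, courtesy of heapq. A rough Python 3 
sketch, assuming each table was exported headerless to one sorted .csv.gz, 
and that summing the property is the combine step you want:

import csv
import gzip
from heapq import merge
from itertools import groupby
from operator import itemgetter

def keyed(reader):
    # Tag each row with its (id1, id2) key so heapq.merge can compare rows.
    for row in reader:
        yield (row[0], row[1]), row

names = ['TH_%02d.csv.gz' % month for month in range(7, 13)]
readers = [keyed(csv.reader(gzip.open(name, 'rt'))) for name in names]

# heapq.merge does the k-way merge lazily: all inputs are sorted by
# (id1, id2), so this yields one globally sorted stream of keyed rows.
for key, group in groupby(merge(*readers), key=itemgetter(0)):
    rows = [row for _, row in group]
    # Up to six tables can hold the same {id1, id2}; combine them here,
    # e.g. by summing the property (10 + 20 = 30), as you asked below.
    total = sum(int(row[2]) for row in rows)
    # ... emit one relationship record for (key[0], key[1], total)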

Another trick I’d try pulling, since we are pre-processing the data anyway, is 
to assign the generated node and relationship records unique integer IDs. This 
would allow you to tell import-tool to use --id-type ACTUAL, which means it 
won’t have to spend time maintaining a mapping between internal record IDs and 
the varchar IDs in your dataset. This will speed up the import.

If for every {ABC, XYZ, ?} record there’s also a {XYZ, ABC, ?} record, then 
the node IDs will be easy to generate: they can just be a count of how many 
different id1 values you’ve seen. If that assumption does not hold, however, 
then you also need to look at all the id2 values to generate node records. 
That is annoying, since they don’t come in sorted order, which in turn means 
you need to somehow filter out values you’ve already seen (or otherwise 
deterministically compute a distinct integer from each value). But all this is 
only needed if you want to use --id-type ACTUAL, and I don’t know if that’s 
possible for you. Otherwise the import will just take a bit longer.
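
If the symmetry holds, the counter is trivial. For the general case, the 
dumbest dedup that could work is a dict from varchar id to integer id; a 
sketch of the idea only, since whether a plain Python dict stays affordable 
at billions of distinct ids is doubtful, and you may need something more 
compact or a hash-based scheme:

node_ids = {}  # varchar id -> integer node id

def node_id(varchar_id):
    # First sighting of a varchar id assigns the next integer;
    # later sightings return the same one. That's the dedup step.
    return node_ids.setdefault(varchar_id, len(node_ids))

# In the merge loop: map both id1 and id2 through node_id(), write
# those integers into the relationship file, and emit one node
# record per newly assigned integer, for use with --id-type ACTUAL.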

One other thing: you can’t put more than 2 billion nodes in a schema index, so 
using LOAD CSV with MERGE won’t work for a dataset this big. This is a 
limitation of the version of Lucene that we use. (I don’t know if newer 
versions of Lucene lift this limit, but we plan on addressing it in the future 
regardless.)

--
Chris Vest
System Engineer, Neo Technology
[ skype: mr.chrisvest, twitter: chvest ]


> On 10 Feb 2015, at 04:39, Jesse Liu <[email protected]> wrote:
> 
> Hi, All,
>  
> Thanks for all the help!
>  
> Now I'm taking Ziolek's advice: use the latest version of Neo4j and its 
> import tool.
> Following the import tool examples, I've also updated my scenario, as 
> described below.
>  
> Actually, I want to build social network circles.
> I have six months' data in the Oracle database, stored as six tables named 
> TH_07, TH_08, TH_09, TH_10, TH_11, TH_12 respectively.
> Every table has the same schema:
> id1 varchar, id2 varchar, relationship_property int, primary key is {id1, id2}
> P.S. The same {id1, id2} pair may appear in different tables, but with a 
> different relationship_property, e.g. there is one and only one record 
> {ABC, XYZ, 10} in TH_07, and one and only one record {ABC, XYZ, int} in 
> other tables like TH_09.
> Each table has about 80~90 million rows!
>  
> By the way, I set up the Neo4j database and Oracle on exactly the same 
> machine with 256GB RAM and a 64-core CPU.
>  
> I want to build a graph database in which each id1 and id2 represents a 
> node, and for each record (id1, id2, relation_property) in Oracle, a 
> relationship is created between id1 and id2 with relation_property.
>  
> The First Question:
> According to http://neo4j.com/docs/2.2.0-M03/import-tool-examples.html, 
> first of all I should export the nodes from Oracle into a CSV file.
> I need UNIQUE nodes by id, so I have three choices:
> 1. use DISTINCT in Oracle, but I have six tables so it's very hard;
> 2. use MERGE in Cypher, but it's too slow! I cannot stand the low efficiency;
> 3. use Python to connect to Oracle, and preprocess the data in Python (since 
> I have 256GB RAM it's possible to process such big data).
> Is it possible to import 7.5 billion nodes at once from CSV files?
>  
> The Second Question:
> How can I update the relationship_property? For example, I've {ABC, XYZ, 10} 
> in table TH_07, and {ABC, XYZ, 20} in table TH_08, so I hope the relationship 
> between {ABC} and {XYZ} is updated to 10+20 = 30, for simplicity.
> 1. Do I process this in Python, too?
> 2. Or can I do it in Cypher?
>  
> The Third Question:
> I've tried LOAD CSV in the Neo4j 2.1.6 community version.
> The Cypher is exactly as shown below:
> USING PERIODIC COMMIT
> LOAD CSV WITH HEADERS FROM 'FILEPATH' AS ROW
> CREATE (n:User {id: row.id1})
> CREATE (m:User {id:row.id2})
>  
> However, during the processing, I encountered an error such as "Kernel 
> error, please restart or recover", something like that (sorry, I did not 
> record the error).
>  
> The Last Question:
> How can I set the Neo4j server configuration? As you know, I've 7.5 billion 
> nodes and about 100 billion relationships. After importing the data, I need 
> to run computations such as Degree Centrality, Betweenness Centrality, 
> Closeness Centrality, and the like.
> How can I use my computer efficiently?
>  
> Thank you!
>  
> Yours, Jesse
> 
> On Tuesday, 3 February 2015 at 16:18:39 UTC+8, Jesse Liu wrote:
> Hi, All,
> 
> I'm a beginner with the graph database Neo4j.
> Now I need to import data from Oracle into Neo4j.
> 
> First, I'll describe my application scenario.
> 
> I have just one Oracle table with more than 100 million rows.
> The table description is:
> id1 varchar, id2 varchar, relation_property int.
> 
> {id1, id2} is the primary key.
> 
> The Oracle server and Neo4j server are set up on the same machine.
> 
> Now, how can I create a node for each id and one directed relationship 
> between id1 and id2 for each row?
> 
> As far as I know, there are three ways to do this:
> 1. Java REST JDBC API
> I've written a code demo and found it's too slow: 100,00 rows per minute.
> Besides, it's not easy to establish a Java environment in 
> 
> 2. Python embedded.
> I haven't written test code yet, but I think it's no better than Java.
> 
> 3. Batch insert
> Export the data from Oracle as CSV files;
> Import the CSV data into Neo4j using Cypher.
> I believe it's the fastest way to import data. However, I don't know how to 
> do this. All the demos I've seen on the Internet are about adding nodes, but 
> not about adding relationships with specific properties.
> 
> I wonder if anybody has encountered such a scenario? Can you give me some 
> advice? Or is there a better solution for importing the data?
> 
> Thank you very much!
> 
> Jesse
> Feb 3rd, 2015
> 

