Hi,
I have a medium-sized dataset with 100,000 rows, and I use this command to
import data from a CSV file into the graph database:
LOAD CSV WITH HEADERS FROM "file:///E:/datasets/Actors_data_all.csv" AS row
MERGE (c:Country {Name: row.Country})
MERGE (a:Actor {Name: row.ActorName, Aliases: row.Aliases, Type: row.ActorType})
MERGE (o:Organization {Name: row.AffiliationTo})
MERGE (a)-[:AFFILIATED_TO {Start: row.AffiliationStartDate, End: row.AffiliationEndDate}]->(o)
MERGE (c)<-[:IS_FROM]-(a);
My PC has 8 GB of RAM.
Is it normal for this to take more than 6 hours to create 42,000 nodes and
100,000 relationships? If not, could you please help me figure out how to
import the data from the CSV file faster?
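My guess is that the MERGEs are slow because there are no schema indexes, so
every MERGE has to scan all existing nodes with that label. Below is a sketch
of what I plan to try, assuming Neo4j 2.x constraint syntax: create unique
constraints first (each as its own statement), then MERGE on Name alone and
set the other Actor properties ON CREATE so the index can be used.

CREATE CONSTRAINT ON (c:Country) ASSERT c.Name IS UNIQUE;
CREATE CONSTRAINT ON (a:Actor) ASSERT a.Name IS UNIQUE;
CREATE CONSTRAINT ON (o:Organization) ASSERT o.Name IS UNIQUE;

USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:///E:/datasets/Actors_data_all.csv" AS row
MERGE (c:Country {Name: row.Country})
MERGE (a:Actor {Name: row.ActorName})
  ON CREATE SET a.Aliases = row.Aliases, a.Type = row.ActorType
MERGE (o:Organization {Name: row.AffiliationTo})
MERGE (a)-[:AFFILIATED_TO {Start: row.AffiliationStartDate, End: row.AffiliationEndDate}]->(o)
MERGE (c)<-[:IS_FROM]-(a);

USING PERIODIC COMMIT should also keep each transaction small enough for my
8 GB of RAM. Is this the right direction?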
Thanks in advance
--
Maryam Gerami
R&D department, Informatics services corporation
On Tuesday, February 10, 2015 at 12:44:55 PM UTC+3:30, Chris Vest wrote:
> That’s a nice big dataset.
>
> I would output the tables to CSV files in sorted order by {id1, id2}. Then
> I’d preprocess them with Python and generate CSV files that import-tool can
> eat. I’d probably keep the files compressed on disk at all times –
> import-tool in 2.2-M03 can eat .csv.zip and .csv.gz files just fine. A big
> portion of this import is going to be IO bound, so compression will speed that
> up. You have so many CPU cores, anyway.
>
> But before that, because I know that the 6 table files are sorted by {id1,
> id2}, I can process them in parallel in a merge-sort-like way, where I grab
> the input line with the least key (assuming ascending order) to process,
> and if more than one file is currently at that key, then I merge the
> inputs as you described.
>
> Another trick I’d try pulling, since we are pre-processing the data
> anyway, is to assign the generated node and relationship records unique
> integer id’s. This would allow you to tell import-tool to use --id-type
> ACTUAL, which means it won’t have to spend time maintaining a mapping
> between internal record ids, and the varchar ids in your dataset. This will
> speed up the import.
>
> If for every {ABC, XYZ, ?} record, there’s also a {XYZ, ABC, ?} record,
> then the node ids will be easy to generate, as they can just be a count of
> how many different id1 values you’ve seen. If that assumption does not
> hold, however, then you also need to look at all the id2 values to generate
> node records, which is annoying since they don’t come in sorted order,
> which in turn means that you need to somehow filter out values you’ve
> already seen (or otherwise deterministically compute a distinct integer
> from the value). But all this is only if you want to use --id-type ACTUAL.
> I don’t know if that’s possible for you. Otherwise the import will just
> take a bit longer.
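>
> For reference, the input files and the invocation would look roughly like
> this (a sketch only; the file names, column names, and relationship type
> here are made up):
>
> nodes.csv:
> :ID,id:string
> 0,ABC
> 1,XYZ
>
> relationships.csv:
> :START_ID,:END_ID,:TYPE,weight:int
> 0,1,RELATED,10
>
> bin/neo4j-import --into graph.db --nodes nodes.csv \
>   --relationships relationships.csv --id-type ACTUAL
>
> With --id-type ACTUAL, the integers in the :ID, :START_ID and :END_ID
> columns are used directly as the node record ids, which is what saves the
> mapping work.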
>
> One other thing: you can’t put more than 2 billion nodes in a schema
> index, so using LOAD CSV with MERGE won’t work for a dataset this big. This
> is a limitation of the version of Lucene that we use. (I don’t know if
> newer versions of Lucene lift this limit, but we plan on addressing it in
> the future regardless.)
>
> --
> Chris Vest
> System Engineer, Neo Technology
> [ skype: mr.chrisvest, twitter: chvest ]
>
>
> On 10 Feb 2015, at 04:39, Jesse Liu <[email protected]> wrote:
>
> Hi, All,
>
> Thanks for all the help!
>
> Now I'm following Ziolek's suggestion: use the latest version of Neo4j and
> use the import tool.
> Following the import-tool examples, I've also updated my scenario, as
> described below.
>
> Actually, I want to build social network circles.
> I have six months of data in the Oracle database, stored as six tables
> named TH_07, TH_08, TH_09, TH_10, TH_11, TH_12 respectively.
> Every table has the same schema:
> id1 varchar, id2 varchar, relationship_property int; the primary key is
> {id1, id2}.
> P.S. The same {id1, id2} pair may appear in different tables, but with a
> different relationship_property; e.g. there is one and only one record
> {ABC, XYZ, 10} in TH_07, and one and only one record {ABC, XYZ, int} in
> other tables like TH_09.
> Each table has about 80~90 million rows!
>
> By the way, I set up Neo4j and Oracle on exactly the same machine, with
> 256GB RAM and a 64-core CPU.
>
> I want to build a graph database in which each id1 and id2 is a node, and
> for every record (id1, id2, relationship_property) in Oracle there is a
> relationship between id1 and id2 carrying relationship_property.
>
> The First Question:
> According to http://neo4j.com/docs/2.2.0-M03/import-tool-examples.html,
> first of all I should export the nodes from Oracle into CSV files.
> I need UNIQUE nodes by id, so I have three choices:
> 1. use DISTINCT in Oracle, but I have six tables so it's very hard;
> 2. use MERGE in Cypher, but it's too slow! I cannot stand the low efficiency;
> 3. use Python to connect to Oracle and preprocess the data in Python
> (since I have 256GB RAM it's possible to process data this big).
> Is it possible to import 7.5 billion nodes in one go from a CSV file?
>
> The Second Question:
> How can I update the relationship_property? For example, I have {ABC, XYZ,
> 10} in table TH_07 and {ABC, XYZ, 20} in table TH_08, and for simplicity I
> want the relationship between ABC and XYZ to end up with 10+20 = 30.
> 1. Should I also process this in Python?
> 2. Or can I do this in Cypher? (See the sketch below.)
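>
> What I imagine for the Cypher option is something like this (untested, and
> the :User label, RELATED relationship type, and file path are my own
> guesses):
>
> USING PERIODIC COMMIT
> LOAD CSV WITH HEADERS FROM 'file:///TH_08.csv' AS row
> MATCH (a:User {id: row.id1}), (b:User {id: row.id2})
> MERGE (a)-[r:RELATED]->(b)
>   ON CREATE SET r.weight = toInt(row.relationship_property)
>   ON MATCH SET r.weight = r.weight + toInt(row.relationship_property);
>
> Run once per table's CSV export, this should leave r.weight as the sum over
> all six months. Would that work, or is MERGE also too slow at this scale?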
>
> The Third Question:
> I've tried LOAD CSV with the Neo4j 2.1.6 community edition.
> The Cypher is exactly as shown below:
> USING PERIODIC COMMIT
> LOAD CSV WITH HEADERS FROM 'FILEPATH' AS row
> CREATE (n:User {id: row.id1})
> CREATE (m:User {id: row.id2})
>
> However, during processing I encountered an error like "Kernel error,
> please restart or recover", or something like that (sorry, I did not record
> the exact message).
>
> The Last Question:
> How should I configure the Neo4j server? As you know, I have 7.5 billion
> nodes and about 100 billion relationships. After importing the data, I need
> to run computations such as Degree Centrality, Betweenness Centrality and
> Closeness Centrality.
> How can I use my machine efficiently? I guess the main knobs are the page
> cache and the JVM heap; my guess at the settings is below.
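>
> (The sizes below are placeholders from my reading of the 2.2 docs, not
> recommendations.)
>
> # conf/neo4j.properties: page cache for the store files
> dbms.pagecache.memory=160g
>
> # conf/neo4j-wrapper.conf: JVM heap size, in MB
> wrapper.java.initmemory=32000
> wrapper.java.maxmemory=32000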
>
> Thank you!
>
> Yours, Jesse
>
> On Tuesday, February 3, 2015 at 4:18:39 PM UTC+8, Jesse Liu wrote:
>
>> Hi, All,
>>
>> I'm a beginner with the graph database Neo4j.
>> Now I need to import the data from Oracle to Neo4j.
>>
>> First, I'll describe my application scenario.
>>
>> I have just one Oracle table with more than 100 million rows.
>> The table schema is:
>> id1 varchar, id2 varchar, relation_property int.
>>
>> id1 and id2 together form the primary key.
>>
>> The oracle server and Neo4J server are set up on the same machine.
>>
>> Now, how can I create a node for each id and one directed relationship
>> between id1 and id2 for each row?
>>
>> As far as I know, there are three ways to do this:
>> 1. Java REST JDBC API
>> I've written a demo and found it's too slow: about 10,000 rows per minute.
>> Besides, it's not easy to establish a Java environment.
>>
>> 2. Python Embedded
>> I haven't written test code yet, but I don't think it would be better than
>> Java.
>>
>> 3. Batch insert
>> Export the data from Oracle as CSV files;
>> Import the CSV data into Neo4j using Cypher.
>> I believe this is the fastest way to import data. However, I don't know
>> how to do it. All the demos I've seen on the Internet add nodes, but none
>> add relationships with specific properties; a sketch of what I'm hoping
>> for is below.
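>>
>> What I have in mind is something like this (pieced together from the docs
>> and untested; the :User label, RELATED type, and file name are made up):
>>
>> CREATE CONSTRAINT ON (u:User) ASSERT u.id IS UNIQUE;
>>
>> USING PERIODIC COMMIT
>> LOAD CSV WITH HEADERS FROM 'file:///pairs.csv' AS row
>> MERGE (a:User {id: row.id1})
>> MERGE (b:User {id: row.id2})
>> CREATE (a)-[:RELATED {weight: toInt(row.relation_property)}]->(b);
>>
>> Is that the right approach for relationships with specific properties?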
>>
>> I wonder whether anybody has encountered such a scenario. Can you give me
>> some advice? Or is there a better way to import the data?
>>
>> Thank you very much!
>>
>> Jesse
>> Feb 3rd, 2015
>>