Hey,

It should just take a few seconds.

I presume:

you are using Neo4j 2.3.2?
you created indexes / constraints for the things you merge on?
you configured your Neo4j instance to run with at least 4G of heap?
you are using PERIODIC COMMIT? (see the sketch below)
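
For reference, a minimal periodic-commit sketch (the 1000-row batch size is
just an example; the file path is the one from your statement):

USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:///E:/datasets/Actors_data_all.csv" AS row
MERGE (a:Actor {Name: row.ActorName});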

I suggest you run PROFILE on your statement to see where the biggest
issues show up.
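
For example, a sketch that profiles a cut-down version of your statement on
a sample (the LIMIT is just there to keep the run short):

PROFILE
LOAD CSV WITH HEADERS FROM "file:///E:/datasets/Actors_data_all.csv" AS row
WITH row LIMIT 1000
MERGE (c:Country {Name: row.Country});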

Otherwise I strongly recommend splitting it up.

e.g. like this:

CREATE CONSTRAINT ON (c:Country) ASSERT c.Name IS UNIQUE;
CREATE CONSTRAINT ON (o:Organization) ASSERT o.Name IS UNIQUE;
CREATE CONSTRAINT ON (a:Actor) ASSERT a.Name IS UNIQUE;


LOAD CSV WITH HEADERS FROM "file:///E:/datasets/Actors_data_all.csv" AS row
WITH DISTINCT row.Country AS Country
MERGE (c:Country {Name: Country});

LOAD CSV WITH HEADERS FROM "file:///E:/datasets/Actors_data_all.csv" AS row
WITH DISTINCT row.AffiliationTo AS AffiliationTo
MERGE (o:Organization {Name: AffiliationTo});

LOAD CSV WITH HEADERS FROM "file:///E:/datasets/Actors_data_all.csv" AS row
MERGE (a:Actor {Name: row.ActorName})
  ON CREATE SET a.Aliases = row.Aliases, a.Type = row.ActorType;

LOAD CSV WITH HEADERS FROM "file:///E:/datasets/Actors_data_all.csv" AS row
WITH DISTINCT row.Country AS Country, row.ActorName AS ActorName
MATCH (c:Country {Name: Country})
MATCH (a:Actor {Name: ActorName})
MERGE (c)<-[:IS_FROM]-(a);

LOAD CSV WITH HEADERS FROM "file:///E:/datasets/Actors_data_all.csv" AS row
MATCH (o:Organization {Name: row.AffiliationTo})
MATCH (a:Actor {Name: row.ActorName})
MERGE (a)-[r:AFFILIATED_TO]->(o)
  ON CREATE SET r.Start = row.AffiliationStartDate,
                r.End = row.AffiliationEndDate;



On Sat, Feb 27, 2016 at 1:53 PM, Maryam Gerami <[email protected]> wrote:

> Hi
>
> I have a medium-sized dataset with 100,000 rows, and I use this command for
> importing data from a CSV file into the graph database:
>
> LOAD CSV WITH HEADERS FROM "file:///E:/datasets/Actors_data_all.csv" AS row
> MERGE (c:Country {Name: row.Country})
> MERGE (a:Actor {Name: row.ActorName, Aliases: row.Aliases,
>        Type: row.ActorType})
> MERGE (o:Organization {Name: row.AffiliationTo})
> MERGE (a)-[:AFFILIATED_TO {Start: row.AffiliationStartDate,
>        End: row.AffiliationEndDate}]->(o)
> MERGE (c)<-[:IS_FROM]-(a);
>
> My PC has 8GB RAM
>
> Is it normal for it to take more than 6 hours to create 42,000 nodes and
> 100,000 relationships? If not, would you please help me find out how to fix
> this problem and import the data from the CSV file faster?
>
> Thanks in advance
> --
> Maryam Gerami
> R&D department, Informatics services corporation
>
> On Tuesday, February 10, 2015 at 12:44:55 PM UTC+3:30, Chris Vest wrote:
>
>> That’s a nice big dataset.
>>
>> I would output the tables to CSV files in sorted order by {id1, id2}.
>> Then I’d preprocess them with Python and generate CSV files that
>> import-tool can eat. I’d probably keep the files compressed on disk at all
>> times – import-tool in 2.2-M03 can eat .csv.zip and .csv.gz files just
>> fine. A big portion of this import is going to be IO bound, so compression
>> will speed that up. You have so many CPU cores, anyway.
>>
>> But before that, because I know that the 6 table files are sorted by
>> {id1, id2}, I can process them in parallel in a merge-sort-like way, where
>> I grab the input line with the least key (assuming ascending order) to
>> process, and if more than one file is currently at that key, then I merge
>> the inputs as you described.
>>
>> Another trick I’d try pulling, since we are pre-processing the data
>> anyway, is to assign the generated node and relationship records unique
>> integer ids. This would allow you to tell the import-tool to use --id-type
>> ACTUAL, which means it won’t have to spend time maintaining a mapping
>> between internal record ids and the varchar ids in your dataset. This will
>> speed up the import.
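>>
>> For reference, a minimal invocation sketch (the target directory and file
>> names are made up; the flags are from the 2.2 import-tool docs):
>>
>> ./bin/neo4j-import --into graph.db --nodes nodes.csv \
>>     --relationships rels.csv --id-type ACTUAL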
>>
>> If for every {ABC, XYZ, ?} record, there’s also a {XYZ, ABC, ?} record,
>> then the node ids will be easy to generate, as they can just be a count of
>> how many different id1 values you’ve seen. If that assumption does not
>> hold, however, then you also need to look at all the id2 values to generate
>> node records, which is annoying since they don’t come in sorted order,
>> which in turn means that you need to somehow filter out values you’ve
>> already seen (or otherwise deterministically compute a distinct integer
>> from the value). But all this is only if you want to use --id-type ACTUAL.
>> I don’t know if that’s possible for you. Otherwise the import will just
>> take a bit longer.
>>
>> One other thing: you can’t put more than 2 billion nodes in a schema
>> index, so using LOAD CSV with MERGE won’t work for a dataset this big. This
>> is a limitation of the version of Lucene that we use. (I don’t know if
>> newer versions of Lucene lift this limit, but we plan on addressing it in
>> the future regardless.)
>>
>> --
>> Chris Vest
>> System Engineer, Neo Technology
>> [ skype: mr.chrisvest, twitter: chvest ]
>>
>>
>> On 10 Feb 2015, at 04:39, Jesse Liu <[email protected]> wrote:
>>
>> Hi, All,
>>
>> Thanks for all the help!
>>
>> Now I'm considering Ziolek's suggestion: use the latest version of Neo4j
>> and use the import tool.
>> Following the import tool examples, I've also updated my scenario, as
>> described below.
>>
>> Actually, I want to build social network circles.
>> I have six months of data in the Oracle database, stored as six tables
>> named TH_07, TH_08, TH_09, TH_10, TH_11, TH_12 respectively.
>> Every table has the same schema:
>> id1 varchar, id2 varchar, relationship_property int, primary key is {id1,
>> id2}
>> P.S. There may be exactly the same {id1, id2} pair between different
>> tables, but with different relationship_property, e.g. there is one and
>> only one record {ABC, XYZ, 10} in TH_07, and one and only one record {ABC,
>> XYZ, int} in other tables like TH_09.
>> Each table has about 80~90 million rows!
>>
>> By the way, I set up the Neo4j database and Oracle on exactly the same
>> machine, with 256GB RAM and a 64-core CPU.
>>
>> I want to build a graph database in which each id1 and id2 represents a
>> node, and if there is a record (id1, id2, relation_property) in Oracle, a
>> relationship is created between id1 and id2 with relation_property.
>>
>> The First Question:
>> According to http://neo4j.com/docs/2.2.0-M03/import-tool-examples.html,
>> first of all I should export the nodes from Oracle into a CSV file (in the
>> header format sketched below).
>> I need UNIQUE nodes with ids, so I have three choices:
>> 1. use DISTINCT in Oracle, but I have six tables, so it's very hard;
>> 2. use MERGE in Cypher, but it's too slow! I cannot stand the low
>> efficiency;
>> 3. use Python to connect to Oracle and preprocess the data in Python
>> (since I have 256GB RAM, it's possible to process such big data).
>> Is it possible to import 7.5 billion nodes at once from a CSV file?
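>>
>> For reference, a sketch of the header format the 2.2.0-M03 import tool
>> expects (the file names and the RELATED_TO type are made up):
>>
>> nodes.csv:
>> id:ID,:LABEL
>> ABC,User
>>
>> rels.csv:
>> :START_ID,:END_ID,relationship_property:int,:TYPE
>> ABC,XYZ,10,RELATED_TO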
>>
>> The Second Question:
>> How can I update the relationship_property? For example, I have {ABC, XYZ,
>> 10} in table TH_07 and {ABC, XYZ, 20} in table TH_08, so for simplicity I
>> want the relationship between {ABC} and {XYZ} to end up with 10+20 = 30.
>> 1. Process it also in Python?
>> 2. Can I do this in Cypher? (see the sketch below)
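>>
>> In Cypher, one way would be MERGE with ON CREATE SET / ON MATCH SET. A
>> sketch, assuming User nodes as above and a rels.csv with columns id1, id2,
>> prop (RELATED_TO is a made-up type):
>>
>> USING PERIODIC COMMIT
>> LOAD CSV WITH HEADERS FROM 'file:///rels.csv' AS row
>> MATCH (a:User {id: row.id1}), (b:User {id: row.id2})
>> MERGE (a)-[r:RELATED_TO]->(b)
>>   ON CREATE SET r.weight = toInt(row.prop)
>>   ON MATCH SET r.weight = r.weight + toInt(row.prop);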
>>
>> The Third Question:
>> I've tried LOAD CSV in the Neo4j 2.1.6 community version.
>> The exact Cypher is shown below:
>> USING PERIODIC COMMIT
>> LOAD CSV WITH HEADERS FROM 'FILEPATH' AS row
>> CREATE (n:User {id: row.id1})
>> CREATE (m:User {id: row.id2})
>>
>> However, during processing, I encountered an error such as "Kernel
>> error, please restart or recover", something like that (sorry, I did not
>> record the exact error).
>>
>> The Last Question:
>> How should I set the Neo4j server configuration? As you know, I have 7.5
>> billion nodes and about 100 billion relationships. After importing the
>> data, I need to run computations such as Degree Centrality, Betweenness
>> Centrality, and Closeness Centrality.
>> How can I use my computer efficiently?
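>>
>> For reference, a sketch of the two main memory settings (names as of Neo4j
>> 2.2; the sizes are made-up examples for a 256GB machine):
>>
>> # conf/neo4j-wrapper.conf (JVM heap, in MB)
>> wrapper.java.initmemory=16384
>> wrapper.java.maxmemory=16384
>>
>> # conf/neo4j.properties (page cache for the store files)
>> dbms.pagecache.memory=160g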
>>
>> Thank you!
>>
>> Yours, Jesse
>>
>> On Tuesday, 3 February 2015 at 4:18:39 PM UTC+8, Jesse Liu wrote:
>>
>>> Hi, All,
>>>
>>> I'm a beginner with the graph database Neo4j.
>>> Now I need to import the data from Oracle to Neo4j.
>>>
>>> First, I'll describe my application scenario.
>>>
>>> I have just one Oracle table with more than 100 million rows.
>>> The table schema is:
>>> id1 varchar, id2 varchar, relation_property int.
>>>
>>> id1 and id2 form the primary key.
>>>
>>> The Oracle server and the Neo4j server are set up on the same machine.
>>>
>>> Now, how can I create a node for each id and one directed relationship
>>> between id1 and id2 for each row?
>>>
>>> As far as I know, there are three ways to do this:
>>> 1. Java REST JDBC API
>>> I've written a code demo and found it's too slow: 100,00 rows per minute.
>>> Besides, it's not easy to establish a Java environment.
>>>
>>> 2. Python Embedded.
>>> I haven't written test code yet, but I think it's no better than
>>> Java.
>>>
>>> 3. Batch Insert
>>> Export the data from Oracle as a CSV file;
>>> Import the CSV data into Neo4j using Cypher.
>>> I believe it's the fastest way to import data. However, I don't know how
>>> to do this. All the demos I've seen on the Internet are about adding nodes
>>> but not about adding relationships with specific properties.
>>>
>>> I wonder, has anybody encountered such a scenario? Can you give me some
>>> advice? Or is there any better solution to import data?
>>>
>>> Thank you very much!
>>>
>>> Jesse
>>> Feb 3rd, 2015
>>>
>>