[Neo4j] Scalable solution for updating graph data in Neo4j 2.0

Yun Wang Sat, 01 Feb 2014 12:27:45 -0800


*Question background*


We are building a graph database of Twitter users and tweets, updated in 
batches as new data arrives.

We store as graph nodes: each user and each tweet.

We store as graph edges: tweet-tweet relationships, and user-user 
relationships (derived, based on users who retweet or reply to others).

 

*Problem*: Updating the graph is very slow and does not scale.

 

*Goal*: Scalable, efficient updates of the existing Neo4j graph as new 
tweets come in (each tweet translates to nodes and edges).  *Constraint*: If 
a node (e.g., a user) already exists, we do not want to duplicate it. 
Similarly, if an edge (a user-user relationship) already exists, we only 
want to update its weight.
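
To make the edge constraint concrete, here is a minimal sketch of one way to 
phrase it as a single parameterized Cypher statement, held in a Python 
string. The relationship type INTERACTS_WITH, the property name weight, and 
the helper edge_upsert are hypothetical names chosen for illustration; it 
also assumes the ON CREATE SET / ON MATCH SET form of MERGE and the {param} 
parameter syntax of Neo4j 2.0. It is not a measured or recommended solution.

```python
# Hypothetical relationship type and property names; the post does not name them.
EDGE_UPSERT = (
    "MERGE (a:User { name: {src} }) "
    "MERGE (b:User { name: {dst} }) "
    "MERGE (a)-[r:INTERACTS_WITH]->(b) "
    "ON CREATE SET r.weight = 1 "           # first interaction: start at 1
    "ON MATCH SET r.weight = r.weight + 1"  # edge exists: only bump the weight
)


def edge_upsert(src, dst):
    """Return one statement entry (query text plus parameters) for a user-user edge."""
    return {"statement": EDGE_UPSERT, "parameters": {"src": src, "dst": dst}}


entry = edge_upsert("tom", "ann")
```

Because the names are passed as parameters, the statement text is identical 
for every edge in a batch.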

 

*What we have tried*:

*Option 1*: We tried using Cypher's MERGE clause to insert nodes only when 
they do not already exist. We also executed the Cypher queries in batches 
in order to reduce REST latency.

 

Sample Cypher queries used to update the database (one batch of two 
statements):

            "MERGE (user:User { name:'tom' })",
            "MERGE (tweet:Tweet { tweet_id:'101' })"
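
For illustration, the same batch could be built with parameters instead of 
inline literals, so that every User (or Tweet) MERGE shares one statement 
text. This is a sketch, not the code we ran: the helpers merge_statement and 
build_batch are hypothetical, and it assumes the JSON body is sent to the 
transactional HTTP endpoint of Neo4j 2.0 (a POST to 
/db/data/transaction/commit), which accepts a list of statements with 
parameters.

```python
import json


def merge_statement(label, key, value):
    """One parameterized MERGE; label and key must come from trusted code."""
    return {
        "statement": "MERGE (n:%s { %s: {value} })" % (label, key),
        "parameters": {"value": value},
    }


def build_batch(users, tweet_ids):
    """Serialize all MERGE statements for one batch into a single request body."""
    statements = [merge_statement("User", "name", u) for u in users]
    statements += [merge_statement("Tweet", "tweet_id", t) for t in tweet_ids]
    return json.dumps({"statements": statements})


payload = build_batch(["tom"], ["101"])
```

The payload is only constructed here, not sent; the HTTP client and server 
address are left out on purpose.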

 

We created indexes on the properties used for lookup: 'name' on User nodes 
and 'tweet_id' on Tweet nodes.

We also raised the Linux limit on open file descriptors to improve 
performance.
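
The exact statements are not reproduced here; as an assumption, indexes of 
this kind would be declared in Neo4j 2.0 with CREATE INDEX ON, one statement 
per label and property. The helper index_statement below is a hypothetical 
name used only to generate those strings.

```python
def index_statement(label, prop):
    """Build a Neo4j 2.0 schema index declaration for one label/property pair."""
    return "CREATE INDEX ON :%s(%s)" % (label, prop)


# The two indexes described above.
SCHEMA = [
    index_statement("User", "name"),
    index_statement("Tweet", "tweet_id"),
]
```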

 

*Problems with Option 1*:

The performance of uniqueness checks with MERGE dropped dramatically as the 
database grew. For example, inserting 100 records took 2.7 seconds when the 
database was empty, but inserting the same amount of data took 62 seconds 
once the database held 100,000 existing records.
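
Put per record, those two timings work out as follows (plain arithmetic on 
the numbers above; the variable names are only for illustration):

```python
records = 100
empty_db_seconds = 2.7     # 100 inserts into an empty database
loaded_db_seconds = 62.0   # 100 inserts with 100,000 existing records

# Average cost of one insert, in milliseconds.
per_record_empty_ms = empty_db_seconds / records * 1000
per_record_loaded_ms = loaded_db_seconds / records * 1000

# How many times slower the same batch became.
slowdown = loaded_db_seconds / empty_db_seconds

print("%.0f ms vs %.0f ms per record, about %.0fx slower"
      % (per_record_empty_ms, per_record_loaded_ms, slowdown))
```

That is roughly 27 ms per record on the empty database against 620 ms per 
record at 100,000 records, a slowdown of about 23x.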

 

*Option 2*: The other option we have tried is to check uniqueness 
externally. That is, we read all existing nodes and edges and build a hash 
table outside Neo4j (e.g., in Python or Java) to check uniqueness. This 
stays faster than MERGE as the database grows. However, it does not seem 
elegant to have to extract the existing nodes before each batch update: it 
requires a read plus a write against the Neo4j database, instead of only a 
write.
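
A minimal sketch of the external check, assuming user names identify users 
and a (source, target) pair identifies an edge. The class GraphCache and its 
method names are hypothetical; loading the existing nodes and edges from 
Neo4j, and writing the results back, are left out.

```python
class GraphCache:
    """In-memory hash tables mirroring the users and weighted edges in the graph."""

    def __init__(self, existing_users=(), existing_edges=()):
        self.users = set(existing_users)
        # Maps (source, target) -> weight; accepts a dict or (key, weight) pairs.
        self.edge_weights = dict(existing_edges)

    def add_user(self, name):
        """Return True if the user is new and therefore needs to be created."""
        if name in self.users:
            return False
        self.users.add(name)
        return True

    def add_interaction(self, src, dst):
        """Count one retweet/reply from src to dst and return the new edge weight."""
        key = (src, dst)
        self.edge_weights[key] = self.edge_weights.get(key, 0) + 1
        return self.edge_weights[key]


cache = GraphCache(existing_users=["tom"], existing_edges={("tom", "ann"): 2})
is_new_tom = cache.add_user("tom")              # already known
is_new_ann = cache.add_user("ann")              # first time seen
weight = cache.add_interaction("tom", "ann")    # existing edge, weight goes up
```

Each lookup is a constant-time hash probe, which is why this approach does 
not slow down with graph size, at the cost of holding all keys in memory.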

 

We are wondering whether there is an elegant solution for large-scale 
updates in Neo4j. We suspect this is a common question, and that someone 
has already encountered it and/or developed a robust solution.
