*Question background*
We are building a graph database of Twitter users and tweets, updated in
batches as new data arrives.
We store as graph nodes: each user and each tweet.
We store as graph edges: tweet-tweet relationships and user-user
relationships (derived from users who retweet or reply to others).
*Problem*: Updating the graph is very slow and does not scale.
*Goal*: Scalable, efficient updates of the existing Neo4j graph as new
tweets come in (each tweet translates to nodes and edges).
*Constraint*: If a node (e.g., a user) already exists, we do not want to
duplicate it. Similarly, if an edge (a user-user relationship) already
exists, we only want to update its weight.
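To make the constraint concrete, here is a minimal sketch of an "insert or bump the weight" statement, built in Python. The relationship type INTERACTS_WITH, the property name weight, and the helper build_edge_upsert are assumptions made for illustration and are not part of our schema as described above. The $src / $dst parameter syntax assumes a Neo4j version that accepts it (older versions write parameters as {src}).

```python
# Sketch only: INTERACTS_WITH, weight and build_edge_upsert are hypothetical names.
# MERGE matches the pattern if it exists and creates it otherwise;
# ON CREATE / ON MATCH choose which SET runs in each case.
EDGE_UPSERT = """
MERGE (a:User {name: $src})
MERGE (b:User {name: $dst})
MERGE (a)-[r:INTERACTS_WITH]->(b)
ON CREATE SET r.weight = 1
ON MATCH SET r.weight = r.weight + 1
"""


def build_edge_upsert(src, dst):
    """Return (query, params) for one retweet/reply from src to dst."""
    return EDGE_UPSERT, {"src": src, "dst": dst}


if __name__ == "__main__":
    query, params = build_edge_upsert("tom", "ann")
    print(query.strip())
    print(params)
```

Passing the names as parameters rather than splicing them into the query text keeps the statement identical across calls, which should let the server reuse its query plan.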
*What we have tried*:
*Option 1*: We tried using Cypher's 'MERGE' clause to insert nodes uniquely. We
also executed the Cypher queries in batches to reduce REST latency.
Sample Cypher queries used to update the database:
MERGE (user:User { name: 'tom' })
MERGE (tweet:Tweet { tweet_id: '101' })
We created indexes on node properties such as 'name' on User nodes and
'tweet_id' on Tweet nodes.
We also raised the 'open file descriptors' limit on Linux to gain better
performance.
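For reference, the batching described above can be sketched as one parameterized statement per chunk instead of one 'MERGE' string per record. This is an illustration under assumptions, not our exact code: it assumes a Neo4j version that supports UNWIND and $-style parameters, and run(query, params) is a hypothetical callable standing in for whatever client actually sends the statement.

```python
from itertools import islice

# One statement handles a whole list of rows passed in as a parameter.
USER_BATCH_MERGE = """
UNWIND $rows AS row
MERGE (u:User {name: row.name})
"""


def chunked(items, size):
    """Yield successive lists of at most `size` items."""
    it = iter(items)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def send_batches(run, names, size=1000):
    """Send user names in chunks; `run` is a hypothetical executor callable.

    Returns the number of statements sent.
    """
    sent = 0
    for chunk in chunked(names, size):
        run(USER_BATCH_MERGE, {"rows": [{"name": n} for n in chunk]})
        sent += 1
    return sent


if __name__ == "__main__":
    # Stand-in executor that just prints what would be sent.
    send_batches(lambda q, p: print(len(p["rows"]), "rows"), ["tom", "ann", "bob"], size=2)
```

The chunk size of 1000 is an arbitrary placeholder; a suitable value would have to be found by measurement.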
*Problems with Option 1*:
The performance of the uniqueness check with 'MERGE' dropped
dramatically with scale / over time. For example, it took 2.7 seconds to
insert 100 records when the database was empty, but 62 seconds
to insert the same amount of data with 100,000 existing records.
*Option 2*: The other option we have tried is to check uniqueness
externally. That is, we extract all nodes and edges and build a hash table
outside Neo4j (e.g., in Python or Java) to check uniqueness. This stays faster
than the 'MERGE' approach as the database grows. However, it does not seem
elegant to have to extract the existing nodes before each batch update: it
requires a read plus a write against the Neo4j database, instead of only a write.
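A minimal sketch of the external check in Option 2, assuming the existing keys (user names or tweet ids) have already been read into a Python set. The function name split_new_and_existing is made up for this example.

```python
def split_new_and_existing(known_ids, batch_ids):
    """Partition a batch into ids to create and ids that already exist.

    `known_ids` is a set of keys already in the graph; it is updated in
    place, so a key repeated inside the same batch is created only once.
    """
    new, existing = [], []
    for key in batch_ids:
        if key in known_ids:
            existing.append(key)
        else:
            known_ids.add(key)
            new.append(key)
    return new, existing


if __name__ == "__main__":
    known = {"tom"}  # would be loaded from Neo4j in practice
    new, existing = split_new_and_existing(known, ["tom", "ann", "ann"])
    print("create:", new)
    print("update:", existing)
```

Ids in the first list can be sent as plain creates and ids in the second as weight updates. If the set were kept in memory between batches, the read from Neo4j might only be needed once at startup, but that assumes a single writer process and enough memory to hold all keys.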
We are wondering if there is an elegant solution for large-scale data updates in
Neo4j. We feel this may be a common question for many users, and someone
may have previously encountered this and/or developed a robust solution.