Hi, that's really slow, something is off. Do you by chance also have the CSV available for testing?
Also, what version is this? neo4j-community-2.0.1-RC4? There is no RC4.

Michael

On 14.02.2014 at 17:45, Yun Wang <[email protected]> wrote:

> Hi Michael, thank you for your reply.
> I did a test using parameters and the transactional endpoint. The speed was still not satisfying.
>
> It simply loads a CSV and inserts nodes.
> The CSV file contains 5,000 lines, and each line generates four MERGE operations for node insertion.
> It takes 107 secs to finish the insertion.
>
> This is the Python code of the test:
>
> from py2neo import neo4j
> from py2neo import node, rel
> from py2neo import cypher
>
> if __name__ == '__main__':
>
>     # Use py2neo transaction
>     session = cypher.Session("http://localhost:7474")
>     tx = session.create_transaction()
>
>     # Python dictionaries for parameters
>     user = {'name': ''}
>     tweet = {'tweet': ''}
>
>     # Each line of data needs 4 MERGE operations,
>     # so every 250 lines, execute the transaction
>     size = 250
>     cnt = 1
>
>     # Create index on two node attributes
>     tx.append("CREATE INDEX ON :User(name)")
>     tx.append("CREATE INDEX ON :Tweet(tid)")
>     tx.commit()
>
>     tx = session.create_transaction()
>     with open('sample_data_5k.csv') as input:
>         for line in input.readlines():
>
>             row = line[:-1].split(',')
>
>             tweetID = row[0]
>             retweetID = row[1]
>             tweetUser = row[2]
>             retweetUser = row[3]
>
>             # Create user node of tweet
>             createTweetUser = 'MERGE (user:User { name:{name} })'
>             user['name'] = tweetUser
>             tx.append(createTweetUser, user)
>
>             # Create user node of retweet
>             createRetweetUser = 'MERGE (user:User { name:{name} })'
>             user['name'] = retweetUser
>             tx.append(createRetweetUser, user)
>
>             # Create tweet node
>             createTweet = 'MERGE (tweet:Tweet { tid:{tweet} })'
>             tweet['tweet'] = tweetID
>             tx.append(createTweet, tweet)
>
>             # Create retweet node
>             createRetweet = 'MERGE (tweet:Tweet { tid:{tweet} })'
>             tweet['tweet'] = retweetID
>             tx.append(createRetweet, tweet)
>
>             # Four MERGE statements per line, so execute every 250 lines
>             if cnt % 250 == 0:
>                 tx.execute()
>             cnt += 1
>
>     try:
>         tx.commit()
>     except cypher.TransactionError as e:
>         print("--------------------------------------------")
>         print(e.message)
>
> Link to the gist: https://gist.github.com/desertnerd/9004165
>
> This is part of the messages.log file; I also post a gist link to the full log:
>
> 2014-02-14 16:12:55.614+0000 INFO [o.n.k.i.DiagnosticsManager]: --- INITIALIZED diagnostics START ---
> 2014-02-14 16:12:55.618+0000 INFO [o.n.k.i.DiagnosticsManager]: Neo4j Kernel properties:
> 2014-02-14 16:12:55.624+0000 INFO [o.n.k.i.DiagnosticsManager]: neostore.propertystore.db.mapped_memory=782M
> 2014-02-14 16:12:55.624+0000 INFO [o.n.k.i.DiagnosticsManager]: neo_store=/home/wang/NEO4J/neo4j-community-2.0.1-RC4/data/graph.db/neostore
> 2014-02-14 16:12:55.624+0000 INFO [o.n.k.i.DiagnosticsManager]: neostore.nodestore.db.mapped_memory=217M
> 2014-02-14 16:12:55.624+0000 INFO [o.n.k.i.DiagnosticsManager]: neostore.propertystore.db.strings.mapped_memory=664M
> 2014-02-14 16:12:55.624+0000 INFO [o.n.k.i.DiagnosticsManager]: neo4j.ext.udc.source=server
> 2014-02-14 16:12:55.624+0000 INFO [o.n.k.i.DiagnosticsManager]: store_dir=data/graph.db
> 2014-02-14 16:12:55.624+0000 INFO [o.n.k.i.DiagnosticsManager]: neostore.relationshipstore.db.mapped_memory=958M
> 2014-02-14 16:12:55.624+0000 INFO [o.n.k.i.DiagnosticsManager]: keep_logical_logs=true
> 2014-02-14 16:12:55.624+0000 INFO [o.n.k.i.DiagnosticsManager]: neostore.propertystore.db.arrays.mapped_memory=753M
> 2014-02-14 16:12:55.624+0000 INFO [o.n.k.i.DiagnosticsManager]: remote_shell_enabled=true
> 2014-02-14 16:12:55.624+0000 INFO [o.n.k.i.DiagnosticsManager]: ephemeral=false
> 2014-02-14 16:12:55.625+0000 INFO [o.n.k.i.DiagnosticsManager]: Diagnostics providers:
> 2014-02-14 16:12:55.626+0000 INFO [o.n.k.i.DiagnosticsManager]: org.neo4j.kernel.configuration.Config
> 2014-02-14 16:12:55.626+0000 INFO [o.n.k.i.DiagnosticsManager]: org.neo4j.kernel.info.DiagnosticsManager
> 2014-02-14 16:12:55.626+0000 INFO [o.n.k.i.DiagnosticsManager]: SYSTEM_MEMORY
> 2014-02-14 16:12:55.626+0000 INFO [o.n.k.i.DiagnosticsManager]: JAVA_MEMORY
> 2014-02-14 16:12:55.626+0000 INFO [o.n.k.i.DiagnosticsManager]: OPERATING_SYSTEM
> 2014-02-14 16:12:55.626+0000 INFO [o.n.k.i.DiagnosticsManager]: JAVA_VIRTUAL_MACHINE
> 2014-02-14 16:12:55.626+0000 INFO [o.n.k.i.DiagnosticsManager]: CLASSPATH
> 2014-02-14 16:12:55.626+0000 INFO [o.n.k.i.DiagnosticsManager]: LIBRARY_PATH
> 2014-02-14 16:12:55.626+0000 INFO [o.n.k.i.DiagnosticsManager]: SYSTEM_PROPERTIES
> 2014-02-14 16:12:55.626+0000 INFO [o.n.k.i.DiagnosticsManager]: LINUX_SCHEDULERS
> 2014-02-14 16:12:55.626+0000 INFO [o.n.k.i.DiagnosticsManager]: NETWORK
> 2014-02-14 16:12:55.626+0000 INFO [o.n.k.i.DiagnosticsManager]: System memory information:
> 2014-02-14 16:12:55.628+0000 INFO [o.n.k.i.DiagnosticsManager]: Total Physical memory: 9.70 GB
> 2014-02-14 16:12:55.629+0000 INFO [o.n.k.i.DiagnosticsManager]: Free Physical memory: 6.95 GB
> 2014-02-14 16:12:55.629+0000 INFO [o.n.k.i.DiagnosticsManager]: Committed virtual memory: 4.20 GB
> 2014-02-14 16:12:55.629+0000 INFO [o.n.k.i.DiagnosticsManager]: Total swap space: 4.88 GB
> 2014-02-14 16:12:55.629+0000 INFO [o.n.k.i.DiagnosticsManager]: Free swap space: 4.88 GB
> 2014-02-14 16:12:55.630+0000 INFO [o.n.k.i.DiagnosticsManager]: JVM memory information:
> 2014-02-14 16:12:55.630+0000 INFO [o.n.k.i.DiagnosticsManager]: Free memory: 139.32 MB
> 2014-02-14 16:12:55.630+0000 INFO [o.n.k.i.DiagnosticsManager]: Total memory: 150.13 MB
> 2014-02-14 16:12:55.630+0000 INFO [o.n.k.i.DiagnosticsManager]: Max memory: 2.36 GB
> 2014-02-14 16:12:55.632+0000 INFO [o.n.k.i.DiagnosticsManager]: Garbage Collector: ParNew: [Par Eden Space, Par Survivor Space]
> 2014-02-14 16:12:55.632+0000 INFO [o.n.k.i.DiagnosticsManager]: Garbage Collector: ConcurrentMarkSweep: [Par Eden Space, Par Survivor Space, CMS Old Gen, CMS Perm Gen]
> 2014-02-14 16:12:55.633+0000 INFO [o.n.k.i.DiagnosticsManager]: Memory Pool: Code Cache (Non-heap memory): committed=2.44 MB, used=688.63 kB, max=48.00 MB, threshold=0.00 B
> 2014-02-14 16:12:55.633+0000 INFO [o.n.k.i.DiagnosticsManager]: Memory Pool: Par Eden Space (Heap memory): committed=41.50 MB, used=3.30 MB, max=532.56 MB, threshold=?
> 2014-02-14 16:12:55.634+0000 INFO [o.n.k.i.DiagnosticsManager]: Memory Pool: Par Survivor Space (Heap memory): committed=5.13 MB, used=5.12 MB, max=66.50 MB, threshold=?
> 2014-02-14 16:12:55.634+0000 INFO [o.n.k.i.DiagnosticsManager]: Memory Pool: CMS Old Gen (Heap memory): committed=103.50 MB, used=2.38 MB, max=1.78 GB, threshold=0.00 B
> 2014-02-14 16:12:55.635+0000 INFO [o.n.k.i.DiagnosticsManager]: Memory Pool: CMS Perm Gen (Non-heap memory): committed=20.75 MB, used=12.67 MB, max=82.00 MB, threshold=0.00 B
> 2014-02-14 16:12:55.635+0000 INFO [o.n.k.i.DiagnosticsManager]: Operating system information:
> 2014-02-14 16:12:55.635+0000 INFO [o.n.k.i.DiagnosticsManager]: Operating System: Linux; version: 3.12.8-300.fc20.x86_64; arch: amd64; cpus: 8
> 2014-02-14 16:12:55.636+0000 INFO [o.n.k.i.DiagnosticsManager]: Max number of file descriptors: 65535
> 2014-02-14 16:12:55.636+0000 INFO [o.n.k.i.DiagnosticsManager]: Number of open file descriptors: 72
> 2014-02-14 16:12:55.636+0000 INFO [o.n.k.i.DiagnosticsManager]: Process id: [email protected]
> 2014-02-14 16:12:55.636+0000 INFO [o.n.k.i.DiagnosticsManager]: Byte order: LITTLE_ENDIAN
> 2014-02-14 16:12:55.636+0000 INFO [o.n.k.i.DiagnosticsManager]: Local timezone: America/Phoenix
> 2014-02-14 16:12:55.636+0000 INFO [o.n.k.i.DiagnosticsManager]: JVM information:
> 2014-02-14 16:12:55.636+0000 INFO [o.n.k.i.DiagnosticsManager]: VM Name: Java HotSpot(TM) 64-Bit Server VM
> 2014-02-14 16:12:55.636+0000 INFO [o.n.k.i.DiagnosticsManager]: VM Vendor: Oracle Corporation
> 2014-02-14 16:12:55.636+0000 INFO [o.n.k.i.DiagnosticsManager]: VM Version: 24.51-b03
>
> Link to the gist: https://gist.github.com/desertnerd/9004094
>
> Any suggestions will be appreciated.
>
> Thank you.
>
> On Saturday, February 1, 2014 2:08:03 PM UTC-7, Michael Hunger wrote:
>
> What is your actual write load?
> How big was your batch size? Currently, for 2.0, 1000 elements is sensible. It will change back to 30-50k for Neo4j 2.1.
>
> #0 Use parameters:
>> MERGE (user:User { name:{user_name} })', 'MERGE (tweet:Tweet { tweet_id:{tweet_id} })
>
> #1 Can you share your server config / memory / disk etc.? (Best to share your data/graph.db/messages.log.)
> #2 Make sure your driver uses the new transactional endpoint and streams data back and forth.
>
> Usually you can insert 5-10k nodes per second in 2.0 with MERGE and parameters in batched tx (1k tx-size).
>
> On 01.02.2014 at 17:51, Yun Wang <[email protected]> wrote:
>
>> Question background
>> We are building a graph (database) for Twitter users and tweets (batched updates for new data).
>> We store as graph nodes: each user and each tweet.
>> We store as graph edges: tweet-tweet relationships and user-user relationships (derived, based on users who retweet or reply to others).
>>
>> Problem: Updating the graph is very slow / not scalable.
>>
>> Goal: Scalable / efficient update of the existing Neo4j graph as new tweets come in (tweets translate to nodes and edges). Constraint: if a node (e.g., a user) already exists, we do not want to duplicate it. Similarly, if an edge (a user-user relationship) exists, we only want to update the edge weight.
>>
>> What we have tried:
>> Option 1: We tried using Cypher's MERGE clause to insert uniquely. We also executed Cypher queries in batches in order to reduce REST latency.
>>
>> Sample Cypher query used to update the database:
>> 'MERGE (user:User { name:'tom' })', 'MERGE (tweet:Tweet { tweet_id:'101' })'
>>
>> We created an index on node attributes like 'name' of the User node and 'tweet_id' of the Tweet node.
>> We increased the 'open file descriptors' parameter to gain better performance in Linux.
>>
>> Problems with Option 1:
>> Performance of the uniqueness check via MERGE dropped dramatically with scale / over time. For example, it took 2.7 seconds to insert 100 records when the database was empty. However, it took 62 seconds to insert the same amount of data with 100,000 existing records.
>>
>> Option 2: The other option we have tried is to check uniqueness externally. That is, we take all nodes and edges and build a hash table outside Neo4j (e.g., in Python or Java) to check uniqueness. This is faster than MERGE over time. However, it does not seem elegant to have to extract the existing nodes before each batch update: it requires a read + write against the Neo4j database instead of only a write.
>>
>> We are wondering if there is an elegant solution for large data updates in Neo4j. We feel this may be a common question for many users, and someone may have previously encountered this and/or developed a robust solution.
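
For reference, here is a minimal sketch of the batched, parameterized MERGE load that Michael describes above (parameters, the transactional endpoint, roughly 1k statements per transaction). It reuses the py2neo 1.6 cypher.Session API from the test script; the CSV layout, the batch size of 1000, the parameter names, and committing each batch in its own transaction are assumptions made for illustration, not a verified fix for the slowdown.

    from py2neo import cypher

    BATCH = 1000  # ~1k statements per transaction, per the advice above (assumed batch size)

    # Transactional Cypher endpoint via py2neo 1.6
    session = cypher.Session("http://localhost:7474")

    # One parameterized statement merges both users and both tweets of a CSV row,
    # so each input line contributes a single statement to the batch.
    statement = (
        "MERGE (tu:User  { name: {tweet_user} }) "
        "MERGE (ru:User  { name: {retweet_user} }) "
        "MERGE (tw:Tweet { tid: {tweet_id} }) "
        "MERGE (rt:Tweet { tid: {retweet_id} })"
    )

    tx = session.create_transaction()
    appended = 0

    with open('sample_data_5k.csv') as infile:  # same sample file as the test script
        for line in infile:
            tweet_id, retweet_id, tweet_user, retweet_user = line.rstrip('\n').split(',')
            tx.append(statement, {
                'tweet_id': tweet_id,
                'retweet_id': retweet_id,
                'tweet_user': tweet_user,
                'retweet_user': retweet_user,
            })
            appended += 1
            if appended % BATCH == 0:
                tx.commit()                        # keep each transaction small
                tx = session.create_transaction()  # start a fresh transaction for the next batch

    if appended % BATCH != 0:
        tx.commit()                                # commit the final partial batch

Compared to the test script, this commits every 1000 statements instead of accumulating all 20,000 MERGEs in one long-running transaction, and it sends one statement per CSV row rather than four; the indexes on :User(name) and :Tweet(tid) would be created separately, before the load, as in the original script.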
