Hi, that's really slow, something is off. Do you by chance also have the CSV available for testing?
Also, what version is this? neo4j-community-2.0.1-RC4? There is no RC4.

Michael

On 14.02.2014 at 17:45, Yun Wang <[email protected]> wrote:

> Hi Michael, thank you for your reply.
> I did a test using parameters and the transactional endpoint. The speed was still not satisfying.
>
> It simply loads a CSV and inserts nodes.
> The CSV file contains 5,000 lines, and each line generates four MERGE operations for node insertion.
> It takes 107 secs to finish the insertion.
>
> This is the Python code of the test:
>
> from py2neo import neo4j
> from py2neo import node, rel
> from py2neo import cypher
>
> if __name__ == '__main__':
>
>     # Use py2neo transaction
>     session = cypher.Session("http://localhost:7474")
>     tx = session.create_transaction()
>
>     # Python dictionaries for parameters
>     user = {'name': ''}
>     tweet = {'tweet': ''}
>
>     # Each line of data needs 4 MERGE operations,
>     # so every 250 lines, execute the transaction
>     size = 250
>     cnt = 1
>
>     # Create index on two node attributes
>     tx.append("CREATE INDEX ON :User(name)")
>     tx.append("CREATE INDEX ON :Tweet(tid)")
>     tx.commit()
>
>     tx = session.create_transaction()
>     with open('sample_data_5k.csv') as input:
>         for line in input.readlines():
>
>             row = line[:-1].split(',')
>
>             tweetID = row[0]
>             retweetID = row[1]
>             tweetUser = row[2]
>             retweetUser = row[3]
>
>             # Create user node of tweet
>             createTweetUser = 'MERGE (user:User { name:{name} })'
>             user['name'] = tweetUser
>             tx.append(createTweetUser, user)
>
>             # Create user node of retweet
>             createRetweetUser = 'MERGE (user:User { name:{name} })'
>             user['name'] = retweetUser
>             tx.append(createRetweetUser, user)
>
>             # Create tweet node
>             createTweet = 'MERGE (tweet:Tweet { tid:{tweet} })'
>             tweet['tweet'] = tweetID
>             tx.append(createTweet, tweet)
>
>             # Create retweet node
>             createRetweet = 'MERGE (tweet:Tweet { tid:{tweet} })'
>             tweet['tweet'] = retweetID
>             tx.append(createRetweet, tweet)
>
>             # Four MERGE statements per line, so execute every 250 lines
>             if cnt % 250 == 0:
>                 tx.execute()
>             cnt += 1
>
>     try:
>         tx.commit()
>     except cypher.TransactionError as e:
>         print("--------------------------------------------")
>         print(e.message)
>
> Link to the gist: https://gist.github.com/desertnerd/9004165
>
> This is part of the messages.log file; I also post a gist link to the full log:
>
> 2014-02-14 16:12:55.614+0000 INFO [o.n.k.i.DiagnosticsManager]: --- INITIALIZED diagnostics START ---
> 2014-02-14 16:12:55.618+0000 INFO [o.n.k.i.DiagnosticsManager]: Neo4j Kernel properties:
> 2014-02-14 16:12:55.624+0000 INFO [o.n.k.i.DiagnosticsManager]: neostore.propertystore.db.mapped_memory=782M
> 2014-02-14 16:12:55.624+0000 INFO [o.n.k.i.DiagnosticsManager]: neo_store=/home/wang/NEO4J/neo4j-community-2.0.1-RC4/data/graph.db/neostore
> 2014-02-14 16:12:55.624+0000 INFO [o.n.k.i.DiagnosticsManager]: neostore.nodestore.db.mapped_memory=217M
> 2014-02-14 16:12:55.624+0000 INFO [o.n.k.i.DiagnosticsManager]: neostore.propertystore.db.strings.mapped_memory=664M
> 2014-02-14 16:12:55.624+0000 INFO [o.n.k.i.DiagnosticsManager]: neo4j.ext.udc.source=server
> 2014-02-14 16:12:55.624+0000 INFO [o.n.k.i.DiagnosticsManager]: store_dir=data/graph.db
> 2014-02-14 16:12:55.624+0000 INFO [o.n.k.i.DiagnosticsManager]: neostore.relationshipstore.db.mapped_memory=958M
> 2014-02-14 16:12:55.624+0000 INFO [o.n.k.i.DiagnosticsManager]: keep_logical_logs=true
> 2014-02-14 16:12:55.624+0000 INFO [o.n.k.i.DiagnosticsManager]: neostore.propertystore.db.arrays.mapped_memory=753M
> 2014-02-14 16:12:55.624+0000 INFO [o.n.k.i.DiagnosticsManager]: remote_shell_enabled=true
> 2014-02-14 16:12:55.624+0000 INFO [o.n.k.i.DiagnosticsManager]: ephemeral=false
> 2014-02-14 16:12:55.625+0000 INFO [o.n.k.i.DiagnosticsManager]: Diagnostics providers:
> 2014-02-14 16:12:55.626+0000 INFO [o.n.k.i.DiagnosticsManager]: org.neo4j.kernel.configuration.Config
> 2014-02-14 16:12:55.626+0000 INFO [o.n.k.i.DiagnosticsManager]: org.neo4j.kernel.info.DiagnosticsManager
> 2014-02-14 16:12:55.626+0000 INFO [o.n.k.i.DiagnosticsManager]: SYSTEM_MEMORY
> 2014-02-14 16:12:55.626+0000 INFO [o.n.k.i.DiagnosticsManager]: JAVA_MEMORY
> 2014-02-14 16:12:55.626+0000 INFO [o.n.k.i.DiagnosticsManager]: OPERATING_SYSTEM
> 2014-02-14 16:12:55.626+0000 INFO [o.n.k.i.DiagnosticsManager]: JAVA_VIRTUAL_MACHINE
> 2014-02-14 16:12:55.626+0000 INFO [o.n.k.i.DiagnosticsManager]: CLASSPATH
> 2014-02-14 16:12:55.626+0000 INFO [o.n.k.i.DiagnosticsManager]: LIBRARY_PATH
> 2014-02-14 16:12:55.626+0000 INFO [o.n.k.i.DiagnosticsManager]: SYSTEM_PROPERTIES
> 2014-02-14 16:12:55.626+0000 INFO [o.n.k.i.DiagnosticsManager]: LINUX_SCHEDULERS
> 2014-02-14 16:12:55.626+0000 INFO [o.n.k.i.DiagnosticsManager]: NETWORK
> 2014-02-14 16:12:55.626+0000 INFO [o.n.k.i.DiagnosticsManager]: System memory information:
> 2014-02-14 16:12:55.628+0000 INFO [o.n.k.i.DiagnosticsManager]: Total Physical memory: 9.70 GB
> 2014-02-14 16:12:55.629+0000 INFO [o.n.k.i.DiagnosticsManager]: Free Physical memory: 6.95 GB
> 2014-02-14 16:12:55.629+0000 INFO [o.n.k.i.DiagnosticsManager]: Committed virtual memory: 4.20 GB
> 2014-02-14 16:12:55.629+0000 INFO [o.n.k.i.DiagnosticsManager]: Total swap space: 4.88 GB
> 2014-02-14 16:12:55.629+0000 INFO [o.n.k.i.DiagnosticsManager]: Free swap space: 4.88 GB
> 2014-02-14 16:12:55.630+0000 INFO [o.n.k.i.DiagnosticsManager]: JVM memory information:
> 2014-02-14 16:12:55.630+0000 INFO [o.n.k.i.DiagnosticsManager]: Free memory: 139.32 MB
> 2014-02-14 16:12:55.630+0000 INFO [o.n.k.i.DiagnosticsManager]: Total memory: 150.13 MB
> 2014-02-14 16:12:55.630+0000 INFO [o.n.k.i.DiagnosticsManager]: Max memory: 2.36 GB
> 2014-02-14 16:12:55.632+0000 INFO [o.n.k.i.DiagnosticsManager]: Garbage Collector: ParNew: [Par Eden Space, Par Survivor Space]
> 2014-02-14 16:12:55.632+0000 INFO [o.n.k.i.DiagnosticsManager]: Garbage Collector: ConcurrentMarkSweep: [Par Eden Space, Par Survivor Space, CMS Old Gen, CMS Perm Gen]
> 2014-02-14 16:12:55.633+0000 INFO [o.n.k.i.DiagnosticsManager]: Memory Pool: Code Cache (Non-heap memory): committed=2.44 MB, used=688.63 kB, max=48.00 MB, threshold=0.00 B
> 2014-02-14 16:12:55.633+0000 INFO [o.n.k.i.DiagnosticsManager]: Memory Pool: Par Eden Space (Heap memory): committed=41.50 MB, used=3.30 MB, max=532.56 MB, threshold=?
> 2014-02-14 16:12:55.634+0000 INFO [o.n.k.i.DiagnosticsManager]: Memory Pool: Par Survivor Space (Heap memory): committed=5.13 MB, used=5.12 MB, max=66.50 MB, threshold=?
> 2014-02-14 16:12:55.634+0000 INFO [o.n.k.i.DiagnosticsManager]: Memory Pool: CMS Old Gen (Heap memory): committed=103.50 MB, used=2.38 MB, max=1.78 GB, threshold=0.00 B
> 2014-02-14 16:12:55.635+0000 INFO [o.n.k.i.DiagnosticsManager]: Memory Pool: CMS Perm Gen (Non-heap memory): committed=20.75 MB, used=12.67 MB, max=82.00 MB, threshold=0.00 B
> 2014-02-14 16:12:55.635+0000 INFO [o.n.k.i.DiagnosticsManager]: Operating system information:
> 2014-02-14 16:12:55.635+0000 INFO [o.n.k.i.DiagnosticsManager]: Operating System: Linux; version: 3.12.8-300.fc20.x86_64; arch: amd64; cpus: 8
> 2014-02-14 16:12:55.636+0000 INFO [o.n.k.i.DiagnosticsManager]: Max number of file descriptors: 65535
> 2014-02-14 16:12:55.636+0000 INFO [o.n.k.i.DiagnosticsManager]: Number of open file descriptors: 72
> 2014-02-14 16:12:55.636+0000 INFO [o.n.k.i.DiagnosticsManager]: Process id: [email protected]
> 2014-02-14 16:12:55.636+0000 INFO [o.n.k.i.DiagnosticsManager]: Byte order: LITTLE_ENDIAN
> 2014-02-14 16:12:55.636+0000 INFO [o.n.k.i.DiagnosticsManager]: Local timezone: America/Phoenix
> 2014-02-14 16:12:55.636+0000 INFO [o.n.k.i.DiagnosticsManager]: JVM information:
> 2014-02-14 16:12:55.636+0000 INFO [o.n.k.i.DiagnosticsManager]: VM Name: Java HotSpot(TM) 64-Bit Server VM
> 2014-02-14 16:12:55.636+0000 INFO [o.n.k.i.DiagnosticsManager]: VM Vendor: Oracle Corporation
> 2014-02-14 16:12:55.636+0000 INFO [o.n.k.i.DiagnosticsManager]: VM Version: 24.51-b03
>
> Link to the gist: https://gist.github.com/desertnerd/9004094
>
> Any suggestions will be appreciated.
>
> Thank you.
>
> On Saturday, February 1, 2014 2:08:03 PM UTC-7, Michael Hunger wrote:
>
> What is your actual write load?
> How big was your batch size? Currently, for 2.0, 1000 elements is sensible. It will change back to 30-50k for Neo4j 2.1.
>
> #0 Use parameters:
>> MERGE (user:User { name:{user_name} })', 'MERGE (tweet:Tweet { tweet_id:{tweet_id} })
>
> #1 Can you share your server config / memory / disk etc.? (Best to share your data/graph.db/messages.log.)
> #2 Make sure your driver uses the new transactional endpoint and streams data back and forth.
>
> Usually you can insert 5-10k nodes per second in 2.0 with MERGE and parameters in batched tx (1k tx-size).
>
> On 01.02.2014 at 17:51, Yun Wang <[email protected]> wrote:
>
>> Question background
>> We are building a graph (database) for Twitter users and tweets (batched updates for new data).
>> We store as graph nodes: each user and each tweet.
>> We store as graph edges: tweet-tweet relationships and user-user relationships (derived, based on users who retweet or reply to others).
>>
>> Problem: Updating the graph is very slow / not scalable.
>>
>> Goal: Scalable / efficient update of the existing Neo4j graph as new tweets come in (tweets translate to nodes and edges). Constraint: if a node (e.g., a user) already exists, we do not want to duplicate it. Similarly, if an edge (a user-user relationship) exists, we only want to update the edge weight.
>>
>> What we have tried:
>> Option 1: We tried using Cypher's MERGE clause to insert uniquely. We also executed Cypher queries in batches in order to reduce REST latency.
>>
>> Sample Cypher query used to update the database:
>> 'MERGE (user:User { name:'tom' })', 'MERGE (tweet:Tweet { tweet_id:'101' })'
>>
>> We created an index on node attributes like 'name' of the User node and 'tweet_id' of the Tweet node.
>> We increased the 'open file descriptors' parameter to gain better performance in Linux.
>>
>> Problems with Option 1:
>> Performance of the uniqueness check via MERGE dropped dramatically with scale / over time. For example, it took 2.7 seconds to insert 100 records when the database was empty. However, it took 62 seconds to insert the same amount of data with 100,000 existing records.
>>
>> Option 2: The other option we have tried is to check uniqueness externally. That is, we take all nodes and edges and build a hash table outside Neo4j (e.g., in Python or Java) to check uniqueness. This is faster than MERGE over time. However, it does not seem elegant to have to extract the existing nodes before each batch update: it requires a read + write against the Neo4j database instead of only a write.
>>
>> We are wondering if there is an elegant solution for large data updates in Neo4j. We feel this may be a common question for many users, and someone may have previously encountered this and/or developed a robust solution.
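
For reference, here is a minimal sketch of the batched, parameterized MERGE load that Michael describes above (parameters, the transactional endpoint, roughly 1k statements per transaction). It reuses the py2neo 1.6 cypher.Session API from the test script; the CSV layout, the batch size of 1000, the parameter names, and committing each batch in its own transaction are assumptions made for illustration, not a verified fix for the slowdown.

    from py2neo import cypher

    BATCH = 1000  # ~1k statements per transaction, per the advice above (assumed batch size)

    # Transactional Cypher endpoint via py2neo 1.6
    session = cypher.Session("http://localhost:7474")

    # One parameterized statement merges both users and both tweets of a CSV row,
    # so each input line contributes a single statement to the batch.
    statement = (
        "MERGE (tu:User  { name: {tweet_user} }) "
        "MERGE (ru:User  { name: {retweet_user} }) "
        "MERGE (tw:Tweet { tid: {tweet_id} }) "
        "MERGE (rt:Tweet { tid: {retweet_id} })"
    )

    tx = session.create_transaction()
    appended = 0

    with open('sample_data_5k.csv') as infile:  # same sample file as the test script
        for line in infile:
            tweet_id, retweet_id, tweet_user, retweet_user = line.rstrip('\n').split(',')
            tx.append(statement, {
                'tweet_id': tweet_id,
                'retweet_id': retweet_id,
                'tweet_user': tweet_user,
                'retweet_user': retweet_user,
            })
            appended += 1
            if appended % BATCH == 0:
                tx.commit()                        # keep each transaction small
                tx = session.create_transaction()  # start a fresh transaction for the next batch

    if appended % BATCH != 0:
        tx.commit()                                # commit the final partial batch

Compared to the test script, this commits every 1000 statements instead of accumulating all 20,000 MERGEs in one long-running transaction, and it sends one statement per CSV row rather than four; the indexes on :User(name) and :Tweet(tid) would be created separately, before the load, as in the original script.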
