Hi Michael, thank you for your reply.
I ran a test using parameters and the transactional endpoint. The speed was
still not satisfactory.
The test simply loads a CSV file and inserts nodes.
The CSV file contains 5,000 lines, and each line generates four MERGE
operations for node insertion, i.e., 20,000 MERGE statements in total.
It takes 107 seconds to finish the insertion, which works out to roughly 190
statements per second.
This is the Python code of the test:
from py2neo import cypher

if __name__ == '__main__':
    # Use the py2neo transactional Cypher endpoint
    session = cypher.Session("http://localhost:7474")

    # Create the indexes on the two node attributes first, in their own
    # transaction, so the MERGE lookups below can use them
    tx = session.create_transaction()
    tx.append("CREATE INDEX ON :User(name)")
    tx.append("CREATE INDEX ON :Tweet(tid)")
    tx.commit()

    # Each line of data needs 4 MERGE operations,
    # so every 250 lines (1,000 statements) execute the transaction
    size = 250
    cnt = 1

    tx = session.create_transaction()
    with open('sample_data_5k.csv') as infile:
        for line in infile:
            tweetID, retweetID, tweetUser, retweetUser = \
                line.rstrip('\n').split(',')

            # Create the user node of the tweet; pass a fresh parameter
            # dict per statement so statements still queued in the
            # transaction keep their own values
            tx.append('MERGE (user:User { name:{name} })',
                      {'name': tweetUser})
            # Create the user node of the retweet
            tx.append('MERGE (user:User { name:{name} })',
                      {'name': retweetUser})
            # Create the tweet node
            tx.append('MERGE (tweet:Tweet { tid:{tweet} })',
                      {'tweet': tweetID})
            # Create the retweet node
            tx.append('MERGE (tweet:Tweet { tid:{tweet} })',
                      {'tweet': retweetID})

            # Four MERGEs per line, so send a batch every `size` lines;
            # execute() sends the queued statements but keeps the
            # transaction open
            if cnt % size == 0:
                tx.execute()
            cnt += 1

    try:
        tx.commit()
    except cypher.TransactionError as e:
        print("--------------------------------------------")
        print(e.message)
Link to the gist: https://gist.github.com/desertnerd/9004165
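As a follow-up experiment, here is a variant I have sketched but not yet
timed, so please treat it as a guess at what might help rather than a
measured result. It appends a single four-clause MERGE statement per CSV
line, which cuts the 20,000 appended statements down to 5,000, and it
commits and reopens the transaction after each batch instead of holding one
transaction open for the whole file:

from py2neo import cypher

session = cypher.Session("http://localhost:7474")
tx = session.create_transaction()

# One statement per CSV line: four MERGE clauses, four parameters
merge_all = ('MERGE (u1:User { name:{u1} }) '
             'MERGE (u2:User { name:{u2} }) '
             'MERGE (t1:Tweet { tid:{t1} }) '
             'MERGE (t2:Tweet { tid:{t2} })')

size = 250
with open('sample_data_5k.csv') as infile:
    for cnt, line in enumerate(infile, start=1):
        tid, rid, tuser, ruser = line.rstrip('\n').split(',')
        tx.append(merge_all,
                  {'u1': tuser, 'u2': ruser, 't1': tid, 't2': rid})
        if cnt % size == 0:
            # Commit this batch and start a fresh transaction instead of
            # keeping one transaction open across the whole file
            tx.commit()
            tx = session.create_transaction()
# Commit whatever remains after the last full batch
tx.commit()

I have not measured this variant yet, so I cannot say whether it closes the
gap to the 5-10k nodes per second you mention.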
This is part of the messages.log file; I also post a gist link to the full
log below.
2014-02-14 16:12:55.614+0000 INFO [o.n.k.i.DiagnosticsManager]: --- INITIALIZED diagnostics START ---
2014-02-14 16:12:55.618+0000 INFO [o.n.k.i.DiagnosticsManager]: Neo4j Kernel properties:
2014-02-14 16:12:55.624+0000 INFO [o.n.k.i.DiagnosticsManager]: neostore.propertystore.db.mapped_memory=782M
2014-02-14 16:12:55.624+0000 INFO [o.n.k.i.DiagnosticsManager]: neo_store=/home/wang/NEO4J/neo4j-community-2.0.1-RC4/data/graph.db/neostore
2014-02-14 16:12:55.624+0000 INFO [o.n.k.i.DiagnosticsManager]: neostore.nodestore.db.mapped_memory=217M
2014-02-14 16:12:55.624+0000 INFO [o.n.k.i.DiagnosticsManager]: neostore.propertystore.db.strings.mapped_memory=664M
2014-02-14 16:12:55.624+0000 INFO [o.n.k.i.DiagnosticsManager]: neo4j.ext.udc.source=server
2014-02-14 16:12:55.624+0000 INFO [o.n.k.i.DiagnosticsManager]: store_dir=data/graph.db
2014-02-14 16:12:55.624+0000 INFO [o.n.k.i.DiagnosticsManager]: neostore.relationshipstore.db.mapped_memory=958M
2014-02-14 16:12:55.624+0000 INFO [o.n.k.i.DiagnosticsManager]: keep_logical_logs=true
2014-02-14 16:12:55.624+0000 INFO [o.n.k.i.DiagnosticsManager]: neostore.propertystore.db.arrays.mapped_memory=753M
2014-02-14 16:12:55.624+0000 INFO [o.n.k.i.DiagnosticsManager]: remote_shell_enabled=true
2014-02-14 16:12:55.624+0000 INFO [o.n.k.i.DiagnosticsManager]: ephemeral=false
2014-02-14 16:12:55.625+0000 INFO [o.n.k.i.DiagnosticsManager]: Diagnostics providers:
2014-02-14 16:12:55.626+0000 INFO [o.n.k.i.DiagnosticsManager]: org.neo4j.kernel.configuration.Config
2014-02-14 16:12:55.626+0000 INFO [o.n.k.i.DiagnosticsManager]: org.neo4j.kernel.info.DiagnosticsManager
2014-02-14 16:12:55.626+0000 INFO [o.n.k.i.DiagnosticsManager]: SYSTEM_MEMORY
2014-02-14 16:12:55.626+0000 INFO [o.n.k.i.DiagnosticsManager]: JAVA_MEMORY
2014-02-14 16:12:55.626+0000 INFO [o.n.k.i.DiagnosticsManager]: OPERATING_SYSTEM
2014-02-14 16:12:55.626+0000 INFO [o.n.k.i.DiagnosticsManager]: JAVA_VIRTUAL_MACHINE
2014-02-14 16:12:55.626+0000 INFO [o.n.k.i.DiagnosticsManager]: CLASSPATH
2014-02-14 16:12:55.626+0000 INFO [o.n.k.i.DiagnosticsManager]: LIBRARY_PATH
2014-02-14 16:12:55.626+0000 INFO [o.n.k.i.DiagnosticsManager]: SYSTEM_PROPERTIES
2014-02-14 16:12:55.626+0000 INFO [o.n.k.i.DiagnosticsManager]: LINUX_SCHEDULERS
2014-02-14 16:12:55.626+0000 INFO [o.n.k.i.DiagnosticsManager]: NETWORK
2014-02-14 16:12:55.626+0000 INFO [o.n.k.i.DiagnosticsManager]: System memory information:
2014-02-14 16:12:55.628+0000 INFO [o.n.k.i.DiagnosticsManager]: Total Physical memory: 9.70 GB
2014-02-14 16:12:55.629+0000 INFO [o.n.k.i.DiagnosticsManager]: Free Physical memory: 6.95 GB
2014-02-14 16:12:55.629+0000 INFO [o.n.k.i.DiagnosticsManager]: Committed virtual memory: 4.20 GB
2014-02-14 16:12:55.629+0000 INFO [o.n.k.i.DiagnosticsManager]: Total swap space: 4.88 GB
2014-02-14 16:12:55.629+0000 INFO [o.n.k.i.DiagnosticsManager]: Free swap space: 4.88 GB
2014-02-14 16:12:55.630+0000 INFO [o.n.k.i.DiagnosticsManager]: JVM memory information:
2014-02-14 16:12:55.630+0000 INFO [o.n.k.i.DiagnosticsManager]: Free memory: 139.32 MB
2014-02-14 16:12:55.630+0000 INFO [o.n.k.i.DiagnosticsManager]: Total memory: 150.13 MB
2014-02-14 16:12:55.630+0000 INFO [o.n.k.i.DiagnosticsManager]: Max memory: 2.36 GB
2014-02-14 16:12:55.632+0000 INFO [o.n.k.i.DiagnosticsManager]: Garbage Collector: ParNew: [Par Eden Space, Par Survivor Space]
2014-02-14 16:12:55.632+0000 INFO [o.n.k.i.DiagnosticsManager]: Garbage Collector: ConcurrentMarkSweep: [Par Eden Space, Par Survivor Space, CMS Old Gen, CMS Perm Gen]
2014-02-14 16:12:55.633+0000 INFO [o.n.k.i.DiagnosticsManager]: Memory Pool: Code Cache (Non-heap memory): committed=2.44 MB, used=688.63 kB, max=48.00 MB, threshold=0.00 B
2014-02-14 16:12:55.633+0000 INFO [o.n.k.i.DiagnosticsManager]: Memory Pool: Par Eden Space (Heap memory): committed=41.50 MB, used=3.30 MB, max=532.56 MB, threshold=?
2014-02-14 16:12:55.634+0000 INFO [o.n.k.i.DiagnosticsManager]: Memory Pool: Par Survivor Space (Heap memory): committed=5.13 MB, used=5.12 MB, max=66.50 MB, threshold=?
2014-02-14 16:12:55.634+0000 INFO [o.n.k.i.DiagnosticsManager]: Memory Pool: CMS Old Gen (Heap memory): committed=103.50 MB, used=2.38 MB, max=1.78 GB, threshold=0.00 B
2014-02-14 16:12:55.635+0000 INFO [o.n.k.i.DiagnosticsManager]: Memory Pool: CMS Perm Gen (Non-heap memory): committed=20.75 MB, used=12.67 MB, max=82.00 MB, threshold=0.00 B
2014-02-14 16:12:55.635+0000 INFO [o.n.k.i.DiagnosticsManager]: Operating system information:
2014-02-14 16:12:55.635+0000 INFO [o.n.k.i.DiagnosticsManager]: Operating System: Linux; version: 3.12.8-300.fc20.x86_64; arch: amd64; cpus: 8
2014-02-14 16:12:55.636+0000 INFO [o.n.k.i.DiagnosticsManager]: Max number of file descriptors: 65535
2014-02-14 16:12:55.636+0000 INFO [o.n.k.i.DiagnosticsManager]: Number of open file descriptors: 72
2014-02-14 16:12:55.636+0000 INFO [o.n.k.i.DiagnosticsManager]: Process id: [email protected]
2014-02-14 16:12:55.636+0000 INFO [o.n.k.i.DiagnosticsManager]: Byte order: LITTLE_ENDIAN
2014-02-14 16:12:55.636+0000 INFO [o.n.k.i.DiagnosticsManager]: Local timezone: America/Phoenix
2014-02-14 16:12:55.636+0000 INFO [o.n.k.i.DiagnosticsManager]: JVM information:
2014-02-14 16:12:55.636+0000 INFO [o.n.k.i.DiagnosticsManager]: VM Name: Java HotSpot(TM) 64-Bit Server VM
2014-02-14 16:12:55.636+0000 INFO [o.n.k.i.DiagnosticsManager]: VM Vendor: Oracle Corporation
2014-02-14 16:12:55.636+0000 INFO [o.n.k.i.DiagnosticsManager]: VM Version: 24.51-b03
Link to the gist: https://gist.github.com/desertnerd/9004094
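Regarding #1: if the mapped-memory settings shown in the log need adjusting,
my understanding (an assumption based on the 2.0 docs, not on anything in the
log itself) is that they are set in conf/neo4j.properties, and the JVM heap
in conf/neo4j-wrapper.conf, e.g.:

# conf/neo4j.properties -- store-file memory mapping
# (values copied from the messages.log excerpt above)
neostore.nodestore.db.mapped_memory=217M
neostore.relationshipstore.db.mapped_memory=958M
neostore.propertystore.db.mapped_memory=782M
neostore.propertystore.db.strings.mapped_memory=664M
neostore.propertystore.db.arrays.mapped_memory=753M

# conf/neo4j-wrapper.conf -- JVM heap
# (the log shows a max heap of 2.36 GB; 2048 MB below is only an
# illustration, not a recommendation)
wrapper.java.initmemory=2048
wrapper.java.maxmemory=2048

Please let me know if different values would make sense for this workload.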
Any suggestions would be appreciated.
Thank you.
On Saturday, February 1, 2014 2:08:03 PM UTC-7, Michael Hunger wrote:
>
> What is your actual write load?
> How big was your batch size? Currently, for 2.0, 1000 elements per batch
> is sensible. It will change back to 30-50k for Neo4j 2.1.
>
>
> #0 use parameters
>
> MERGE (user:User { name:{user_name} })
> MERGE (tweet:Tweet { tweet_id:{tweet_id} })
>
> #1 can you share your server config / memory / disk etc? (best to share
> your data/graph.db/messages.log)
> #2 Make sure your driver uses the new transactional endpoint and streams
> data back and forth
>
> Usually you can insert 5-10k nodes per second in 2.0 with MERGE and
> parameters in batched tx (1k tx-size)
>
>
>
> On 01.02.2014 at 17:51, Yun Wang <[email protected]> wrote:
>
> *Question background*
> We are building a graph database of Twitter users and tweets, with batched
> updates for new data.
> We store each user and each tweet as a graph node.
> We store tweet-tweet relationships and user-user relationships as graph
> edges (the latter derived from users who retweet or reply to others).
>
> *Problem*: Updating the graph is very slow / not scalable
>
> *Goal*: Scalable / efficient update of the existing Neo4J graph as new
> tweets come in (tweets translate to: nodes, edges). *Constraint*: If a
> node (e.g., user) already exists, we do not want to duplicate it.
> Similarly, if an edge (user-user relationship) exists, we only want to
> update the edge weight.
>
> *What we have tried*:
> *Option 1*: We tried using Cypher's MERGE clause to insert nodes uniquely.
> We also executed Cypher queries in batches in order to reduce REST latency.
>
> Sample Cypher queries used to update the database:
> MERGE (user:User { name:'tom' })
> MERGE (tweet:Tweet { tweet_id:'101' })
>
> We created indexes on the node attributes: 'name' of the User nodes and
> 'tweet_id' of the Tweet nodes.
> We increased the Linux 'open file descriptors' limit to gain better
> performance.
>
> *Problems with Option 1*:
> The performance of checking uniqueness with MERGE dropped dramatically as
> the database grew. For example, it took 2.7 seconds to insert 100 records
> when the database was empty, but 62 seconds to insert the same amount of
> data with 100,000 existing records.
>
> *Option 2*: The other option we tried is to check uniqueness externally.
> That is, we take all nodes and edges and build a hash table outside Neo4j
> (e.g., in Python or Java) to check uniqueness. This remains faster than the
> MERGE approach as the database grows. However, it does not seem elegant to
> have to extract the existing nodes before each batch update: it requires a
> read + write against the Neo4j database, instead of only a write.
>
> We are wondering if there is an elegant solution for large-scale updates
> in Neo4j. We feel this may be a common question for many users, and someone
> may have previously encountered this and/or developed a robust solution.
>
--
You received this message because you are subscribed to the Google Groups
"Neo4j" group.
To unsubscribe from this group and stop receiving emails from it, send an email
to [email protected].
For more options, visit https://groups.google.com/groups/opt_out.