Hi Michael, thank you for your reply.
I ran a test using parameters and the transactional endpoint. The speed was
still not satisfactory.
The test simply loads a CSV file and inserts nodes.
The CSV file contains 5,000 lines, and each line generates four MERGE
operations for node insertion, i.e., 20,000 MERGE statements in total.
It takes 107 seconds to finish the insertion, which works out to roughly 190
statements per second.
This is the Python code of the test:
from py2neo import cypher

if __name__ == '__main__':
    # Use the py2neo transactional Cypher endpoint
    session = cypher.Session("http://localhost:7474")

    # Create the indexes on the two node attributes first, in their own
    # transaction, so the MERGE lookups below can use them
    tx = session.create_transaction()
    tx.append("CREATE INDEX ON :User(name)")
    tx.append("CREATE INDEX ON :Tweet(tid)")
    tx.commit()

    # Each line of data needs 4 MERGE operations,
    # so every 250 lines (1,000 statements) execute the transaction
    size = 250
    cnt = 1

    tx = session.create_transaction()
    with open('sample_data_5k.csv') as infile:
        for line in infile:
            tweetID, retweetID, tweetUser, retweetUser = \
                line.rstrip('\n').split(',')

            # Create the user node of the tweet; pass a fresh parameter
            # dict per statement so statements still queued in the
            # transaction keep their own values
            tx.append('MERGE (user:User { name:{name} })',
                      {'name': tweetUser})
            # Create the user node of the retweet
            tx.append('MERGE (user:User { name:{name} })',
                      {'name': retweetUser})
            # Create the tweet node
            tx.append('MERGE (tweet:Tweet { tid:{tweet} })',
                      {'tweet': tweetID})
            # Create the retweet node
            tx.append('MERGE (tweet:Tweet { tid:{tweet} })',
                      {'tweet': retweetID})

            # Four MERGEs per line, so send a batch every `size` lines;
            # execute() sends the queued statements but keeps the
            # transaction open
            if cnt % size == 0:
                tx.execute()
            cnt += 1

    try:
        tx.commit()
    except cypher.TransactionError as e:
        print("--------------------------------------------")
        print(e.message)
Link to the gist: https://gist.github.com/desertnerd/9004165
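As a follow-up experiment, here is a variant I have sketched but not yet
timed, so please treat it as a guess at what might help rather than a
measured result. It appends a single four-clause MERGE statement per CSV
line, which cuts the 20,000 appended statements down to 5,000, and it
commits and reopens the transaction after each batch instead of holding one
transaction open for the whole file:

from py2neo import cypher

session = cypher.Session("http://localhost:7474")
tx = session.create_transaction()

# One statement per CSV line: four MERGE clauses, four parameters
merge_all = ('MERGE (u1:User { name:{u1} }) '
             'MERGE (u2:User { name:{u2} }) '
             'MERGE (t1:Tweet { tid:{t1} }) '
             'MERGE (t2:Tweet { tid:{t2} })')

size = 250
with open('sample_data_5k.csv') as infile:
    for cnt, line in enumerate(infile, start=1):
        tid, rid, tuser, ruser = line.rstrip('\n').split(',')
        tx.append(merge_all,
                  {'u1': tuser, 'u2': ruser, 't1': tid, 't2': rid})
        if cnt % size == 0:
            # Commit this batch and start a fresh transaction instead of
            # keeping one transaction open across the whole file
            tx.commit()
            tx = session.create_transaction()
# Commit whatever remains after the last full batch
tx.commit()

I have not measured this variant yet, so I cannot say whether it closes the
gap to the 5-10k nodes per second you mention.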
This is part of the messages.log file; I also post a gist link to the full
log below.
2014-02-14 16:12:55.614+0000 INFO [o.n.k.i.DiagnosticsManager]: --- INITIALIZED diagnostics START ---
2014-02-14 16:12:55.618+0000 INFO [o.n.k.i.DiagnosticsManager]: Neo4j Kernel properties:
2014-02-14 16:12:55.624+0000 INFO [o.n.k.i.DiagnosticsManager]: neostore.propertystore.db.mapped_memory=782M
2014-02-14 16:12:55.624+0000 INFO [o.n.k.i.DiagnosticsManager]: neo_store=/home/wang/NEO4J/neo4j-community-2.0.1-RC4/data/graph.db/neostore
2014-02-14 16:12:55.624+0000 INFO [o.n.k.i.DiagnosticsManager]: neostore.nodestore.db.mapped_memory=217M
2014-02-14 16:12:55.624+0000 INFO [o.n.k.i.DiagnosticsManager]: neostore.propertystore.db.strings.mapped_memory=664M
2014-02-14 16:12:55.624+0000 INFO [o.n.k.i.DiagnosticsManager]: neo4j.ext.udc.source=server
2014-02-14 16:12:55.624+0000 INFO [o.n.k.i.DiagnosticsManager]: store_dir=data/graph.db
2014-02-14 16:12:55.624+0000 INFO [o.n.k.i.DiagnosticsManager]: neostore.relationshipstore.db.mapped_memory=958M
2014-02-14 16:12:55.624+0000 INFO [o.n.k.i.DiagnosticsManager]: keep_logical_logs=true
2014-02-14 16:12:55.624+0000 INFO [o.n.k.i.DiagnosticsManager]: neostore.propertystore.db.arrays.mapped_memory=753M
2014-02-14 16:12:55.624+0000 INFO [o.n.k.i.DiagnosticsManager]: remote_shell_enabled=true
2014-02-14 16:12:55.624+0000 INFO [o.n.k.i.DiagnosticsManager]: ephemeral=false
2014-02-14 16:12:55.625+0000 INFO [o.n.k.i.DiagnosticsManager]: Diagnostics providers:
2014-02-14 16:12:55.626+0000 INFO [o.n.k.i.DiagnosticsManager]: org.neo4j.kernel.configuration.Config
2014-02-14 16:12:55.626+0000 INFO [o.n.k.i.DiagnosticsManager]: org.neo4j.kernel.info.DiagnosticsManager
2014-02-14 16:12:55.626+0000 INFO [o.n.k.i.DiagnosticsManager]: SYSTEM_MEMORY
2014-02-14 16:12:55.626+0000 INFO [o.n.k.i.DiagnosticsManager]: JAVA_MEMORY
2014-02-14 16:12:55.626+0000 INFO [o.n.k.i.DiagnosticsManager]: OPERATING_SYSTEM
2014-02-14 16:12:55.626+0000 INFO [o.n.k.i.DiagnosticsManager]: JAVA_VIRTUAL_MACHINE
2014-02-14 16:12:55.626+0000 INFO [o.n.k.i.DiagnosticsManager]: CLASSPATH
2014-02-14 16:12:55.626+0000 INFO [o.n.k.i.DiagnosticsManager]: LIBRARY_PATH
2014-02-14 16:12:55.626+0000 INFO [o.n.k.i.DiagnosticsManager]: SYSTEM_PROPERTIES
2014-02-14 16:12:55.626+0000 INFO [o.n.k.i.DiagnosticsManager]: LINUX_SCHEDULERS
2014-02-14 16:12:55.626+0000 INFO [o.n.k.i.DiagnosticsManager]: NETWORK
2014-02-14 16:12:55.626+0000 INFO [o.n.k.i.DiagnosticsManager]: System memory information:
2014-02-14 16:12:55.628+0000 INFO [o.n.k.i.DiagnosticsManager]: Total Physical memory: 9.70 GB
2014-02-14 16:12:55.629+0000 INFO [o.n.k.i.DiagnosticsManager]: Free Physical memory: 6.95 GB
2014-02-14 16:12:55.629+0000 INFO [o.n.k.i.DiagnosticsManager]: Committed virtual memory: 4.20 GB
2014-02-14 16:12:55.629+0000 INFO [o.n.k.i.DiagnosticsManager]: Total swap space: 4.88 GB
2014-02-14 16:12:55.629+0000 INFO [o.n.k.i.DiagnosticsManager]: Free swap space: 4.88 GB
2014-02-14 16:12:55.630+0000 INFO [o.n.k.i.DiagnosticsManager]: JVM memory information:
2014-02-14 16:12:55.630+0000 INFO [o.n.k.i.DiagnosticsManager]: Free memory: 139.32 MB
2014-02-14 16:12:55.630+0000 INFO [o.n.k.i.DiagnosticsManager]: Total memory: 150.13 MB
2014-02-14 16:12:55.630+0000 INFO [o.n.k.i.DiagnosticsManager]: Max memory: 2.36 GB
2014-02-14 16:12:55.632+0000 INFO [o.n.k.i.DiagnosticsManager]: Garbage Collector: ParNew: [Par Eden Space, Par Survivor Space]
2014-02-14 16:12:55.632+0000 INFO [o.n.k.i.DiagnosticsManager]: Garbage Collector: ConcurrentMarkSweep: [Par Eden Space, Par Survivor Space, CMS Old Gen, CMS Perm Gen]
2014-02-14 16:12:55.633+0000 INFO [o.n.k.i.DiagnosticsManager]: Memory Pool: Code Cache (Non-heap memory): committed=2.44 MB, used=688.63 kB, max=48.00 MB, threshold=0.00 B
2014-02-14 16:12:55.633+0000 INFO [o.n.k.i.DiagnosticsManager]: Memory Pool: Par Eden Space (Heap memory): committed=41.50 MB, used=3.30 MB, max=532.56 MB, threshold=?
2014-02-14 16:12:55.634+0000 INFO [o.n.k.i.DiagnosticsManager]: Memory Pool: Par Survivor Space (Heap memory): committed=5.13 MB, used=5.12 MB, max=66.50 MB, threshold=?
2014-02-14 16:12:55.634+0000 INFO [o.n.k.i.DiagnosticsManager]: Memory Pool: CMS Old Gen (Heap memory): committed=103.50 MB, used=2.38 MB, max=1.78 GB, threshold=0.00 B
2014-02-14 16:12:55.635+0000 INFO [o.n.k.i.DiagnosticsManager]: Memory Pool: CMS Perm Gen (Non-heap memory): committed=20.75 MB, used=12.67 MB, max=82.00 MB, threshold=0.00 B
2014-02-14 16:12:55.635+0000 INFO [o.n.k.i.DiagnosticsManager]: Operating system information:
2014-02-14 16:12:55.635+0000 INFO [o.n.k.i.DiagnosticsManager]: Operating System: Linux; version: 3.12.8-300.fc20.x86_64; arch: amd64; cpus: 8
2014-02-14 16:12:55.636+0000 INFO [o.n.k.i.DiagnosticsManager]: Max number of file descriptors: 65535
2014-02-14 16:12:55.636+0000 INFO [o.n.k.i.DiagnosticsManager]: Number of open file descriptors: 72
2014-02-14 16:12:55.636+0000 INFO [o.n.k.i.DiagnosticsManager]: Process id: [email protected]
2014-02-14 16:12:55.636+0000 INFO [o.n.k.i.DiagnosticsManager]: Byte order: LITTLE_ENDIAN
2014-02-14 16:12:55.636+0000 INFO [o.n.k.i.DiagnosticsManager]: Local timezone: America/Phoenix
2014-02-14 16:12:55.636+0000 INFO [o.n.k.i.DiagnosticsManager]: JVM information:
2014-02-14 16:12:55.636+0000 INFO [o.n.k.i.DiagnosticsManager]: VM Name: Java HotSpot(TM) 64-Bit Server VM
2014-02-14 16:12:55.636+0000 INFO [o.n.k.i.DiagnosticsManager]: VM Vendor: Oracle Corporation
2014-02-14 16:12:55.636+0000 INFO [o.n.k.i.DiagnosticsManager]: VM Version: 24.51-b03
Link to the gist: https://gist.github.com/desertnerd/9004094
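Regarding #1: if the mapped-memory settings shown in the log need adjusting,
my understanding (an assumption based on the 2.0 docs, not on anything in the
log itself) is that they are set in conf/neo4j.properties, and the JVM heap
in conf/neo4j-wrapper.conf, e.g.:

# conf/neo4j.properties -- store-file memory mapping
# (values copied from the messages.log excerpt above)
neostore.nodestore.db.mapped_memory=217M
neostore.relationshipstore.db.mapped_memory=958M
neostore.propertystore.db.mapped_memory=782M
neostore.propertystore.db.strings.mapped_memory=664M
neostore.propertystore.db.arrays.mapped_memory=753M

# conf/neo4j-wrapper.conf -- JVM heap
# (the log shows a max heap of 2.36 GB; 2048 MB below is only an
# illustration, not a recommendation)
wrapper.java.initmemory=2048
wrapper.java.maxmemory=2048

Please let me know if different values would make sense for this workload.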
Any suggestions would be appreciated.
Thank you.
On Saturday, February 1, 2014 2:08:03 PM UTC-7, Michael Hunger wrote:
>
> What is your actual write load?
> How big was your batch size? Currently, for 2.0, 1000 elements per batch
> is sensible. It will change back to 30-50k for Neo4j 2.1.
>
>
> #0 use parameters
>
> MERGE (user:User { name:{user_name} })
> MERGE (tweet:Tweet { tweet_id:{tweet_id} })
>
> #1 can you share your server config / memory / disk etc? (best to share
> your data/graph.db/messages.log)
> #2 Make sure your driver uses the new transactional endpoint and streams
> data back and forth
>
> Usually you can insert 5-10k nodes per second in 2.0 with MERGE and
> parameters in batched tx (1k tx-size)
>
>
>
> On 01.02.2014 at 17:51, Yun Wang <[email protected]> wrote:
>
> *Question background*
> We are building a graph database of Twitter users and tweets, with batched
> updates for new data.
> We store each user and each tweet as a graph node.
> We store tweet-tweet relationships and user-user relationships as graph
> edges (the latter derived from users who retweet or reply to others).
>
> *Problem*: Updating the graph is very slow / not scalable
>
> *Goal*: Scalable / efficient update of the existing Neo4J graph as new
> tweets come in (tweets translate to: nodes, edges). *Constraint*: If a
> node (e.g., user) already exists, we do not want to duplicate it.
> Similarly, if an edge (user-user relationship) exists, we only want to
> update the edge weight.
>
> *What we have tried*:
> *Option 1*: We tried using Cypher's MERGE clause to insert nodes uniquely.
> We also executed Cypher queries in batches in order to reduce REST latency.
>
> Sample Cypher queries used to update the database:
> MERGE (user:User { name:'tom' })
> MERGE (tweet:Tweet { tweet_id:'101' })
>
> We created indexes on the node attributes: 'name' of the User nodes and
> 'tweet_id' of the Tweet nodes.
> We increased the Linux 'open file descriptors' limit to gain better
> performance.
>
> *Problems with Option 1*:
> The performance of checking uniqueness with MERGE dropped dramatically as
> the database grew. For example, it took 2.7 seconds to insert 100 records
> when the database was empty, but 62 seconds to insert the same amount of
> data with 100,000 existing records.
>
> *Option 2*: The other option we tried is to check uniqueness externally.
> That is, we take all nodes and edges and build a hash table outside Neo4j
> (e.g., in Python or Java) to check uniqueness. This remains faster than the
> MERGE approach as the database grows. However, it does not seem elegant to
> have to extract the existing nodes before each batch update: it requires a
> read + write against the Neo4j database, instead of only a write.
>
> We are wondering if there is an elegant solution for large-scale updates
> in Neo4j. We feel this may be a common question for many users, and someone
> may have previously encountered this and/or developed a robust solution.
>
--
You received this message because you are subscribed to the Google Groups
"Neo4j" group.
To unsubscribe from this group and stop receiving emails from it, send an email
to [email protected].
For more options, visit https://groups.google.com/groups/opt_out.