Hi everyone,

I have a use case that appears to expose a memory leak in Neo4j. I've been 
testing with Neo4j 2.1.2 on OS X.

I've created a test case that reproduces the issue consistently and mimics 
the behavior of my real-world application: 
https://gist.github.com/mlaldrid/85a03fc022170561b807. It uses py2neo to 
drive Neo4j's transactional Cypher HTTP endpoint. To make the suspected 
memory leak surface more quickly, I limit Neo4j's max heap to 1 GB; in 
practice I tend to use an 8 GB heap, but the misbehavior still occurs, just 
delayed.

In my real-world application, we need to CREATE millions of primary nodes 
of interest and MERGE in ancillary nodes, since each ancillary node can be 
shared by any number of primary nodes. In the test case the primary nodes 
get the Person label, and the ancillary nodes are labeled Address and 
Phone. A fixed set of Address and Phone nodes is generated and randomly 
attached to Person nodes.
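
Roughly, each Person is loaded with a statement of this shape (paraphrased 
from the gist, not copied verbatim; the property names and relationship 
types here are illustrative):

# Schema indexes are created up front so the MERGEs have an index to hit
# ("Creating label indices" in the log below); property names are illustrative.
CREATE_INDEXES = [
    "CREATE INDEX ON :Address(line)",
    "CREATE INDEX ON :Phone(number)",
]

# Per-Person statement: CREATE the primary node, MERGE the shared ancillary
# nodes, then connect them. Parameter syntax is the Neo4j 2.x {param} style.
STATEMENT = """
CREATE (p:Person {name: {name}})
MERGE (a1:Address {line: {addr1}})
MERGE (a2:Address {line: {addr2}})
MERGE (ph:Phone {number: {phone}})
CREATE (p)-[:LIVES_AT]->(a1)
CREATE (p)-[:LIVES_AT]->(a2)
CREATE (p)-[:HAS_PHONE]->(ph)
"""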

Each Cypher transaction CREATEs 1,000 Person nodes and MERGEs in 2 Address 
nodes and 1 Phone node per Person. Each transaction is opened, has all of 
its statements appended, and is then committed, with no intermediate 
executions against the open transaction.
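
The batching loop is essentially this (a minimal sketch, not the gist 
verbatim; the py2neo transaction API names vary between versions, and this 
assumes the 1.6-style cypher.Session):

import random
from py2neo import cypher

session = cypher.Session("http://localhost:7474")

def load_batch(people, addresses, phones):
    # One transaction per 1000-Person batch: append every statement first,
    # then commit once, with no execute() calls on the open transaction.
    tx = session.create_transaction()
    for person in people:
        tx.append(STATEMENT, {  # STATEMENT is the Cypher shown above
            "name": person["name"],
            "addr1": random.choice(addresses),
            "addr2": random.choice(addresses),
            "phone": random.choice(phones),
        })
    tx.commit()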

This log demonstrates increasingly poor load performance until finally 
Neo4j runs out of heap space and fails a transaction:

% time python neo4j_heap_stress.py
2014-07-01 17:30:07,596 :: __main__ :: Generating fake data ...
2014-07-01 17:30:31,430 :: __main__ :: Creating label indices ...
2014-07-01 17:30:31,992 :: __main__ :: Beginning load ...
2014-07-01 17:33:33,949 :: __main__ :: Finished batch 100
2014-07-01 17:35:49,346 :: __main__ :: Finished batch 200
2014-07-01 17:37:56,856 :: __main__ :: Finished batch 300
2014-07-01 17:40:01,333 :: __main__ :: Finished batch 400
2014-07-01 17:42:04,855 :: __main__ :: Finished batch 500
2014-07-01 17:44:11,104 :: __main__ :: Finished batch 600
2014-07-01 17:46:17,261 :: __main__ :: Finished batch 700
2014-07-01 17:48:21,778 :: __main__ :: Finished batch 800
2014-07-01 17:50:28,206 :: __main__ :: Finished batch 900
2014-07-01 17:52:39,424 :: __main__ :: Finished batch 1000
2014-07-01 17:54:56,618 :: __main__ :: Finished batch 1100
2014-07-01 17:57:22,797 :: __main__ :: Finished batch 1200
2014-07-01 18:02:27,327 :: __main__ :: Finished batch 1300
2014-07-01 18:12:35,143 :: __main__ :: Finished batch 1400
2014-07-01 18:24:16,126 :: __main__ :: Finished batch 1500
2014-07-01 18:38:25,835 :: __main__ :: Finished batch 1600
2014-07-01 18:56:18,826 :: __main__ :: Finished batch 1700
2014-07-01 19:22:00,779 :: __main__ :: Finished batch 1800
2014-07-01 20:17:18,317 :: __main__ :: Finished batch 1900
Traceback (most recent call last):
  File "neo4j_heap_stress.py", line 112, in <module>
    main()
  File "neo4j_heap_stress.py", line 34, in main
    load_batch(fake, addresses, phones)
  File "neo4j_heap_stress.py", line 68, in load_batch
    tx.commit()
  File "/Users/matt/.virtualenvs/dataenv/lib/python2.7/site-packages/py2neo/cypher.py", line 242, in commit
    return self._post(self._commit or self._begin_commit)
  File "/Users/matt/.virtualenvs/dataenv/lib/python2.7/site-packages/py2neo/cypher.py", line 217, in _post
    raise TransactionError.new(error["code"], error["message"])
py2neo.cypher.CouldNotCommit: commit threw exception
python neo4j_heap_stress.py  1351.47s user 48.76s system 10% cpu 3:46:27.77 total

Early in the load process, 100K Person nodes are inserted approximately 
every 2 minutes. After roughly 1.2M Person nodes (batch 1200) are in the 
graph, however, the batch times slow dramatically. This corresponds with 
increased Java garbage collection times; near the end, the GC suspends 
application threads for several seconds at a time. I have captured the GC 
logs for Neo4j if there is any interest in seeing them.

Eventually a transaction fails, and this is in console.log:

Exception in thread "GC-Monitor"
Exception: java.lang.OutOfMemoryError thrown from the 
UncaughtExceptionHandler in thread "GC-Monitor"
Exception in thread "DateCache"
Exception: java.lang.OutOfMemoryError thrown from the 
UncaughtExceptionHandler in thread "DateCache"
Exception in thread "RMI RenewClean-[192.168.1.214:60253]" 
java.lang.OutOfMemoryError: Java heap space

Of all the variables here, the heavy use of MERGE looks to me like the 
culprit behind the performance degradation. Before I changed my 
application's graph model to share ancillary nodes among the primary 
nodes, I was doing a simple CREATE for all nodes, and that scaled nearly 
linearly to tens of millions of nodes in the graph, using the same Cypher 
transaction method via py2neo. Only when I started de-duplicating the 
shared ancillary nodes with MERGE did this behavior arise. Hopefully 
that's not a red herring, but something is almost certainly amiss here.
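
For contrast, the earlier CREATE-only variant of the per-Person statement 
was roughly this (again a paraphrase with illustrative property names, 
since that code predates the gist):

# CREATE-only version: every Address/Phone node is created fresh rather than
# de-duplicated, so no MERGE (and no index lookup) is involved.
CREATE_ONLY_STATEMENT = """
CREATE (p:Person {name: {name}})
CREATE (a1:Address {line: {addr1}})
CREATE (a2:Address {line: {addr2}})
CREATE (ph:Phone {number: {phone}})
CREATE (p)-[:LIVES_AT]->(a1)
CREATE (p)-[:LIVES_AT]->(a2)
CREATE (p)-[:HAS_PHONE]->(ph)
"""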

Please let me know if I can provide any other relevant details to reproduce.

Thanks,
-Matt
