Hey,

I'm developing a prototype for a social network analysis product, using 
Neo4j 2.0.1 and Neo4jClient (.NET).
I'm extremely happy with Neo4J and the Cypher query language, great work! 

Querying and analyzing existing data executes at good rates, and I do not 
see degradation as the database grows.

My problem is that my inserts are too slow to provide the expected user 
experience (even on an empty database), and they get slower as the 
database grows.

My dev machine is an i7 with 16GB RAM, a Samsung 840 SSD, and Windows 7. 
The Neo4j Java heap is configured to take up 8GB (the DB size during tests 
is around 2GB).

When I encountered the problem I was using real data from Facebook but 
since then decided to switch to mockup data so I can reproduce the exact 
same scenario each time.
The basic mockup unit of data that I store is a "Post", composed of:
Post + Post Author + Comment1 + Comment1 Author + Comment2 + Comment2 
Author + Liker1 + Liker2 + LinkedMediaItem
So each data unit I'm saving is 9 nodes + 8 relationships (each node has 
4 short string properties; relationships have no properties).
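To make the shape of the insert concrete, here is a sketch (in Python, purely for illustration; my real client is Neo4jClient/.NET) of how such a single-statement insert could be parameterized. The labels, property names, and relationship types are all invented for the example.

```python
def build_post_statement(post):
    """Build one parameterized Cypher statement (2.0 parameter syntax)
    covering the 9 nodes and 8 relationships of a single Post unit,
    plus its parameter map. Labels/relationship types are hypothetical."""
    cypher = (
        "MERGE (author:Person {id: {authorId}}) "
        "MERGE (c1a:Person {id: {c1AuthorId}}) "
        "MERGE (c2a:Person {id: {c2AuthorId}}) "
        "MERGE (l1:Person {id: {liker1Id}}) "
        "MERGE (l2:Person {id: {liker2Id}}) "
        "MERGE (m:MediaItem {id: {mediaId}}) "
        "MERGE (p:Post {id: {postId}}) "
        "MERGE (c1:Comment {id: {c1Id}}) "
        "MERGE (c2:Comment {id: {c2Id}}) "
        "CREATE UNIQUE (author)-[:WROTE]->(p) "            # 1 rel
        "CREATE UNIQUE (c1a)-[:WROTE]->(c1)-[:ON]->(p) "   # 2 rels
        "CREATE UNIQUE (c2a)-[:WROTE]->(c2)-[:ON]->(p) "   # 2 rels
        "CREATE UNIQUE (l1)-[:LIKES]->(p) "                # 1 rel
        "CREATE UNIQUE (l2)-[:LIKES]->(p) "                # 1 rel
        "CREATE UNIQUE (p)-[:LINKS_TO]->(m)"               # 1 rel -> 8 total
    )
    params = {k: post[k] for k in (
        "postId", "authorId", "c1Id", "c1AuthorId",
        "c2Id", "c2AuthorId", "liker1Id", "liker2Id", "mediaId")}
    return cypher, params
```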

I was able to insert 19,636 posts in 6,330 seconds; that works out to about 
28 nodes/sec and 25 rel/sec (~53 elements/sec).

This data insertion is NOT an initial data load; it is a normal day-to-day 
action of the application, initiated on a regular basis whenever a user 
runs a search on live social networks.
Since a normal search using the system is supposed to produce about this 
many posts (~20,000), 1.7 hours (6,330s) is not a reasonable time for the 
user to wait for the data to be written to the DB.

- The 9 nodes and 8 relationships per Post are added using a *single* 
Cypher query with multiple MERGE and CREATE UNIQUE clauses.
- All nodes are :labeled appropriately.
- I am using Neo4J 2.0 schema based indexes.
- I am using named parameters, no edit of Cypher strings.

I read everything I could find online regarding Neo4j inserts and was left 
confused.
I fully acknowledge that the best performance comes from embedded mode, 
but for now I am stuck with .NET, plus I like Cypher and believe in HTTP 
access too much.
I think I know the answer to some of the questions below, but I would like 
to be reassured, and I also think it could be beneficial for many 
newcomers to Neo4j to have all of these answered in one place, so here we 
go:

1. Is this a bad insert rate to be getting, considering I am using Cypher 
over REST? (I saw examples online with both worse and better rates.)

2. Is there a chance that NOT creating all 9 nodes and 8 relationships of 
a single Post in one Cypher query would make it go faster?

3. Is the *main reason* my inserts are relatively slow on an *empty* DB 
that each request uses a transaction of its own, i.e. a transaction is 
opened and closed on each request?

4. "Batch Operations" 
<http://docs.neo4j.org/chunked/stable/rest-api-batch-ops.html> execute all 
the queries sent under a single transaction, right? So the 2 reasons why I 
should expect better performance when using Batch Operations are (assuming 
1 batch):

A. A single REST call for all queries.

B. A single transaction for many queries.

Right?
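If it helps to picture it, the body that the batch endpoint (POST /db/data/batch) expects is just a JSON array of jobs, here each one a request against the legacy /cypher endpoint. A small Python sketch (the helper itself is made up; the job fields follow the 2.0 REST docs):

```python
import json

def build_batch_payload(statements):
    """statements: a list of (cypher, params) pairs.
    Returns the JSON body for POST /db/data/batch: one job per query,
    all executed in a single HTTP request and a single transaction."""
    jobs = [{"method": "POST",
             "to": "/cypher",
             "body": {"query": query, "params": params},
             "id": i}
            for i, (query, params) in enumerate(statements)]
    return json.dumps(jobs)
```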

 
5. How do I determine the proper batch size? By testing? Is the generally 
recommended size indeed ~20,000-50,000 commands 
<https://groups.google.com/forum/#!topic/neo4j/YpeewfD8-Is>?

6. When using the Transactional HTTP endpoint 
<http://docs.neo4j.org/chunked/stable/rest-api-transactional.html>, should 
I expect better performance because all my queries run under a single 
transaction? (But unlike batch operations, I will have multiple HTTP 
calls, one per query?)
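(Though, re-reading the endpoint docs, a single POST can carry a *list* of statements, so maybe it needn't be one HTTP call per query after all. Sketching the request body for POST /db/data/transaction/commit in Python, with an invented helper name:)

```python
import json

def build_tx_payload(statements):
    """statements: a list of (cypher, params) pairs.
    Body for POST /db/data/transaction/commit: every statement in the
    list runs in the same transaction, within one HTTP call."""
    return json.dumps({"statements": [
        {"statement": query, "parameters": params}
        for query, params in statements]})
```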

7. Assuming I am using Neo4j 2.0.1, which route should I take to improve 
insert performance: batch operations or the transactional endpoint? Any 
clear-cut scenarios for using each one?

8. I am aware of Michael's CSV batch import 
<https://github.com/jexp/batch-import>, but as far as I can tell it is 
really designed for initial loads and is not suitable for day-to-day 
operation, right?

9. I read Tatham (Neo4jClient) saying 
<http://hg.readify.net/neo4jclient/issue/5/batch-support> that "mutable 
Cypher makes batching rather redundant" because you can "Just send a 
single Cypher call that does all your creates, updates and deletes in one 
hit", but that's not very practical coding-wise, as it creates very large 
and complex queries that also fail once they reach a certain size. I can 
create a single Post (9 nodes, 8 relationships) using this method, but I 
am still left with 1 REST call and 1 transaction per Post.
Am I missing something regarding the ability to batch things under 1 
request/transaction using Neo4jClient?
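One middle ground I may still try: pass many posts as a single collection parameter and iterate server-side with FOREACH (UNWIND only arrives in 2.1). That keeps one REST call and one transaction for a whole batch of posts without string-concatenating a giant query. A Python sketch, abbreviated to two of the nine nodes per post; whether MERGE-inside-FOREACH performs well here is exactly the kind of thing I'd have to benchmark:

```python
def build_multi_post_statement(posts):
    """posts: a list of maps, one per Post unit. One statement, one
    collection parameter, however many posts. Labels/properties invented;
    abbreviated to the Post and its Author."""
    cypher = (
        "FOREACH (row IN {posts} | "
        "  MERGE (author:Person {id: row.authorId}) "
        "  MERGE (p:Post {id: row.postId}) "
        "  CREATE UNIQUE (author)-[:WROTE]->(p))"
    )
    return cypher, {"posts": posts}
```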

10. Do you know of any .NET library that implements Batch Operations?

11. Except for Cypher.NET 
<http://mtranter.com/2013/09/21/cypher-net-a-neo4j-cypher-api/>, do you 
know of any other library that implements the transactional endpoint?

12. Is there a programmatic way for me to make sure all my MATCHes and 
MERGEs are using a schema index and not doing a full scan? (i.e. making 
sure I didn't miss creating an index)
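Not fully programmatic, but GET /db/data/schema/index returns the indexes that exist (a list of {"label": ..., "property_keys": [...]} entries in 2.0), so a startup check could at least diff that against the label/property pairs my queries rely on. A sketch, where the expected set is whatever the model uses:

```python
def missing_indexes(expected, index_listing):
    """expected: set of (label, property) pairs the MATCH/MERGE clauses
    rely on. index_listing: parsed JSON from GET /db/data/schema/index.
    Returns the expected pairs that have no matching index."""
    present = {(entry["label"], prop)
               for entry in index_listing
               for prop in entry["property_keys"]}
    return expected - present
```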

13. At no point did I find that using parallel threads helps with write 
performance. I thought I was getting an improved write rate when adding my 
mockup Post data in 8 parallel loops as suggested here 
<http://architects.dzone.com/articles/intensive-analysis-neo4j-java> 
(kinda old), but in fact write speed just declined faster. It looks like I 
need to solve the problem in a single thread first before considering 
increasing the parallelism.
 
14. I read in the performance guide 
<http://docs.neo4j.org/chunked/stable/performance-guide.html> that: 
"to get maximum write performance when using Neo4j make sure the OS is 
configured not to write out any of the dirty pages caused by writes to the 
memory mapped regions of the store files"

If I run a ~2GB database on a 16GB machine with the following settings: 
neostore.nodestore.db.mapped_memory=4G
neostore.relationshipstore.db.mapped_memory=4G
neostore.propertystore.db.mapped_memory=2G
neostore.propertystore.db.strings.mapped_memory=2G
neostore.propertystore.db.arrays.mapped_memory=130M
wrapper.java.initmemory=4096
wrapper.java.maxmemory=8192
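Sanity-checking my own numbers, though: the mapped-memory regions plus the maximum heap add up to more than the machine's 16GB (and, if I remember the 2.0 docs correctly, on Windows the mapped memory is taken from inside the Java heap, in which case ~12GB of mappings can't fit in an 8GB heap at all). Quick arithmetic:

```python
# Values from the settings above, in GB.
mapped_gb = 4.0 + 4.0 + 2.0 + 2.0 + 0.13   # node, rel, property, strings, arrays
heap_max_gb = 8192 / 1024.0                # wrapper.java.maxmemory=8192 (MB)
total_gb = mapped_gb + heap_max_gb
print(round(total_gb, 2))                  # -> 20.13, well over 16GB of RAM
```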

Can I assume that excessive writing of dirty pages is NOT the reason my 
write performance is low?

What settings can I change in Neo4J or in Windows to avoid excessive dirty 
page writes ?

15. Are there any other "production-system quality" methods to insert many 
nodes at once?

16. As I mentioned, I am getting good performance reading and analyzing 
existing data, and having gone through several other graph DBs before, I 
am very happy with the rapid development Neo4j and Cypher enable me to 
achieve. Yet my colleagues are asking why we are investing in Cypher over 
REST instead of embedded mode; can I stand tall and explain to them that 
the Neo4j team sees Cypher as "our future API" 
<http://www.rene-pickhardt.de/get-the-full-neo4j-power-by-using-the-core-java-api-for-traversing-your-graph-data-base-instead-of-cypher-query-language/>? 
:-)

I'm already feeling much better just from venting all these questions out 
of my head and into writing :-)
Thanks,
Ben.

-- 
You received this message because you are subscribed to the Google Groups 
"Neo4j" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
For more options, visit https://groups.google.com/d/optout.
