Hey,

I'm developing a prototype for a social network analysis product, using 
Neo4j 2.0.1 and Neo4jClient (.NET).
I'm extremely happy with Neo4J and the Cypher query language, great work! 

Querying and analyzing existing data executes at good rates, and I do not 
see degradation as the database grows.

My problem is that my inserts are too slow to provide the expected user 
experience (even on an empty database), and they get slower as the 
database grows.

My dev machine is an i7 with 16GB RAM, a Samsung 840 SSD, and Windows 7. 
The Neo4j Java heap is configured to take up 8GB (the DB size during tests 
is around 2GB).

When I encountered the problem I was using real data from Facebook but 
since then decided to switch to mockup data so I can reproduce the exact 
same scenario each time.
The basic mockup unit of data that I store is a "Post", composed of:
Post + Post Author + Comment1 + Comment1 Author + Comment2 + Comment2 
Author + Liker1 + Liker2 + LinkedMediaItem
So each data unit I'm saving is 9 nodes + 8 relationships (each node has 
4 short string properties; relationships have no properties).
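To make the shape of the insert concrete, here is a sketch (in Python, purely for illustration; my real client is Neo4jClient/.NET) of how such a single-statement insert could be parameterized. The labels, property names, and relationship types are all invented for the example.

```python
def build_post_statement(post):
    """Build one parameterized Cypher statement (2.0 parameter syntax)
    covering the 9 nodes and 8 relationships of a single Post unit,
    plus its parameter map. Labels/relationship types are hypothetical."""
    cypher = (
        "MERGE (author:Person {id: {authorId}}) "
        "MERGE (c1a:Person {id: {c1AuthorId}}) "
        "MERGE (c2a:Person {id: {c2AuthorId}}) "
        "MERGE (l1:Person {id: {liker1Id}}) "
        "MERGE (l2:Person {id: {liker2Id}}) "
        "MERGE (m:MediaItem {id: {mediaId}}) "
        "MERGE (p:Post {id: {postId}}) "
        "MERGE (c1:Comment {id: {c1Id}}) "
        "MERGE (c2:Comment {id: {c2Id}}) "
        "CREATE UNIQUE (author)-[:WROTE]->(p) "            # 1 rel
        "CREATE UNIQUE (c1a)-[:WROTE]->(c1)-[:ON]->(p) "   # 2 rels
        "CREATE UNIQUE (c2a)-[:WROTE]->(c2)-[:ON]->(p) "   # 2 rels
        "CREATE UNIQUE (l1)-[:LIKES]->(p) "                # 1 rel
        "CREATE UNIQUE (l2)-[:LIKES]->(p) "                # 1 rel
        "CREATE UNIQUE (p)-[:LINKS_TO]->(m)"               # 1 rel -> 8 total
    )
    params = {k: post[k] for k in (
        "postId", "authorId", "c1Id", "c1AuthorId",
        "c2Id", "c2AuthorId", "liker1Id", "liker2Id", "mediaId")}
    return cypher, params
```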

I was able to insert 19,636 posts in 6,330 seconds; that works out to about 
28 nodes/sec and 25 rel/sec (~53 elements/sec).

This data insertion is NOT an initial data load; it is a normal day-to-day 
action of the application, initiated on a regular basis whenever a user 
runs a search on live social networks.
Since a normal search using the system is supposed to produce about this 
many posts (~20,000), 1.7 hours (6,330s) is not a reasonable time for the 
user to wait for the data to be written to the DB.

- The 9 nodes and 8 relationships per Post are added using a *single* 
Cypher query with multiple MERGE and CREATE UNIQUE clauses.
- All nodes are :labeled appropriately.
- I am using Neo4J 2.0 schema based indexes.
- I am using named parameters, no edit of Cypher strings.

I read everything I could find online regarding Neo4j inserts and was left 
confused.
I fully acknowledge that the best performance comes from embedded mode, 
but for now I am stuck with .NET, plus I like Cypher and believe in HTTP 
access too much.
I think I know the answer to some of the questions below, but I would like 
to be reassured, and I also think it could be beneficial for many 
newcomers to Neo4j to have all of these answered in one place, so here we 
go:

1. Is this a bad insert rate to be getting, considering I am using Cypher 
over REST? (I saw examples online with both worse and better rates.)

2. Is there a chance that NOT creating all 9 nodes and 8 relationships of 
a single Post in one Cypher query would make it go faster?

3. Is the *main reason* my inserts are relatively slow on an *empty* DB 
that each request uses a transaction of its own, i.e. a transaction is 
opened and closed on each request?

4. "Batch Operations" 
<http://docs.neo4j.org/chunked/stable/rest-api-batch-ops.html> execute all 
the queries sent under a single transaction, right? So the 2 reasons why I 
should expect better performance when using Batch Operations are (assuming 
1 batch):

A. A single REST call for all queries.

B. A single transaction for many queries.

Right?
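If it helps to picture it, the body that the batch endpoint (POST /db/data/batch) expects is just a JSON array of jobs, here each one a request against the legacy /cypher endpoint. A small Python sketch (the helper itself is made up; the job fields follow the 2.0 REST docs):

```python
import json

def build_batch_payload(statements):
    """statements: a list of (cypher, params) pairs.
    Returns the JSON body for POST /db/data/batch: one job per query,
    all executed in a single HTTP request and a single transaction."""
    jobs = [{"method": "POST",
             "to": "/cypher",
             "body": {"query": query, "params": params},
             "id": i}
            for i, (query, params) in enumerate(statements)]
    return json.dumps(jobs)
```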

 
5. How do I determine the proper batch size? By testing? Is the generally 
recommended size indeed ~20,000-50,000 commands 
<https://groups.google.com/forum/#!topic/neo4j/YpeewfD8-Is>?

6. When using the Transactional HTTP endpoint 
<http://docs.neo4j.org/chunked/stable/rest-api-transactional.html>, should 
I expect better performance because all my queries run under a single 
transaction? (But unlike batch operations, I will have multiple HTTP 
calls, one per query?)
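(Though, re-reading the endpoint docs, a single POST can carry a *list* of statements, so maybe it needn't be one HTTP call per query after all. Sketching the request body for POST /db/data/transaction/commit in Python, with an invented helper name:)

```python
import json

def build_tx_payload(statements):
    """statements: a list of (cypher, params) pairs.
    Body for POST /db/data/transaction/commit: every statement in the
    list runs in the same transaction, within one HTTP call."""
    return json.dumps({"statements": [
        {"statement": query, "parameters": params}
        for query, params in statements]})
```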

7. Assuming I am using Neo4j 2.0.1, which route should I take to improve 
insert performance: batch operations or the transactional endpoint? Any 
clear-cut scenarios for using each one?

8. I am aware of Michael's CSV batch import 
<https://github.com/jexp/batch-import>, but as far as I can tell it is 
really designed for initial loads and is not suitable for day-to-day 
operation, right?

9. I read Tatham (Neo4jClient) saying 
<http://hg.readify.net/neo4jclient/issue/5/batch-support> that "mutable 
Cypher makes batching rather redundant" because you can "Just send a 
single Cypher call that does all your creates, updates and deletes in one 
hit", but that's not very practical coding-wise, as it creates very large 
and complex queries that also fail once they reach a certain size. I can 
create a single Post (9 nodes, 8 relationships) using this method, but I 
am still left with 1 REST call and 1 transaction per Post.
Am I missing something regarding the ability to batch things under 1 
request/transaction using Neo4jClient?
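One middle ground I may still try: pass many posts as a single collection parameter and iterate server-side with FOREACH (UNWIND only arrives in 2.1). That keeps one REST call and one transaction for a whole batch of posts without string-concatenating a giant query. A Python sketch, abbreviated to two of the nine nodes per post; whether MERGE-inside-FOREACH performs well here is exactly the kind of thing I'd have to benchmark:

```python
def build_multi_post_statement(posts):
    """posts: a list of maps, one per Post unit. One statement, one
    collection parameter, however many posts. Labels/properties invented;
    abbreviated to the Post and its Author."""
    cypher = (
        "FOREACH (row IN {posts} | "
        "  MERGE (author:Person {id: row.authorId}) "
        "  MERGE (p:Post {id: row.postId}) "
        "  CREATE UNIQUE (author)-[:WROTE]->(p))"
    )
    return cypher, {"posts": posts}
```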

10. Do you know of any .NET library that implements Batch Operations?

11. Except for Cypher.NET 
<http://mtranter.com/2013/09/21/cypher-net-a-neo4j-cypher-api/>, do you 
know of any other library that implements the transactional endpoint?

12. Is there a programmatic way for me to make sure all my MATCHes and 
MERGEs are using a schema index and not doing a full scan? (i.e. making 
sure I didn't miss creating an index)
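Not fully programmatic, but GET /db/data/schema/index returns the indexes that exist (a list of {"label": ..., "property_keys": [...]} entries in 2.0), so a startup check could at least diff that against the label/property pairs my queries rely on. A sketch, where the expected set is whatever the model uses:

```python
def missing_indexes(expected, index_listing):
    """expected: set of (label, property) pairs the MATCH/MERGE clauses
    rely on. index_listing: parsed JSON from GET /db/data/schema/index.
    Returns the expected pairs that have no matching index."""
    present = {(entry["label"], prop)
               for entry in index_listing
               for prop in entry["property_keys"]}
    return expected - present
```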

13. At no point did I find that using parallel threads helps with write 
performance. I thought I was getting an improved write rate when adding my 
mockup Post data in 8 parallel loops as suggested here 
<http://architects.dzone.com/articles/intensive-analysis-neo4j-java> 
(kinda old), but in fact write speed just declined faster. It looks like I 
need to solve the problem in a single thread first before considering 
increasing the parallelism.
 
14. I read in the performance guide 
<http://docs.neo4j.org/chunked/stable/performance-guide.html> that: 
"to get maximum write performance when using Neo4j make sure the OS is 
configured not to write out any of the dirty pages caused by writes to the 
memory mapped regions of the store files"

If I run a ~2GB database on a 16GB machine with the following settings: 
neostore.nodestore.db.mapped_memory=4G
neostore.relationshipstore.db.mapped_memory=4G
neostore.propertystore.db.mapped_memory=2G
neostore.propertystore.db.strings.mapped_memory=2G
neostore.propertystore.db.arrays.mapped_memory=130M
wrapper.java.initmemory=4096
wrapper.java.maxmemory=8192
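Sanity-checking my own numbers, though: the mapped-memory regions plus the maximum heap add up to more than the machine's 16GB (and, if I remember the 2.0 docs correctly, on Windows the mapped memory is taken from inside the Java heap, in which case ~12GB of mappings can't fit in an 8GB heap at all). Quick arithmetic:

```python
# Values from the settings above, in GB.
mapped_gb = 4.0 + 4.0 + 2.0 + 2.0 + 0.13   # node, rel, property, strings, arrays
heap_max_gb = 8192 / 1024.0                # wrapper.java.maxmemory=8192 (MB)
total_gb = mapped_gb + heap_max_gb
print(round(total_gb, 2))                  # -> 20.13, well over 16GB of RAM
```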

Can I assume that excessive writing of dirty pages is NOT the reason my 
write performance is low?

What settings can I change in Neo4J or in Windows to avoid excessive dirty 
page writes ?

15. Are there any other "production-system quality" methods to insert many 
nodes at once?

16. As I mentioned, I am getting good performance reading and analyzing 
existing data, and having gone through several other graph DBs before, I 
am very happy with the rapid development Neo4j and Cypher enable me to 
achieve. Yet my colleagues are asking why we are investing in Cypher over 
REST instead of embedded mode; can I stand tall and explain to them that 
the Neo4j team sees Cypher as "our future API" 
<http://www.rene-pickhardt.de/get-the-full-neo4j-power-by-using-the-core-java-api-for-traversing-your-graph-data-base-instead-of-cypher-query-language/>? 
:-)

I'm already feeling much better just from venting all these questions out 
of my head and into writing :-)
Thanks,
Ben.

-- 
You received this message because you are subscribed to the Google Groups 
"Neo4j" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
For more options, visit https://groups.google.com/d/optout.
