Hey, I'm developing a prototype for a social network analysis product using Neo4j 2.0.1 and Neo4jClient (.NET). I'm extremely happy with Neo4j and the Cypher query language. Great work!
Querying and analyzing existing data executes at good rates, and I do not see degradation as the database size increases. My problem is that my inserts are too slow to provide the expected user experience (even on an empty database), and inserts get slower as the database size increases.

My dev machine is an i7, 16GB RAM, Samsung 840 SSD, Windows 7. The Neo4j Java heap is configured to take up 8GB (the DB size during tests is around 2GB). When I encountered the problem I was using real data from Facebook, but I have since switched to mock data so I can reproduce the exact same scenario each time.

The basic unit of data that I store is a "Post", composed of: Post + Post Author + Comment1 + Comment1 Author + Comment2 + Comment2 Author + Liker1 + Liker2 + LinkedMediaItem. So each data unit I save is 9 nodes + 8 relationships (each node has 4 short string properties; the relationships have no properties).

I was able to insert 19,636 posts in 6,330 seconds; that's 28 nodes/sec and 31 rels/sec (~60 elements/sec). This data insertion is NOT an initial data load; it is a normal day-to-day action of the application, initiated on a regular basis when a user runs a search on live social networks. Since a normal search with the system is supposed to produce about this many posts (~20,000), 1.7 hours (6,330s) is not a reasonable time for the user to wait for the data to be written to the DB.

- The 9 nodes and 8 relationships per Post are added using a *single* Cypher query with multiple MERGE and CREATE UNIQUE clauses.
- All nodes are :labeled appropriately.
- I am using Neo4j 2.0 schema-based indexes.
- I am using named parameters; I never build Cypher strings by concatenation.

I read everything I could find online regarding Neo4j inserts and was left confused. I fully acknowledge that the best performance is achieved using embedded mode, but for now I am stuck with .NET, and I like Cypher and believe in HTTP access too much.
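For context, here is a trimmed-down sketch of the kind of single statement I run per Post (labels, property names, and IDs here are made up; the real statement covers all 9 nodes and 8 relationships), MERGE for the nodes, CREATE UNIQUE for the relationships:

```python
import re

# Simplified per-Post statement: MERGE nodes on their id, then
# CREATE UNIQUE the relationships between them. {param} is the
# Neo4j 2.0 named-parameter syntax.
POST_CYPHER = """
MERGE (author:Person {id: {authorId}})
MERGE (p:Post {id: {postId}})
CREATE UNIQUE (author)-[:AUTHORED]->(p)
MERGE (ca:Person {id: {commentAuthorId}})
MERGE (c:Comment {id: {commentId}})
CREATE UNIQUE (ca)-[:AUTHORED]->(c)
CREATE UNIQUE (p)-[:HAS_COMMENT]->(c)
"""

params = {
    "authorId": "fb-123", "postId": "post-1",
    "commentAuthorId": "fb-456", "commentId": "comment-1",
}

# Sanity check: every {param} placeholder has a supplied value.
placeholders = set(re.findall(r"\{(\w+)\}", POST_CYPHER))
assert placeholders == set(params)
```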
I think I know the answers to some of the questions below, but I would like to be reassured, and I also think it could benefit many newcomers to Neo4j to have all of these answered in one place, so here we go:

1. Is this a bad insert rate considering I am using Cypher over REST? (I saw examples online with both worse and better rates.)

2. Is there a chance that NOT creating all 9 nodes and 8 relationships of a single Post in one Cypher query will make it faster?

3. Is the *main reason* my inserts are relatively slow on an *empty DB* that each request uses a transaction of its own, i.e. a transaction is opened and closed on each request?

4. "Batch Operations<http://docs.neo4j.org/chunked/stable/rest-api-batch-ops.html>" execute all the queries sent under a single transaction, right? So the 2 reasons I should expect better performance when using Batch Operations (assuming 1 batch) are: A. a single REST call for all queries; B. a single transaction for many queries. Right?

5. How do I determine the proper batch size? By testing? Is the generally recommended size indeed ~20,000-50,000 commands<https://groups.google.com/forum/#!topic/neo4j/YpeewfD8-Is>?

6. When using the Transactional HTTP endpoint<http://docs.neo4j.org/chunked/stable/rest-api-transactional.html>, should I expect better performance because all my queries run under a single transaction? (But unlike batch operations I will have multiple HTTP calls, one per query?)

7. Assuming I am using Neo4j 2.0.1, which route should I take to improve insert performance: batch operations or the transactional endpoint? Are there any clear-cut scenarios for using each one?

8. I am aware of Michael's CSV batch importer<https://github.com/jexp/batch-import>, but as far as I can tell it is really designed for initial loads and is not suitable for day-to-day operation, right?
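To make questions 4 and 5 concrete, this is my understanding of the Batch Operations payload: an array of jobs, each wrapping a legacy /cypher call, all sent in one POST to /db/data/batch. A sketch in Python (the batch size here is just a placeholder I would tune by testing; localhost URL assumed):

```python
import json

CYPHER = "MERGE (p:Post {id: {postId}}) RETURN id(p)"

# One job per Post; the whole array is executed server-side,
# which (per question 4) should mean one transaction per batch.
jobs = [
    {"method": "POST", "to": "/cypher", "id": i,
     "body": {"query": CYPHER, "params": {"postId": "post-%d" % i}}}
    for i in range(20000)
]

# Question 5: the right batch size is found empirically; 10,000 is a guess.
BATCH_SIZE = 10000
batches = [jobs[i:i + BATCH_SIZE] for i in range(0, len(jobs), BATCH_SIZE)]
payloads = [json.dumps(batch) for batch in batches]
# each payload would be POSTed to http://localhost:7474/db/data/batch
```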
9. I read Tatham (Neo4jClient) saying<http://hg.readify.net/neo4jclient/issue/5/batch-support> that "mutable Cypher makes batching rather redundant" because you can "Just send a single Cypher call that does all your creates, updates and deletes in one hit", but that's not very practical coding-wise, as it creates very large and complex queries that also fail when reaching a certain size. I can create a single Post (9 nodes, 8 relationships) using this method, but I am still left with 1 REST call and 1 transaction per Post. Am I missing something regarding the ability to batch things under 1 request/transaction using Neo4jClient?

10. Do you know of any .NET library that implements Batch Operations?

11. Except for Cypher.NET<http://mtranter.com/2013/09/21/cypher-net-a-neo4j-cypher-api/>, do you know of any other library that implements the transactional endpoint?

12. Is there a programmatic way to make sure all my MATCHes and MERGEs are using a schema index and not doing a full scan (i.e. making sure I didn't miss creating an index)?

13. At no point did I find that using parallel threads helps with write performance. I thought I was getting an improved write rate when adding my mock Post data in 8 parallel loops, as suggested here<http://architects.dzone.com/articles/intensive-analysis-neo4j-java> (somewhat old), but in fact the write speed just declined faster. It looks like I need to solve the problem in a single thread first before considering increasing parallelism.
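And for questions 6 and 9, my understanding of the transactional endpoint is that one HTTP call can carry *many* statements, so it need not be one call per query. A sketch of the request body (the commit URL and localhost address are assumptions on my part):

```python
import json

def tx_payload(post_ids):
    """Build one transactional-endpoint request carrying many statements."""
    return {"statements": [
        {"statement": "MERGE (p:Post {id: {postId}}) RETURN id(p)",
         "parameters": {"postId": pid}}
        for pid in post_ids
    ]}

body = json.dumps(tx_payload(["post-1", "post-2", "post-3"]))
# A single round trip would commit all three statements together, e.g.:
# requests.post("http://localhost:7474/db/data/transaction/commit",
#               data=body, headers={"Content-Type": "application/json"})
```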
14. I read in the performance guide<http://docs.neo4j.org/chunked/stable/performance-guide.html> that: "to get maximum write performance when using Neo4j make sure the OS is configured not to write out any of the dirty pages caused by writes to the memory mapped regions of the store files". If I run a ~2GB database on a 16GB machine with the following settings:

neostore.nodestore.db.mapped_memory=4G
neostore.relationshipstore.db.mapped_memory=4G
neostore.propertystore.db.mapped_memory=2G
neostore.propertystore.db.strings.mapped_memory=2G
neostore.propertystore.db.arrays.mapped_memory=130M
wrapper.java.initmemory=4096
wrapper.java.maxmemory=8192

can I assume that excessive writing of dirty pages is NOT the reason my write performance is low? What settings can I change in Neo4j or in Windows to avoid excessive dirty page writes?

15. Are there any other "production-system quality" methods to insert many nodes at once?

16. As I mentioned, I am getting good performance reading and analyzing existing data, and having gone through several other graph DBs before, I am very happy with the rapid development Neo4j and Cypher enable me to achieve. Yet my colleagues are asking me why we are investing in Cypher over REST instead of embedded mode; can I stand tall and explain to them that the Neo4j team sees Cypher as "our future API<http://www.rene-pickhardt.de/get-the-full-neo4j-power-by-using-the-core-java-api-for-traversing-your-graph-data-base-instead-of-cypher-query-language/>"? :-)

I'm already feeling much better just by venting all these questions from my head into writing :-)

Thanks,
Ben.
