Dear Michael,

Thank you for your detailed response and the many clarifications.
I had considered some of these concerns (but did not want to make the set of tests even longer), while others (like the limit expression) I was not aware of at all. It has been very helpful in furthering my understanding of Neo.

Cheers,

On Monday, March 17, 2014 4:05:41 PM UTC, [email protected] wrote:
> *Author's note:* Even though this post partially delves into the technical characteristics of Cypher, it is meant to start a discussion on the relative scalability and performance of native vs. Cypher, so I believed it should be placed in the group forum as opposed to Stack Overflow. If the coordinators disagree with this decision please move it to SO. Thank you.
>
> *neo4j version, library versions, OS, jdk:*
>
> Neo4j Stable Release 2.0.1 [community edition], with relevant libraries.
> OS: Windows 7 (64 bit), service pack 1, 8 GB RAM, i5 quad core desktop (without hyper-threading).
> jdk: Java 7 (jdk1.7.0_51).
> All applications are in Java, using an embedded database instance (Java runtime and neo4j vm/other arguments are default (empty) in each case).
>
> Fellow Neo4j enthusiasts,
> After performing various simple benchmarks it would appear that using native Java to query a Neo graph outperforms doing the same using Cypher.
> This would normally be surprising in a relational context but I can see how it can make sense in this graph context (as virtual memory is heavily used etc.).
> As such I would like to know if this is in fact the case for the majority of queries that can be made in neo4j, or whether I am missing something glaringly obvious in my tests.
>
> *A summary of my test parameters is as follows:*
>
> - a graph of 100k nodes (each having one property) is used as a baseline (relationships and more properties are added for later experiments).
> - after graph insertion the JVM is restarted before querying in order to clear virtual memory.
> - the same query is performed using Cypher and the native Java API in each case.
> - the JVM is restarted for every subsequent query experiment to ensure virtual memory is wiped (vm is embedded into the Java heap in Windows by default).
> - during each experiment 10 repetitions of the query are performed, with each repetition having a small change in one of the query variables. This is to test execution time for the first instance (hence before any nodes are in virtual memory) as well as subsequent calls where most of the nodes queried are in RAM.
> - execution time is recorded using System.nanoTime() for accuracy.
> - the Cypher queries have been optimised as much as possible (with my current knowledge) using a single execution engine per experiment, and the changing variable is injected as a parameter to the query.
>
> *Hypothesis:*
> For all experiments (details below), both for initial query execution and for subsequent executions, the native Java API outperforms Cypher.
>
> *The following queries were performed (code presented below, this is a summary in text for conciseness):*
>
> - get me all nodes which have property 'i' equal to {max}. NB: The variable {max} ranges from 0 to 100k in the nodes and the queries test all possible ranges to ensure that the Java execution does not get preferential treatment by finding results close to the top of the node file (as is shown in the results).
> - get me all nodes which have property 'i' equal to {max}, through a single indexed node. In my context (model-driven engineering research) it is common to have a single (or very few) starting points for a query, so my second test simulated this behaviour by creating a single "source" node with a relationship to the 100k nodes to be queried. As such, the query goes to the Lucene index to find the node and then traverses the relationship it has to the 100k nodes in order to be executed. This also avoids using the GlobalGraphOperations.at(database).getAllNodes() operation in Java (which is useful as it would never be used in my context).
> - get me all nodes which have property 'i' equal to {max} and relationship named (of relationship type with name) {name} to another node. This is a simple extension to the first query which uses a one-hop traversal as well.
> - get me all nodes which have property 'i' equal to {max} and 'i2' equal to {max2}.
> - get me all nodes which have property 'i' equal to {max} and 'i2' equal to {max2} and relationship named {name} to another node.
>
> *Results:*
>
> I will only present the detailed results of the first query as it would get tediously long to present them all (they are all included as a snippet link below), and as mentioned above, all of them seem to support the statement that Java outperforms Cypher.
>
> *Query 1 results:*
> Java (microseconds): 15415 result 'i': 1 testing equality on: 1
> Java (microseconds): 288 result 'i': 2 testing equality on: 2
> Java (microseconds): 333 result 'i': 3 testing equality on: 3
> Java (microseconds): 304 result 'i': 4 testing equality on: 4
> Java (microseconds): 303 result 'i': 5 testing equality on: 5
> Java (microseconds): 319 result 'i': 6 testing equality on: 6
> Java (microseconds): 331 result 'i': 7 testing equality on: 7
> Java (microseconds): 355 result 'i': 8 testing equality on: 8
> Java (microseconds): 368 result 'i': 9 testing equality on: 9
> Java (microseconds): 385 result 'i': 10 testing equality on: 10
>
> Java (microseconds): 175027 result 'i': 1000 testing equality on: 1000
> Java (microseconds): 146228 result 'i': 2000 testing equality on: 2000
> Java (microseconds): 126249 result 'i': 3000 testing equality on: 3000
> Java (microseconds): 98282 result 'i': 4000 testing equality on: 4000
> Java (microseconds): 69881 result 'i': 5000 testing equality on: 5000
> Java (microseconds): 38536 result 'i': 6000 testing equality on: 6000
> Java (microseconds): 24090 result 'i': 7000 testing equality on: 7000
> Java (microseconds): 25140 result 'i': 8000 testing equality on: 8000
> Java (microseconds): 25849 result 'i': 9000 testing equality on: 9000
> Java (microseconds): 26664 result 'i': 10000 testing equality on: 10000
>
> Java (microseconds): 1704711 result 'i': 99997 testing equality on: 99997
> Java (microseconds): 119149 result 'i': 99998 testing equality on: 99998
> Java (microseconds): 49827 result 'i': 99999 testing equality on: 99999
> Java (microseconds): 60392 result 'i': -1 testing equality on: 100000
> Java (microseconds): 42451 result 'i': -1 testing equality on: 100001
> Java (microseconds): 35205 result 'i': -1 testing equality on: 100002
> Java (microseconds): 36279 result 'i': -1 testing equality on: 100003
> Java (microseconds): 34999 result 'i': -1 testing equality on: 100004
> Java (microseconds): 35179 result 'i': -1 testing equality on: 100005
> Java (microseconds): 45571 result 'i': -1 testing equality on: 100006
>
> Cypher [prepared with 1 execution engine] (microseconds): 2688552 result 'i': 100 testing equality on: 100
> Cypher [prepared with 1 execution engine] (microseconds): 134839 result 'i': 200 testing equality on: 200
> Cypher [prepared with 1 execution engine] (microseconds): 116128 result 'i': 300 testing equality on: 300
> Cypher [prepared with 1 execution engine] (microseconds): 96070 result 'i': 400 testing equality on: 400
> Cypher [prepared with 1 execution engine] (microseconds): 111627 result 'i': 500 testing equality on: 500
> Cypher [prepared with 1 execution engine] (microseconds): 116955 result 'i': 600 testing equality on: 600
> Cypher [prepared with 1 execution engine] (microseconds): 98720 result 'i': 700 testing equality on: 700
> Cypher [prepared with 1 execution engine] (microseconds): 96051 result 'i': 800 testing equality on: 800
> Cypher [prepared with 1 execution engine] (microseconds): 106406 result 'i': 900 testing equality on: 900
> Cypher [prepared with 1 execution engine] (microseconds): 97068 result 'i': 1000 testing equality on: 1000
>
> Cypher [prepared with 1 execution engine] (microseconds): 2651371 result 'i': 1000 testing equality on: 1000
> Cypher [prepared with 1 execution engine] (microseconds): 121623 result 'i': 2000 testing equality on: 2000
> Cypher [prepared with 1 execution engine] (microseconds): 95211 result 'i': 3000 testing equality on: 3000
> Cypher [prepared with 1 execution engine] (microseconds): 79345 result 'i': 4000 testing equality on: 4000
> Cypher [prepared with 1 execution engine] (microseconds): 88915 result 'i': 5000 testing equality on: 5000
> Cypher [prepared with 1 execution engine] (microseconds): 100527 result 'i': 6000 testing equality on: 6000
> Cypher [prepared with 1 execution engine] (microseconds): 77890 result 'i': 7000 testing equality on: 7000
> Cypher [prepared with 1 execution engine] (microseconds): 77430 result 'i': 8000 testing equality on: 8000
> Cypher [prepared with 1 execution engine] (microseconds): 76451 result 'i': 9000 testing equality on: 9000
> Cypher [prepared with 1 execution engine] (microseconds): 86732 result 'i': 10000 testing equality on: 10000
>
> As we can clearly see, Java can "cheat" on low equality tests (we break after finding the node, as we assume (and know in our context) that 'i' is unique). More interestingly, even when it fails (checking i > 100k) it is still roughly 2 times as fast as Cypher, both for initial queries and subsequent ones (this is a good test for non-unique properties too, as it forces Java to iterate through all of the nodes present).
> This shows two things in my view:
> 1) Java can optimise for unique results. As far as I am aware Cypher cannot be told to stop when it finds a result we know is unique (such as the ISBN of a book or any other unique property in a node).
> 2) For non-unique results (or for a failed query) it is still faster than Cypher.
>
> After getting these results my curiosity prompted me to expand the scope by adding relationships and a second attribute to see if the same trend continues, and it did.
>
> *Links to code snippets:*
> first query:
> https://gist.github.com/anonymous/9601553
>
> second query:
> https://gist.github.com/anonymous/9601556
>
> third query:
> https://gist.github.com/anonymous/9601571
>
> fourth:
> https://gist.github.com/anonymous/9601581
>
> fifth:
> https://gist.github.com/anonymous/9601590
>
> entire result set link:
> https://gist.github.com/anonymous/9601486
>
> *Discussion:*
>
> My main question is whether these results are to be expected (or even obvious to some), which would mean I will just use native Java in my application.
> If my results are wrong/misleading I would appreciate knowing why, but if not, a discussion on how to improve Cypher to attempt to close the gap may be useful.
>
> Other notes and observations I had (non-conclusive, as the tests performed were not as thorough as the above):
> - using 'count' in Cypher seems to destroy execution time (whereas normally in SQL it improves it).
> - adding depth to a Cypher search (for example going from (a)-[]->(b) to (a)-[]->(b)-[]->(c)) seems to scale a lot worse in Cypher than in Java.
>
> Thank you for reading this wall,
>
> Costas
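
For readers following the thread, the early-exit behaviour described in the quoted results (stop scanning as soon as a unique 'i' is found, and report -1 after a full pass when nothing matches) can be sketched in plain Java without any Neo4j dependency. This is only an illustrative sketch under stated assumptions: the class name ScanSketch, the int array standing in for the 100k nodes, and all helper names are hypothetical and are not the code from the gists, so the timings it prints are not comparable to the figures quoted above.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for the benchmark: an int[] plays the role of the
// 100k nodes, each holding a unique property 'i'. No Neo4j API is used.
class ScanSketch {

    // Keeps the lookup result alive so the JIT is less likely to discard it.
    static int sink;

    // Builds n values 0..n-1, mirroring one unique 'i' property per node.
    static int[] buildValues(int n) {
        int[] values = new int[n];
        for (int k = 0; k < n; k++) {
            values[k] = k;
        }
        return values;
    }

    // Linear scan that stops at the first match (valid only because 'i' is
    // assumed unique). Returns -1 after a full pass when nothing matches.
    static int findUnique(int[] values, int target) {
        for (int value : values) {
            if (value == target) {
                return value; // early exit: no need to look further
            }
        }
        return -1;
    }

    // Times a single lookup with System.nanoTime(), reported in microseconds.
    static long timeLookupMicros(int[] values, int target) {
        long start = System.nanoTime();
        sink = findUnique(values, target);
        return TimeUnit.NANOSECONDS.toMicros(System.nanoTime() - start);
    }

    public static void main(String[] args) {
        int[] values = buildValues(100_000);
        // 10 repetitions with a slightly different target each time; the
        // last five targets do not exist, which forces a full scan.
        for (int rep = 0; rep < 10; rep++) {
            int target = 99_995 + rep;
            long micros = timeLookupMicros(values, target);
            System.out.println("Java (microseconds): " + micros
                    + " result 'i': " + sink
                    + " testing equality on: " + target);
        }
    }
}
```

On the Cypher side, a LIMIT clause is the usual way to ask for at most one row, which is presumably the limit expression mentioned at the top of this message.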
