[Neo4j] On relative performance of native querying in Java vs. Cypher querying.

costas . barbis Mon, 17 Mar 2014 09:11:43 -0700


*Author's note:* Even though this post seems to be partially delving into 
the technical characteristics of Cypher it has been created as an initiator 
for discussion on relative scalability and performance of native vs. cypher 
so I believed it should be placed in the group forum as opposed to stack 
overflow. If the coordinators disagree on this decision please move it to 
SO. Thank you. 
** neo4j version, library versions, OS, jdk:*


Neo4j Stable Release 2.0.1 [community edition], with relevant libraries.
OS: windows 7 (64 bit), service pack 1, 8gb ram, i5 quad core desktop 
(without hyper-threading).
jdk: java 7 (jdk1.7.0_51).
all applications are in java, using embedded database instance (java 
runtime and neo4j vm/other arguments are default (empty) in each case).

Fellow Neo4J enthusiasts,
After performing various simple benchmarks it would appear that using 
native java to query a neo graph outperforms doing the same using cypher. 
This would normally be surprising in a relational context but i can see how 
it can make sense in this graph context (as virtual memory is heavily used 
etc.). 
As such I would like to know if this is in fact the case for the majority 
of queries that can be made in neo4j, or whether I am missing something 
glaringly obvious in my tests.
 
*A summary of my test parameters is as follows:*

   - a graph of 100k nodes (each having one property) is used as a baseline 
   (relationships and more properties are added for later experiments).
   - after graph insertion the jvm is restarted before querying in order to 
   clear virtual memory.
   - the same query is performed using cypher and neo4j in each case.
   - the jvm is restarted for every subsequent query experiment to ensure 
   virtual memory is wiped (vm is embedded into java heap in windows by 
   default).
   - during each experiment 10 repetitions of the query are performed, with 
   each repetition having a small change in one of the query variables. This 
   is to test execution time for the first instance (hence before any nodes 
   are in virtual memory) as well as subsequent calls where most of the nodes 
   queried are in RAM.
   - execution time is recorded using System.nanotime() for accuracy.
   - the cypher queries have been optimised as much as possible (with my 
   current knowledge) using a single execution engine per experiment and the 
   changing variable is injected as a parameter to the query.

*Hypothesis:*
For all experiments (details below) both for initial query execution as 
well as subsequent executions, the native java api outperforms cypher.

*The following queries were performed (code presented below, this is a 
summary in text for conciseness):*

   - get me all nodes which have property 'i' equal to {max}. NB: The 
   variable {max} ranges from 0 to 100k in the nodes and the queries test all 
   possible ranges to ensure that the java execution does not have a 
   preferential treatment by finding results close to the top of the node file 
   (as is shown in the results).
   - get me all nodes which have property 'i' equal to {max}, through a 
   single indexed node. In my context (model-driven engineering research) it 
   is common to have a single (or very few) starting points for a query, so my 
   second test simulated this behaviour by creating a single "source" node 
   with relationship to the 100k nodes to be queried. as such, the query goes 
   to the lucene index to find the node and then traverses the relationship it 
   has to the 100k nodes in order to be executed. This also avoids using the 
   GlobalGraphOperations.at(database).getAllNodes() operation in java (which 
   is useful as it would never be used in my context).
   - get me all nodes which have property 'i' equal to {max} and 
   relationship named (of relationship type with name) {name} to another node. 
   this is a simple extension to the first query which uses a one-hop 
   traversal as well.
   - get me all nodes which have property 'i' equal to {max} and 'i2' equal 
   to {max2}.
   - get me all nodes which have property 'i' equal to {max} and 'i2' equal 
   to {max2} and relationship named {name} to another node. 

*Results:*

I will only present the detailed results of the first query as it would get 
tediously long to present them all (they are all included as a snippet link 
below), and as mentioned above, all of them seem to support the statement 
that java 

outperforms cypher.

*Query 1 results:*
Java (microseconds): 15415  result 'i': 1  testing equality on: 1
Java (microseconds): 288  result 'i': 2  testing equality on: 2
Java (microseconds): 333  result 'i': 3  testing equality on: 3
Java (microseconds): 304  result 'i': 4  testing equality on: 4
Java (microseconds): 303  result 'i': 5  testing equality on: 5
Java (microseconds): 319  result 'i': 6  testing equality on: 6
Java (microseconds): 331  result 'i': 7  testing equality on: 7
Java (microseconds): 355  result 'i': 8  testing equality on: 8
Java (microseconds): 368  result 'i': 9  testing equality on: 9
Java (microseconds): 385  result 'i': 10  testing equality on: 10

Java (microseconds): 175027  result 'i': 1000  testing equality on: 1000
Java (microseconds): 146228  result 'i': 2000  testing equality on: 2000
Java (microseconds): 126249  result 'i': 3000  testing equality on: 3000
Java (microseconds): 98282  result 'i': 4000  testing equality on: 4000
Java (microseconds): 69881  result 'i': 5000  testing equality on: 5000
Java (microseconds): 38536  result 'i': 6000  testing equality on: 6000
Java (microseconds): 24090  result 'i': 7000  testing equality on: 7000
Java (microseconds): 25140  result 'i': 8000  testing equality on: 8000
Java (microseconds): 25849  result 'i': 9000  testing equality on: 9000
Java (microseconds): 26664  result 'i': 10000  testing equality on: 10000

Java (microseconds): 1704711  result 'i': 99997  testing equality on: 99997
Java (microseconds): 119149  result 'i': 99998  testing equality on: 99998
Java (microseconds): 49827  result 'i': 99999  testing equality on: 99999
Java (microseconds): 60392  result 'i': -1   testing equality on: 100000
Java (microseconds): 42451  result 'i': -1   testing equality on: 100001
Java (microseconds): 35205  result 'i': -1   testing equality on: 100002
Java (microseconds): 36279  result 'i': -1   testing equality on: 100003
Java (microseconds): 34999  result 'i': -1   testing equality on: 100004
Java (microseconds): 35179  result 'i': -1   testing equality on: 100005
Java (microseconds): 45571  result 'i': -1   testing equality on: 100006

Cypher [prepared with 1 execution engine] (microseconds): 2688552  result 
'i': 100  testing equality on: 100
Cypher [prepared with 1 execution engine] (microseconds): 134839  result 
'i': 200  testing equality on: 200
Cypher [prepared with 1 execution engine] (microseconds): 116128  result 
'i': 300  testing equality on: 300
Cypher [prepared with 1 execution engine] (microseconds): 96070   result 
'i': 400  testing equality on: 400
Cypher [prepared with 1 execution engine] (microseconds): 111627  result 
'i': 500  testing equality on: 500
Cypher [prepared with 1 execution engine] (microseconds): 116955  result 
'i': 600  testing equality on: 600
Cypher [prepared with 1 execution engine] (microseconds): 98720   result 
'i': 700  testing equality on: 700
Cypher [prepared with 1 execution engine] (microseconds): 96051   result 
'i': 800  testing equality on: 800
Cypher [prepared with 1 execution engine] (microseconds): 106406  result 
'i': 900  testing equality on: 900
Cypher [prepared with 1 execution engine] (microseconds): 97068   result 
'i': 1000  testing equality on: 1000

Cypher [prepared with 1 execution engine] (microseconds): 2651371  result 
'i': 1000  testing equality on: 1000
Cypher [prepared with 1 execution engine] (microseconds): 121623  result 
'i': 2000  testing equality on: 2000
Cypher [prepared with 1 execution engine] (microseconds): 95211   result 
'i': 3000  testing equality on: 3000
Cypher [prepared with 1 execution engine] (microseconds): 79345   result 
'i': 4000  testing equality on: 4000
Cypher [prepared with 1 execution engine] (microseconds): 88915   result 
'i': 5000  testing equality on: 5000
Cypher [prepared with 1 execution engine] (microseconds): 100527  result 
'i': 6000  testing equality on: 6000
Cypher [prepared with 1 execution engine] (microseconds): 77890   result 
'i': 7000  testing equality on: 7000
Cypher [prepared with 1 execution engine] (microseconds): 77430   result 
'i': 8000  testing equality on: 8000
Cypher [prepared with 1 execution engine] (microseconds): 76451   result 
'i': 9000  testing equality on: 9000
Cypher [prepared with 1 execution engine] (microseconds): 86732   result 
'i': 10000  testing equality on: 10000

As we can clearly see, Java can "cheat" on low equality tests (as we break 
after finding the node as we assume (and know in our context) 'i' is 
unique) but more interestingly even when it fails (checking i > 100k) it is 
still roughly 2 

times as fast as cypher both for initial queries and subsequent ones (this 
is a good test for non-unique properties too as it forces java to iterate 
through all of the nodes present).
This shows two things in my view:
1) Java can optimise for unique results. As far as I am aware cypher cannot 
be told to stop when it finds a result we know is unique (such as an ISBN 
of a book for example or any other unique property in a node).
2) For non-unique results (or for a failed query) it is still faster than 
cypher.

After getting these results my curiosity prompted me to expand the scope by 
adding relationships and a second attribute to see if the same trend 
continues, and it did.

*Links to code snippets:*
first query:
https://gist.github.com/anonymous/9601553

second query:
https://gist.github.com/anonymous/9601556

third query:
https://gist.github.com/anonymous/9601571

fourth:
https://gist.github.com/anonymous/9601581

fifth:
https://gist.github.com/anonymous/9601590

entire result set link:
https://gist.github.com/anonymous/9601486

*Discussion:*

My main question is whether these results are to be expected (or even 
obvious to some) which would mean I will just use native java in my 
application.
If my results are wrong/misleading I would appreciate knowing why but if 
not, a discussion on how to improve cypher to attempt to close the gap may 
be useful.

Other notes and observations I had (non-conclusive as the tests performed 
were not as thorough as the above):
using 'count' in cypher seems to destroy execution time (whereas normally 
in sql it improves it).
adding depth to a cypher search (for example going from (a)-[]->(b) to 
(a)-[]->(b)-[]->(c)) seems to scale a lot worst in cypher than java.

Thank you for reading this wall,

Costas

 

-- 
You received this message because you are subscribed to the Google Groups 
"Neo4j" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
For more options, visit https://groups.google.com/d/optout.

[Neo4j] On relative performance of native querying in Java vs. Cypher querying.

Reply via email to