*Author's note:* Even though this post seems to be partially delving into
the technical characteristics of Cypher it has been created as an initiator
for discussion on relative scalability and performance of native vs. cypher
so I believed it should be placed in the group forum as opposed to stack
overflow. If the coordinators disagree on this decision please move it to
SO. Thank you.
** neo4j version, library versions, OS, jdk:*
Neo4j Stable Release 2.0.1 [community edition], with relevant libraries.
OS: windows 7 (64 bit), service pack 1, 8gb ram, i5 quad core desktop
(without hyper-threading).
jdk: java 7 (jdk1.7.0_51).
all applications are in java, using embedded database instance (java
runtime and neo4j vm/other arguments are default (empty) in each case).
Fellow Neo4J enthusiasts,
After performing various simple benchmarks it would appear that using
native java to query a neo graph outperforms doing the same using cypher.
This would normally be surprising in a relational context but i can see how
it can make sense in this graph context (as virtual memory is heavily used
etc.).
As such I would like to know if this is in fact the case for the majority
of queries that can be made in neo4j, or whether I am missing something
glaringly obvious in my tests.
*A summary of my test parameters is as follows:*
- a graph of 100k nodes (each having one property) is used as a baseline
(relationships and more properties are added for later experiments).
- after graph insertion the jvm is restarted before querying in order to
clear virtual memory.
- the same query is performed using cypher and neo4j in each case.
- the jvm is restarted for every subsequent query experiment to ensure
virtual memory is wiped (vm is embedded into java heap in windows by
default).
- during each experiment 10 repetitions of the query are performed, with
each repetition having a small change in one of the query variables. This
is to test execution time for the first instance (hence before any nodes
are in virtual memory) as well as subsequent calls where most of the nodes
queried are in RAM.
- execution time is recorded using System.nanotime() for accuracy.
- the cypher queries have been optimised as much as possible (with my
current knowledge) using a single execution engine per experiment and the
changing variable is injected as a parameter to the query.
*Hypothesis:*
For all experiments (details below) both for initial query execution as
well as subsequent executions, the native java api outperforms cypher.
*The following queries were performed (code presented below, this is a
summary in text for conciseness):*
- get me all nodes which have property 'i' equal to {max}. NB: The
variable {max} ranges from 0 to 100k in the nodes and the queries test all
possible ranges to ensure that the java execution does not have a
preferential treatment by finding results close to the top of the node file
(as is shown in the results).
- get me all nodes which have property 'i' equal to {max}, through a
single indexed node. In my context (model-driven engineering research) it
is common to have a single (or very few) starting points for a query, so my
second test simulated this behaviour by creating a single "source" node
with relationship to the 100k nodes to be queried. as such, the query goes
to the lucene index to find the node and then traverses the relationship it
has to the 100k nodes in order to be executed. This also avoids using the
GlobalGraphOperations.at(database).getAllNodes() operation in java (which
is useful as it would never be used in my context).
- get me all nodes which have property 'i' equal to {max} and
relationship named (of relationship type with name) {name} to another node.
this is a simple extension to the first query which uses a one-hop
traversal as well.
- get me all nodes which have property 'i' equal to {max} and 'i2' equal
to {max2}.
- get me all nodes which have property 'i' equal to {max} and 'i2' equal
to {max2} and relationship named {name} to another node.
*Results:*
I will only present the detailed results of the first query as it would get
tediously long to present them all (they are all included as a snippet link
below), and as mentioned above, all of them seem to support the statement
that java
outperforms cypher.
*Query 1 results:*
Java (microseconds): 15415 result 'i': 1 testing equality on: 1
Java (microseconds): 288 result 'i': 2 testing equality on: 2
Java (microseconds): 333 result 'i': 3 testing equality on: 3
Java (microseconds): 304 result 'i': 4 testing equality on: 4
Java (microseconds): 303 result 'i': 5 testing equality on: 5
Java (microseconds): 319 result 'i': 6 testing equality on: 6
Java (microseconds): 331 result 'i': 7 testing equality on: 7
Java (microseconds): 355 result 'i': 8 testing equality on: 8
Java (microseconds): 368 result 'i': 9 testing equality on: 9
Java (microseconds): 385 result 'i': 10 testing equality on: 10
Java (microseconds): 175027 result 'i': 1000 testing equality on: 1000
Java (microseconds): 146228 result 'i': 2000 testing equality on: 2000
Java (microseconds): 126249 result 'i': 3000 testing equality on: 3000
Java (microseconds): 98282 result 'i': 4000 testing equality on: 4000
Java (microseconds): 69881 result 'i': 5000 testing equality on: 5000
Java (microseconds): 38536 result 'i': 6000 testing equality on: 6000
Java (microseconds): 24090 result 'i': 7000 testing equality on: 7000
Java (microseconds): 25140 result 'i': 8000 testing equality on: 8000
Java (microseconds): 25849 result 'i': 9000 testing equality on: 9000
Java (microseconds): 26664 result 'i': 10000 testing equality on: 10000
Java (microseconds): 1704711 result 'i': 99997 testing equality on: 99997
Java (microseconds): 119149 result 'i': 99998 testing equality on: 99998
Java (microseconds): 49827 result 'i': 99999 testing equality on: 99999
Java (microseconds): 60392 result 'i': -1 testing equality on: 100000
Java (microseconds): 42451 result 'i': -1 testing equality on: 100001
Java (microseconds): 35205 result 'i': -1 testing equality on: 100002
Java (microseconds): 36279 result 'i': -1 testing equality on: 100003
Java (microseconds): 34999 result 'i': -1 testing equality on: 100004
Java (microseconds): 35179 result 'i': -1 testing equality on: 100005
Java (microseconds): 45571 result 'i': -1 testing equality on: 100006
Cypher [prepared with 1 execution engine] (microseconds): 2688552 result
'i': 100 testing equality on: 100
Cypher [prepared with 1 execution engine] (microseconds): 134839 result
'i': 200 testing equality on: 200
Cypher [prepared with 1 execution engine] (microseconds): 116128 result
'i': 300 testing equality on: 300
Cypher [prepared with 1 execution engine] (microseconds): 96070 result
'i': 400 testing equality on: 400
Cypher [prepared with 1 execution engine] (microseconds): 111627 result
'i': 500 testing equality on: 500
Cypher [prepared with 1 execution engine] (microseconds): 116955 result
'i': 600 testing equality on: 600
Cypher [prepared with 1 execution engine] (microseconds): 98720 result
'i': 700 testing equality on: 700
Cypher [prepared with 1 execution engine] (microseconds): 96051 result
'i': 800 testing equality on: 800
Cypher [prepared with 1 execution engine] (microseconds): 106406 result
'i': 900 testing equality on: 900
Cypher [prepared with 1 execution engine] (microseconds): 97068 result
'i': 1000 testing equality on: 1000
Cypher [prepared with 1 execution engine] (microseconds): 2651371 result
'i': 1000 testing equality on: 1000
Cypher [prepared with 1 execution engine] (microseconds): 121623 result
'i': 2000 testing equality on: 2000
Cypher [prepared with 1 execution engine] (microseconds): 95211 result
'i': 3000 testing equality on: 3000
Cypher [prepared with 1 execution engine] (microseconds): 79345 result
'i': 4000 testing equality on: 4000
Cypher [prepared with 1 execution engine] (microseconds): 88915 result
'i': 5000 testing equality on: 5000
Cypher [prepared with 1 execution engine] (microseconds): 100527 result
'i': 6000 testing equality on: 6000
Cypher [prepared with 1 execution engine] (microseconds): 77890 result
'i': 7000 testing equality on: 7000
Cypher [prepared with 1 execution engine] (microseconds): 77430 result
'i': 8000 testing equality on: 8000
Cypher [prepared with 1 execution engine] (microseconds): 76451 result
'i': 9000 testing equality on: 9000
Cypher [prepared with 1 execution engine] (microseconds): 86732 result
'i': 10000 testing equality on: 10000
As we can clearly see, Java can "cheat" on low equality tests (as we break
after finding the node as we assume (and know in our context) 'i' is
unique) but more interestingly even when it fails (checking i > 100k) it is
still roughly 2
times as fast as cypher both for initial queries and subsequent ones (this
is a good test for non-unique properties too as it forces java to iterate
through all of the nodes present).
This shows two things in my view:
1) Java can optimise for unique results. As far as I am aware cypher cannot
be told to stop when it finds a result we know is unique (such as an ISBN
of a book for example or any other unique property in a node).
2) For non-unique results (or for a failed query) it is still faster than
cypher.
After getting these results my curiosity prompted me to expand the scope by
adding relationships and a second attribute to see if the same trend
continues, and it did.
*Links to code snippets:*
first query:
https://gist.github.com/anonymous/9601553
second query:
https://gist.github.com/anonymous/9601556
third query:
https://gist.github.com/anonymous/9601571
fourth:
https://gist.github.com/anonymous/9601581
fifth:
https://gist.github.com/anonymous/9601590
entire result set link:
https://gist.github.com/anonymous/9601486
*Discussion:*
My main question is whether these results are to be expected (or even
obvious to some) which would mean I will just use native java in my
application.
If my results are wrong/misleading I would appreciate knowing why but if
not, a discussion on how to improve cypher to attempt to close the gap may
be useful.
Other notes and observations I had (non-conclusive as the tests performed
were not as thorough as the above):
using 'count' in cypher seems to destroy execution time (whereas normally
in sql it improves it).
adding depth to a cypher search (for example going from (a)-[]->(b) to
(a)-[]->(b)-[]->(c)) seems to scale a lot worst in cypher than java.
Thank you for reading this wall,
Costas
--
You received this message because you are subscribed to the Google Groups
"Neo4j" group.
To unsubscribe from this group and stop receiving emails from it, send an email
to [email protected].
For more options, visit https://groups.google.com/d/optout.