Hello,

I am setting up an experiment to measure the performance of Neo4j as a 
large-scale web-service backend against a simplistic "Twitter-like" 
workload, similar to the experiments Aurelius performed for TitanDB, 
published 
here<http://thinkaurelius.com/2012/08/06/titan-provides-real-time-big-graph-data/> 
in 2012, and I was hoping to ask some questions about the right hardware 
and software setup for Neo4j. Here are the details of the experiment I 
plan to run:

*Experiment Description*
*Graph structure description*: Users and tweets are nodes, connected by 
"follows" (users following users), "tweet" (between a user and their 
personal tweets), and "stream" (between a user and the tweets of their 
friends) relationships, as depicted in this 
GraphGist<http://neo4j-console-20.herokuapp.com/r/f14yl0>.

*Initial dataset size*: 640M user nodes, ~20B properties per node, and 24B 
"follows" relationships. 

*Workload*: A simplistic Twitter-style workload, where simulated users 
read their tweet stream (~90% of the time) and publish tweets (~10% of the 
time). Reading a tweet stream involves fetching the 10 most recent tweets 
from the user's stream, sorted by timestamp. Publishing a tweet involves 
creating the tweet node in the graph with an edge to it from the tweeter, 
along with creating "stream" edges from the user's followers to the tweet 
(see the GraphGist above for a visual).

*Target Performance*: sustained ~2,000 write (publish-tweet) transactions 
per second and ~20,000 read (read-stream) transactions per second.
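For a sense of what the fan-out-on-write model above implies at the target rate, here is a back-of-envelope calculation based on the dataset figures in this mail. It assumes the average follower count equals total "follows" relationships divided by users, i.e. a uniform distribution (real follower distributions are heavily skewed, so this is only a rough average):

```python
# Back-of-envelope write amplification for the fan-out-on-write model
# described above. Assumes avg followers = total follows / users
# (a simplification; real distributions are heavily skewed).

USERS = 640_000_000          # initial user nodes
FOLLOWS = 24_000_000_000     # "follows" relationships
TWEET_TPS = 2_000            # target publish-tweet transactions/sec

avg_followers = FOLLOWS / USERS            # ~37.5
# Each published tweet creates 1 tweet node, 1 "tweet" edge, and one
# "stream" edge per follower of the tweeter.
writes_per_tweet = 1 + 1 + avg_followers
write_ops_per_sec = TWEET_TPS * writes_per_tweet

print(f"avg followers per user: {avg_followers:.1f}")
print(f"graph writes per published tweet: {writes_per_tweet:.1f}")
print(f"sustained write ops/sec at {TWEET_TPS} tps: {write_ops_per_sec:,.0f}")
```

So ~2,000 tweet transactions per second would translate into roughly 79,000 node/edge creations per second under this simplification, which is the number the write path actually has to sustain.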

*Measurements*: Minimum, maximum, and average transaction latencies, 
including standard deviation and 99.9th-percentile latency, as well as 
achieved transaction throughput.
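Concretely, the per-run summary I have in mind looks like the following sketch, which computes those statistics from a list of per-transaction latencies (the sample values are made up for illustration; a nearest-rank method is used for the 99.9th percentile):

```python
# Sketch of the latency summary reported per run, from a list of
# per-transaction latencies in milliseconds.
import statistics

def summarize(latencies_ms):
    xs = sorted(latencies_ms)
    n = len(xs)
    # Nearest-rank 99.9th percentile.
    p999 = xs[min(n - 1, int(0.999 * n))]
    return {
        "min": xs[0],
        "max": xs[-1],
        "avg": statistics.mean(xs),
        "stdev": statistics.stdev(xs),
        "p99.9": p999,
    }

sample = [1.2, 0.8, 1.5, 2.0, 0.9, 1.1, 50.0, 1.3]  # illustrative only
print(summarize(sample))
```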

*Server Spec*: Intel Xeon X3470 3.6GHz Turbo (4 cores, 8 threads), 24GB 
DDR3 RAM, 2x128GB Crucial M4 (waiting for capacity upgrade)
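To put the server spec next to the dataset size, here is a rough on-disk store sizing using the fixed record sizes I believe were documented for the Neo4j 1.x/2.x store format (node 14 B, relationship 33 B); treat these numbers as assumptions to be checked against the manual for the version under test, and note that property storage is omitted:

```python
# Rough on-disk store sizing. Record sizes below are assumptions taken
# from the Neo4j 1.x/2.x manual (node 14 B, relationship 33 B) -- verify
# against the docs for the version actually deployed.
NODE_B, REL_B = 14, 33

nodes = 640_000_000          # user nodes in the initial dataset
rels = 24_000_000_000        # "follows" relationships

node_store_gb = nodes * NODE_B / 1e9
rel_store_gb = rels * REL_B / 1e9

print(f"node store:         ~{node_store_gb:,.1f} GB")
print(f"relationship store: ~{rel_store_gb:,.0f} GB")
```

If those record sizes are right, the relationship store alone is on the order of 800 GB before properties, so with 24 GB of RAM per box the working set will be far larger than memory, which bears on question 1 below.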

*Questions*

   1. Approximately how many servers would I need to support the target 
   performance?
   2. I have read (http://docs.neo4j.org/chunked/stable/ha-haproxy.html) 
   that an HAProxy load balancer is recommended to spread read load across 
   the available Neo4j slaves, but HAProxy will add latency and (I imagine) 
   has a throughput limit (which probably gets worse with small 
   transactions). Have the latency impact and throughput limits of an 
   HAProxy server been measured before?
   3. In a typical deployment for a large-scale website, would the 
   frontend servers typically load-balance requests across the backend 
   data store themselves, rather than going through HAProxy (due to the 
   latency impact and throughput limits)? I ask because we would like to 
   measure Neo4j in the sort of environment and setup it would actually be 
   used in for this type of application.
   4. Are there any standard or recommended consistency settings for this 
   size of deployment?
   5. Is there any recommended SSD hardware?

Thanks for any help and information!

Best,
Jonathan
