Hello, I am setting up an experiment to measure the performance of Neo4j in a large-scale web-service backend environment against a simplistic "Twitter-like" workload, similar to the experiments Aurelius performed for Titan, published here<http://thinkaurelius.com/2012/08/06/titan-provides-real-time-big-graph-data/> in 2012, and I was hoping to ask some questions about the right hardware and software setup for Neo4j. Here are the details of the experiment I plan to run:
*Experiment Description*

*Graph structure*: Users and tweets are nodes, connected by "follows" relationships (users following users), "tweet" relationships (between a user and their own tweets), and "stream" relationships (between a user and the tweets of the users they follow), as depicted in this GraphGist<http://neo4j-console-20.herokuapp.com/r/f14yl0>.

*Initial dataset size*: 640M user nodes, ~20B properties per node, and 24B "follows" relationships.

*Workload*: A simplistic Twitter-style workload in which simulated users read their tweet stream (~90% of transactions) and publish tweets (~10%). Reading a stream returns the 10 most recent tweets from the user's stream, sorted by timestamp. Publishing a tweet creates the tweet node in the graph with an edge to it from the tweeter, and creates "stream" edges from each of the user's followers to the new tweet (see the GraphGist above for a visual).

*Target performance*: a sustained ~2,000 write (publish-tweet) transactions per second and ~20,000 read (read-stream) transactions per second.

*Measurements*: minimum, maximum, and average transaction latencies, including standard deviation and 99.9th-percentile latencies, as well as achieved transaction throughput.

*Server spec*: Intel Xeon X3470 3.6GHz Turbo (4 cores, 8 threads), 24GB DDR3 RAM, 2x 128GB Crucial M4 SSDs (awaiting a capacity upgrade).

*Questions*

1. Approximately how many servers would I need to support the target performance?

2. I have read (http://docs.neo4j.org/chunked/stable/ha-haproxy.html) that an HAProxy load balancer is recommended to spread read load across the available Neo4j slaves, but HAProxy adds latency and (I imagine) has a throughput limit (which probably worsens with small transactions). Have the latency impact and throughput limits of an HAProxy server been measured before?

3. In a typical deployment for a large-scale website, would the frontend servers themselves figure out how to balance requests across the backend data store, instead of going through HAProxy (due to its latency impact and throughput limits)? I ask because we would like to measure Neo4j in the sort of environment and setup it would actually be used in for this type of application.

4. Are there any standard or recommended consistency settings for a deployment of this size?

5. Is there any recommended SSD hardware?

Thanks for any help and information!

Best,
Jonathan
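P.S. In case it helps clarify the workload, here is a rough sketch of the two transaction types as parameterized Cypher. The label, relationship, and property names (User, Tweet, follows, tweet, stream, id, timestamp) are my assumptions based on the GraphGist, not a confirmed schema.

```python
# Sketch of the two workload transactions as parameterized Cypher.
# Label/relationship/property names are assumptions from the GraphGist.

READ_STREAM = """
MATCH (u:User {id: $user_id})-[:stream]->(t:Tweet)
RETURN t
ORDER BY t.timestamp DESC
LIMIT 10
"""

PUBLISH_TWEET = """
MATCH (u:User {id: $user_id})
CREATE (u)-[:tweet]->(t:Tweet {text: $text, timestamp: $ts})
WITH u, t
MATCH (follower:User)-[:follows]->(u)
CREATE (follower)-[:stream]->(t)
"""

def read_stream_tx(user_id):
    """Return (query, params) for one read-stream transaction."""
    return READ_STREAM, {"user_id": user_id}

def publish_tweet_tx(user_id, text, ts):
    """Return (query, params) for one publish-tweet transaction."""
    return PUBLISH_TWEET, {"user_id": user_id, "text": text, "ts": ts}
```

The simulator would run these through the driver inside read and write transactions respectively, against the HA cluster.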
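P.P.S. On the measurement side, this is a minimal sketch of how I plan to aggregate per-transaction latencies into the statistics listed above (min, max, mean, standard deviation, 99.9th percentile), using a simple nearest-rank percentile over the sorted samples.

```python
import math
import statistics

def latency_stats(latencies_ms):
    """Aggregate per-transaction latencies (ms) into min, max, mean,
    sample standard deviation, and the 99.9th-percentile latency
    (nearest-rank method on the sorted sample)."""
    if not latencies_ms:
        raise ValueError("no samples")
    xs = sorted(latencies_ms)
    # Nearest-rank: the ceil(p * n)-th smallest sample, clamped to the range.
    idx = min(len(xs) - 1, max(0, math.ceil(0.999 * len(xs)) - 1))
    return {
        "min": xs[0],
        "max": xs[-1],
        "mean": statistics.mean(xs),
        "stddev": statistics.stdev(xs) if len(xs) > 1 else 0.0,
        "p99.9": xs[idx],
    }
```

At the target throughput the sample counts are large, so nearest-rank and interpolated percentiles should agree closely.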
