Try disabling merge IO throttling, especially if your index is on SSDs. (It's on by default, at a paltry 20 MB/sec.) Merge IO throttling makes merges run slowly, so they eventually back up to the point where indexing itself must be throttled, which is what the "now throttling indexing" lines in your log show.
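For what it's worth, here is a rough sketch of how that could be done through the cluster update-settings API. It assumes the 1.x-era setting name `indices.store.throttle.type` (with `none` turning store throttling off) and a node reachable at `http://localhost:9200`; the helper names are made up for illustration, so check the setting against the docs for your version before relying on it.

```python
import json
import urllib.request


def throttle_settings(throttle_type="none", persistent=False):
    """Build the request body for the cluster update-settings API.

    Assumes the 1.x-era setting name `indices.store.throttle.type`;
    "none" is meant to switch store (merge) IO throttling off.
    """
    scope = "persistent" if persistent else "transient"
    return {scope: {"indices.store.throttle.type": throttle_type}}


def put_cluster_settings(body, host="http://localhost:9200"):
    """PUT the settings to one node. The host is an assumed address."""
    req = urllib.request.Request(
        host + "/_cluster/settings",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Usage against a live cluster (not executed here):
#   put_cluster_settings(throttle_settings())
```

A transient setting is lost on a full cluster restart, which makes it a low-risk way to try this out; pass `persistent=True` once you are happy with the effect.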
Also see the recent post about tuning to favor indexing throughput:
http://www.elasticsearch.org/blog/performance-considerations-elasticsearch-indexing/

Mike McCandless

http://blog.mikemccandless.com

On Thu, Sep 18, 2014 at 4:54 AM, <[email protected]> wrote:

> Setup:
> 4 nodes
> Replication = 0
> ES_HEAP_SIZE = 75GB
> Number of Indices = 59 (using logstash, one index per month)
> Total shards = 234 (each index is 4 shards, one per node)
> Total docs = 7.4 billion
> Total size = 4.7TB
>
> When I add a new file, which I do using logstash on all four nodes, the
> indexing immediately throttles. For instance:
>
> [2014-09-18 09:41:42,326][INFO ][index.engine.internal ] [hdp13] [logstash-2014.09][2] stop throttling indexing: numMergesInFlight=4, maxNumMerges=5
> [2014-09-18 09:41:45,267][INFO ][index.engine.internal ] [hdp13] [logstash-2014.09][2] now throttling indexing: numMergesInFlight=6, maxNumMerges=5
> [2014-09-18 09:41:45,303][INFO ][index.engine.internal ] [hdp13] [logstash-2014.09][2] stop throttling indexing: numMergesInFlight=4, maxNumMerges=5
> [2014-09-18 09:41:51,273][INFO ][index.engine.internal ] [hdp13] [logstash-2014.09][2] now throttling indexing: numMergesInFlight=6, maxNumMerges=5
> [2014-09-18 09:41:51,379][INFO ][index.engine.internal ] [hdp13] [logstash-2014.09][2] stop throttling indexing: numMergesInFlight=4, maxNumMerges=5
> [2014-09-18 09:42:06,429][INFO ][index.engine.internal ] [hdp13] [logstash-2014.09][2] now t
>
> Where should I be looking to tune the indexing performance? The query
> load on the cluster is very low as it is a research cluster, so I would
> sacrifice query performance for indexing.
>
> The 4 nodes all run logstash, listening on various ports. I use netcat to
> 'feed' the data to the 4 nodes from a hadoop cluster.
>
> hadoop1 netcat -------->
> hadoop2 netcat --------> ES1
> hadoop3 netcat -------->
>
> And so on.
>
> Each ES node has 24 disks but I am only using one at the moment. This is
> an obvious IO bottleneck, but I am unclear how to use all the disks. If I add
> more disks, will ES share the data between them all? E.g. /mnt/disk1,
> /mnt/disk2, etc.
>
> Thanks

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/CAD7smRdJwXcsq%2BdUpyMZ%3D2UZsDbGwX7CEeE91L_rFan1FP6bDw%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.
