Hi guys, I have 4 nodes in my ES cluster. They are big boxes with 24GB of memory and 32TB of disk each. ES is configured with a 12GB heap; I have done extensive testing and am happy with the current implementation.
This ES cluster is connected to my Hadoop cluster over good 10GbE links throughout. The Hadoop cluster has 12 nodes, and I use Logstash to move historical logs off the Hadoop cluster into ES.

Given these assumptions:

- I have lots of disk space per machine, so I do not expect to run out of disk.
- The user query load is very light; this is for ad hoc research, not production.
- I will have several years of data, so I was planning on one index per month, e.g. logstash-2014.09.
- I do not care much about replication, as all the data is on the Hadoop cluster; on failure I will simply re-index.

Question: how many shards should I aim for per index? I was thinking of FOUR per index, on the assumption that that works out to ONE shard per node. When I load the data from Hadoop, I do it via a streaming map-reduce job using the Logstash netcat route, with 3 Hadoop nodes pointing at 1 ES node. For this reason, 1 shard per node seems a good idea? E.g.:

hadoop1 streaming Mapper ----> logstash on hadoop ----> netcat ----> ES node1
hadoop2 streaming Mapper ----> logstash on hadoop ----> netcat ----> ES node1
hadoop3 streaming Mapper ----> logstash on hadoop ----> netcat ----> ES node1
hadoop4 streaming Mapper ----> logstash on hadoop ----> netcat ----> ES node2
hadoop5 streaming Mapper ----> logstash on hadoop ----> netcat ----> ES node2
hadoop6 streaming Mapper ----> logstash on hadoop ----> netcat ----> ES node2
...and so on

(I have sketched the index template and the per-node Logstash config I have in mind below.)

Thanks
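For concreteness, here is roughly the index template I was planning, to pin each monthly index at 4 primary shards (one per ES node) with no replicas. Just a sketch: the template name and the es-node1 host are placeholders, and I am assuming the ES 1.x _template API:

# One-time setup, run before the first monthly index is created.
# Matches every index named logstash-*, e.g. logstash-2014.09.
curl -XPUT 'http://es-node1:9200/_template/logstash_monthly' -d '
{
  "template": "logstash-*",
  "settings": {
    "number_of_shards": 4,
    "number_of_replicas": 0
  }
}'

I would install this once up front; every new monthly index would then pick up the 4-shard / 0-replica settings automatically.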
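And this is a minimal sketch of the Logstash config I have in mind for each Hadoop node, assuming a tcp input fed by netcat and the elasticsearch_http output pinned to one ES node. The port, the es-node1 host, and the line codec are placeholders for whatever the mapper actually emits:

input {
  tcp {
    port  => 5555       # netcat pipes the streaming mapper's output to this port
    codec => "line"     # assumes one event per line of mapper output
  }
}

output {
  elasticsearch_http {
    host  => "es-node1"               # each group of 3 Hadoop nodes points at 1 ES node
    index => "logstash-%{+YYYY.MM}"   # monthly indices, matching logstash-2014.09 above
  }
}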
