Re: Design advice for ES side-by-side with hadoop cluster?

Costin Leau Mon, 01 Sep 2014 09:18:06 -0700

On 9/1/14 4:51 PM, [email protected] wrote:

Hi Guys,


I have a 16 node hadoop cluster, running Cloudera's community edition. All 16 
nodes are big powerful boxes with lots of
disk.

Can you provide some actual numbers? How much RAM per machine - how much is allocated to Hadoop, how much to ES and how much is actually free (and thus usable by the OS)?
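
To show the kind of breakdown I mean, here is a minimal sketch of the per-node arithmetic. The function name and the figures (64 GB node, 32 GB for Hadoop, 16 GB ES heap) are made up for illustration; plug in your own numbers.

```python
def free_for_os_cache(total_gb, hadoop_gb, es_heap_gb):
    """Return the RAM (in GB) left for the OS and its file-system cache on one node."""
    free = total_gb - hadoop_gb - es_heap_gb
    if free < 0:
        # The allocations add up to more than the machine physically has.
        raise ValueError("Hadoop and ES are allocated more RAM than the node has")
    return free


# Hypothetical node: 64 GB total, 32 GB for Hadoop task JVMs, 16 GB ES heap.
print(free_for_os_cache(64, 32, 16))  # 16
```

Whatever is left over is what the OS can use for the file-system cache, so a result close to zero is a warning sign.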

I would like to add ES to this cluster, but would like advice on how to 
configure/design the ES cluster.

There's a lot of useful information out there - I'll point to two great webinars, namely the pre-flight checklist [1] and getting started with Elasticsearch [2].

Especially in I/O intensive environments, make sure the OS has enough RAM and that the file-system cache is not thrashed, since it has a big impact (not just on ES but on everything that accesses the disk).


I bulk load my data using PIG, which means Map-Reduce. What are the thoughts on 
reducers against ES master nodes? Should
I restrict my reducers to match ES master nodes?

Are you using Elasticsearch Hadoop? I'm asking since it's not the master nodes that matter but rather the data nodes. es-hadoop automatically writes only to those nodes. Depending on how big your bulk size is and the number of reducers vs your cluster size, you might be forced to limit the number of tasks to avoid overloading the ES cluster.
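
As a rough way to reason about it: every writing task buffers its own bulk request, so the worst-case load on ES is the bulk size multiplied by the number of concurrent tasks. A minimal sketch of that arithmetic follows; the helper names, the 10 MB bulk size and the 100 MB budget are assumptions for illustration, not recommended values.

```python
MB = 1024 * 1024


def peak_bulk_bytes(num_tasks, batch_size_bytes):
    """Worst-case bytes in flight if every task flushes a full bulk at once."""
    return num_tasks * batch_size_bytes


def max_tasks(budget_bytes, batch_size_bytes):
    """Largest number of concurrent writers that stays within the budget (at least 1)."""
    return max(1, budget_bytes // batch_size_bytes)


# Hypothetical job: 64 reducers, each sending 10 MB bulks.
print(peak_bulk_bytes(64, 10 * MB) // MB)  # 640

# Hypothetical budget: the cluster copes with about 100 MB of bulk data at a time.
print(max_tasks(100 * MB, 10 * MB))  # 10
```

If the peak is more than the cluster can absorb, either reduce the number of tasks or shrink the bulk size until it fits.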

Any thoughts or advice? At the moment my standard MR parameters kill the ES 
nodes.

See above - how are you writing the data to ES? How many shards do you have in the target index and what's the number of reducers writing to it at a certain point? Marvel by the way (or any monitoring tool) helps a lot here since it eliminates guesses and actually offers insight into what's going on.

From the looks of it, it sounds like you are throwing too much data at once at the ES cluster and not retrying or adjusting the bulk.
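
If you are writing to ES with your own code, retrying could look roughly like the sketch below: re-send only the rejected documents and wait longer before each attempt. `send_bulk` is a hypothetical callable (not an ES or es-hadoop API) that takes a list of documents and returns the ones the cluster rejected; the retry count and wait times are assumptions as well.

```python
import time


def send_with_retry(send_bulk, batch, max_retries=3, base_wait=1.0, sleep=time.sleep):
    """Send `batch`, re-sending only rejected documents with exponential backoff.

    Returns the documents that are still rejected after all retries
    (an empty list means everything was accepted).
    """
    pending = list(batch)
    for attempt in range(max_retries + 1):
        pending = send_bulk(pending)  # hypothetical: returns the rejected docs
        if not pending:
            return []
        if attempt < max_retries:
            # Back off 1x, 2x, 4x, ... the base wait to give the cluster room.
            sleep(base_wait * (2 ** attempt))
    return pending
```

Anything returned at the end was rejected on every attempt, which usually means the bulk size or the number of writers needs to come down rather than the retry count go up.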

Otherwise, consider using es-hadoop. Run a job, take a look at the metrics [3] and tune it accordingly. See also the troubleshooting page [4].


One last thing - make sure you use the latest ES - it has a _lot_ of 
improvements.

[1] http://www.elasticsearch.org/webinars/elasticsearch-pre-flight-checklist/
[2] http://www.elasticsearch.org/webinars/getting-started-with-elasticsearch/
[3] http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/2.1.Beta/metrics.html
[4] http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/2.1.Beta/troubleshooting.html

Thanks


--
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to
[email protected] 
<mailto:[email protected]>.
To view this discussion on the web visit
https://groups.google.com/d/msgid/elasticsearch/2d541d83-5a79-4bc1-b5da-11065b9b568a%40googlegroups.com
<https://groups.google.com/d/msgid/elasticsearch/2d541d83-5a79-4bc1-b5da-11065b9b568a%40googlegroups.com?utm_medium=email&utm_source=footer>.
For more options, visit https://groups.google.com/d/optout.


--
Costin
