Here is the indexing performance I see: it takes 10 minutes 29 seconds to finish indexing 626K records using MapReduce (Pig). Is this the expected performance for a 4-node Elasticsearch cluster?
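As a quick sanity check, that is 626,283 records in 629 seconds (simple arithmetic, assuming the whole wall-clock time was spent indexing):

    $ echo '626283 / 629' | bc
    995

So roughly 1,000 documents per second across the whole cluster. If this run used the es.batch.size.entries = 1 setting from the script quoted further down, every document travels in its own bulk request, so the figure may reflect per-request round-trip overhead rather than what the cluster can actually index.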
Output(s):
Successfully stored 626283 records in: "index1/raw_data"

Counters:
Total records written : 626283
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled : 0
Total records proactively spilled : 0

On Saturday, May 23, 2015 at 12:24:50 PM UTC-7, Sudhir Rao wrote:
> I see the following in the elasticsearch logs:
>
>     stop throttling indexing: numMergesInFlight=4, maxNumMerges=5
>
> Indexing does run for a few million records, though, before all the
> mappers fail - please see the attached error screenshot.
>
> On Thursday, May 21, 2015 at 12:46:29 AM UTC-7, Allan Mitchell wrote:
>> Hi
>>
>> The error is a Grunt error, which suggests Pig is throwing it, not ES.
>> What do the Pig logs say? What makes you think ES is the issue?
>>
>> I know it works with smaller data, but that also means Pig works with
>> smaller data, not just ES.
>>
>> Allan
>>
>> On 21 May 2015 at 01:34, Sudhir Rao <ysu...@gmail.com> wrote:
>>> Hi all,
>>>
>>> I have a 4-node ES cluster running:
>>>
>>> Elasticsearch : 1.5.2
>>> OS : RHEL 6.x
>>> Java : 1.7
>>> CPU : 16 cores
>>> 2 machines : 60 GB RAM, 10 TB disk
>>> 2 machines : 120 GB RAM, 5 TB disk
>>>
>>> I also have a 500-node Hadoop cluster and am trying to index data from
>>> Hadoop which is in Avro format:
>>>
>>> Daily size : 1.2 TB
>>> Hourly size : 40-60 GB
>>>
>>> elasticsearch.yml config
>>> ========================
>>>
>>> cluster.name: zebra
>>> index.mapping.ignore_malformed: true
>>> index.merge.scheduler.max_thread_count: 1
>>> index.store.throttle.type: none
>>> index.refresh_interval: -1
>>> index.translog.flush_threshold_size: 1024000000
>>> discovery.zen.ping.unicast.hosts: ["node1","node2","node3","node4"]
>>> path.data: /hadoop01/es,/hadoop02/es,/hadoop03/es,/hadoop04/es,/hadoop05/es,/hadoop06/es,/hadoop07/es,/hadoop08/es,/hadoop09/es,/hadoop10/es,/hadoop11/es,/hadoop12/es
>>> bootstrap.mlockall: true
>>> indices.memory.index_buffer_size: 30%
>>> index.translog.flush_threshold_ops: 50000
>>> index.store.type: mmapfs
>>>
>>> Cluster health
>>> ==============
>>>
>>> $ curl -XGET 'http://localhost:9200/_cluster/health?pretty=true'
>>> {
>>>   "cluster_name" : "zebra",
>>>   "status" : "green",
>>>   "timed_out" : false,
>>>   "number_of_nodes" : 4,
>>>   "number_of_data_nodes" : 4,
>>>   "active_primary_shards" : 21,
>>>   "active_shards" : 22,
>>>   "relocating_shards" : 0,
>>>   "initializing_shards" : 0,
>>>   "unassigned_shards" : 0,
>>>   "number_of_pending_tasks" : 0
>>> }
>>>
>>> Pig script
>>> ==========
>>>
>>> avro_data = LOAD '$INPUT_PATH' USING AvroStorage();
>>>
>>> temp_projection = FOREACH avro_data GENERATE
>>>     our.own.udf.ToJsonString(headers, data) AS data;
>>>
>>> STORE temp_projection INTO 'fpti/raw_data' USING
>>>     org.elasticsearch.hadoop.pig.EsStorage (
>>>         'es.resource = fpti/raw_data',
>>>         'es.input.json = true',
>>>         'es.nodes = node1,node2,node3,node4',
>>>         'mapreduce.map.speculative = false',
>>>         'mapreduce.reduce.speculative = false',
>>>         'es.batch.size.bytes = 512mb',
>>>         'es.batch.size.entries = 1');
>>>
>>> When I run the above, there are around 300 mappers; none of them
>>> complete, and every time the job fails with the error below. Some
>>> documents do get indexed, though.
>>>
>>> Error:
>>>
>>> 2015-05-20 15:40:20,618 [main] ERROR org.apache.pig.tools.grunt.GruntParser -
>>> ERROR 2999: Unexpected internal error. Could not write all entries
>>> [1/8448] (maybe ES was overloaded?). Bailing out...
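That "maybe ES was overloaded?" error typically means some documents in a bulk request were still rejected after es-hadoop's configured retries. One thing that stands out in the script above: es.batch.size.entries = 1 turns every document into its own bulk request, and ~300 concurrent mappers issuing those against 4 nodes can saturate the bulk queues. A hypothetical reworking (not a verified fix; the retry options are standard es-hadoop settings, the values are guesses) would batch documents together and back off on rejection:

    STORE temp_projection INTO 'fpti/raw_data' USING
        org.elasticsearch.hadoop.pig.EsStorage (
            'es.resource = fpti/raw_data',
            'es.input.json = true',
            'es.nodes = node1,node2,node3,node4',
            'mapreduce.map.speculative = false',
            'mapreduce.reduce.speculative = false',
            'es.batch.size.bytes = 10mb',       -- guess; down from 512mb
            'es.batch.size.entries = 1000',     -- guess; up from 1
            'es.batch.write.retry.count = 5',   -- retry rejected bulk batches
            'es.batch.write.retry.wait = 30s'); -- back off between retries

Running fewer simultaneous writers (e.g. fewer, larger input splits) is the other obvious lever, since each mapper opens its own connections to the cluster.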
>>> The job does finish, however, when the data size is only a few
>>> thousand records.
>>>
>>> Please let me know what else I can do to increase my indexing
>>> throughput.
>>>
>>> regards
>>>
>>> #sudhir
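One way to answer Allan's question about whether ES or Pig is the bottleneck is to watch the bulk thread pools while the job runs; a diagnostic sketch, assuming the 1.x _cat API column names:

    $ curl 'http://localhost:9200/_cat/thread_pool?v&h=host,bulk.active,bulk.queue,bulk.rejected'

Climbing bulk.rejected counts during the job would support the "ES was overloaded" theory; if they stay at zero, the problem is more likely on the Pig side.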