Hi all,

I have a 4-node ES cluster running:

Elasticsearch: 1.5.2
OS: RHEL 6.x
Java: 1.7
CPU: 16 cores
2 machines: 60 GB RAM, 10 TB disk
2 machines: 120 GB RAM, 5 TB disk


I also have a 500-node Hadoop cluster and am trying to index data from
Hadoop that is stored in Avro format.

Daily size: 1.2 TB
Hourly size: 40-60 GB
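
That works out to roughly 14 MB/s sustained (1.2 TB over 86,400 s), with hourly peaks of
around 11-17 MB/s, spread across the 4 data nodes.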


elasticsearch.yml config
==================

cluster.name: zebra
index.mapping.ignore_malformed: true
index.merge.scheduler.max_thread_count: 1
index.store.throttle.type: none
index.refresh_interval: -1
index.translog.flush_threshold_size: 1024000000
discovery.zen.ping.unicast.hosts: ["node1","node2","node3","node4"]
path.data: /hadoop01/es,/hadoop02/es,/hadoop03/es,/hadoop04/es,/hadoop05/es,/hadoop06/es,/hadoop07/es,/hadoop08/es,/hadoop09/es,/hadoop10/es,/hadoop11/es,/hadoop12/es
bootstrap.mlockall: true
indices.memory.index_buffer_size: 30%
index.translog.flush_threshold_ops: 50000
index.store.type: mmapfs
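
For reference, the dynamic subset of these per-index settings (refresh_interval and the
translog thresholds) could, as far as I understand, also be pushed to an existing index via
the update-settings API. A rough sketch, assuming the target index is fpti as used in the
Pig script below:

$ curl -XPUT 'http://localhost:9200/fpti/_settings' -d '{
  "index.refresh_interval" : "-1",
  "index.translog.flush_threshold_ops" : 50000,
  "index.translog.flush_threshold_size" : "1024000000"
}'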


Cluster health
============

$ curl -XGET 'http://localhost:9200/_cluster/health?pretty=true'
{ 
"cluster_name" : "zebra", 
"status" : "green", 
"timed_out" : false, 
"number_of_nodes" : 4, 
"number_of_data_nodes" : 4, 
"active_primary_shards" : 21, 
"active_shards" : 22, 
"relocating_shards" : 0, 
"initializing_shards" : 0, 
"unassigned_shards" : 0, 
"number_of_pending_tasks" : 0 
}


Pig Script:
========

avro_data = LOAD '$INPUT_PATH' USING AvroStorage();

temp_projection = FOREACH avro_data GENERATE
    our.own.udf.ToJsonString(headers, data) AS data;

STORE temp_projection INTO 'fpti/raw_data' USING
    org.elasticsearch.hadoop.pig.EsStorage(
        'es.resource=fpti/raw_data',
        'es.input.json=true',
        'es.nodes=node1,node2,node3,node4',
        'mapreduce.map.speculative=false',
        'mapreduce.reduce.speculative=false',
        'es.batch.size.bytes=512mb',
        'es.batch.size.entries=1'
    );

When I run the above, around 300 mappers are launched, but none of them complete, and the
job fails every time with the error below. Some documents do get indexed, though.

Error:

2015-05-20 15:40:20,618 [main] ERROR org.apache.pig.tools.grunt.GruntParser - ERROR 2999:
Unexpected internal error. Could not write all entries [1/8448] (maybe ES was overloaded?).
Bailing out...

The job does finish, however, when the data size is only a few thousand documents.
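
In case it is relevant, one rough way I could check whether the data nodes are rejecting
bulk requests while the job runs (using the cat thread_pool API and its default bulk
columns):

$ curl -XGET 'http://localhost:9200/_cat/thread_pool?v&h=host,bulk.active,bulk.queue,bulk.rejected'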


Please let me know what else I can do to increase my indexing throughput.


regards

#sudhir
