Hi,

The error is a Grunt error, which suggests Pig is throwing it, not ES. What do the Pig logs say? What makes you think ES is the issue?
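One quick way to check the ES side is to look for bulk thread-pool rejections on each node (this assumes you can curl any node on port 9200; a growing "rejected" count would mean ES really is pushing back on indexing):

```shell
# List per-node bulk thread-pool stats; non-zero/growing bulk.rejected
# means ES is refusing bulk requests under load.
curl -s 'http://localhost:9200/_cat/thread_pool?v&h=host,bulk.active,bulk.queue,bulk.rejected'
```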
I know it works with smaller data, but that exercises Pig with smaller data too, not just ES, so it doesn't isolate the problem to ES.

Allan

On 21 May 2015 at 01:34, Sudhir Rao <ysud...@gmail.com> wrote:
> Hi all,
>
> I have a 4 node ES cluster running:
>
> ElasticSearch : 1.5.2
> OS : RHEL 6.x
> Java : 1.7
> CPU : 16 cores
> 2 machines : 60 GB RAM, 10 TB disk
> 2 machines : 120 GB RAM, 5 TB disk
>
> I also have a 500 node Hadoop cluster and am trying to index data from
> Hadoop which is in Avro format.
>
> Daily size : 1.2 TB
> Hourly size : 40-60 GB
>
> elasticsearch.yml config
> ==================
>
> cluster.name: zebra
> index.mapping.ignore_malformed: true
> index.merge.scheduler.max_thread_count: 1
> index.store.throttle.type: none
> index.refresh_interval: -1
> index.translog.flush_threshold_size: 1024000000
> discovery.zen.ping.unicast.hosts: ["node1","node2","node3","node4"]
> path.data: /hadoop01/es,/hadoop02/es,/hadoop03/es,/hadoop04/es,/hadoop05/es,/hadoop06/es,/hadoop07/es,/hadoop08/es,/hadoop09/es,/hadoop10/es,/hadoop11/es,/hadoop12/es
> bootstrap.mlockall: true
> indices.memory.index_buffer_size: 30%
> index.translog.flush_threshold_ops: 50000
> index.store.type: mmapfs
>
> Cluster Settings
> ============
>
> $ curl -XGET 'http://localhost:9200/_cluster/health?pretty=true'
> {
>   "cluster_name" : "zebra",
>   "status" : "green",
>   "timed_out" : false,
>   "number_of_nodes" : 4,
>   "number_of_data_nodes" : 4,
>   "active_primary_shards" : 21,
>   "active_shards" : 22,
>   "relocating_shards" : 0,
>   "initializing_shards" : 0,
>   "unassigned_shards" : 0,
>   "number_of_pending_tasks" : 0
> }
>
> Pig Script:
> ========
>
> avro_data = LOAD '$INPUT_PATH' USING AvroStorage();
>
> temp_projection = FOREACH avro_data GENERATE
>     our.own.udf.ToJsonString(headers,data) as data;
>
> STORE temp_projection INTO 'fpti/raw_data' USING
>     org.elasticsearch.hadoop.pig.EsStorage(
>         'es.resource = fpti/raw_data', 'es.input.json=true',
>         'node1,node2,node3,node4',
>         'mapreduce.map.speculative=false', 'mapreduce.reduce.speculative=false',
>         'es.batch.size.bytes=512mb', 'es.batch.size.entries=1');
>
> When I run the above, there are around 300 mappers; none of them complete, and the job fails every time with the error below. Some documents do get indexed, though.
>
> *Error:*
>
> *2015-05-20 15:40:20,618 [main] ERROR org.apache.pig.tools.grunt.GruntParser - ERROR 2999: Unexpected internal error. Could not write all entries [1/8448] (maybe ES was overloaded?). Bailing out...*
>
> The job does finish, however, when the data size is only a few thousand entries.
>
> Please let me know what else I can do to increase my indexing throughput.
>
> regards
>
> #sudhir
>
> --
> Please update your bookmarks! We have moved to https://discuss.elastic.co/
> ---
> You received this message because you are subscribed to the Google Groups "elasticsearch" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscr...@googlegroups.com.
> To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/6312a8b6-bde7-40d6-bbf0-8b3fccf7cd12%40googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.
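A side note on the STORE statement above: EsStorage options are normally passed as 'key=value' strings (the bare 'node1,node2,node3,node4' argument has no key, so the node list is likely being ignored), and es.batch.size.entries=1 flushes after every document. A hedged sketch using es-hadoop's documented option names and default-sized batches (values here are illustrative, not a drop-in fix for this cluster):

```pig
-- Illustrative tuning sketch: name the node list under es.nodes and use
-- moderate bulk batches; exact values depend on cluster capacity.
STORE temp_projection INTO 'fpti/raw_data' USING
    org.elasticsearch.hadoop.pig.EsStorage(
        'es.nodes=node1,node2,node3,node4',
        'es.input.json=true',
        'es.batch.size.bytes=1mb',
        'es.batch.size.entries=1000',
        'es.batch.write.retry.count=3');
```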