Here is the indexing performance I see: it takes 10 minutes 29 seconds to finish indexing 626K records using MapReduce (Pig). Is this the expected performance for a 4-node Elasticsearch cluster?
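As a quick sanity check, that is 626,283 records in 629 seconds (simple arithmetic, assuming the whole wall-clock time was spent indexing):

    $ echo '626283 / 629' | bc
    995

So roughly 1,000 documents per second across the whole cluster. If this run used the es.batch.size.entries = 1 setting from the script quoted further down, every document travels in its own bulk request, so the figure may reflect per-request round-trip overhead rather than what the cluster can actually index.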
Output(s):
Successfully stored 626283 records in: "index1/raw_data"

Counters:
Total records written : 626283
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled : 0
Total records proactively spilled : 0

On Saturday, May 23, 2015 at 12:24:50 PM UTC-7, Sudhir Rao wrote:
> I see the following in the elasticsearch logs:
>
>     stop throttling indexing: numMergesInFlight=4, maxNumMerges=5
>
> Indexing does run for a few million records, though, before all the
> mappers fail - please see the attached error screenshot.
>
> On Thursday, May 21, 2015 at 12:46:29 AM UTC-7, Allan Mitchell wrote:
>> Hi
>>
>> The error is a Grunt error, which suggests Pig is throwing it, not ES.
>> What do the Pig logs say? What makes you think ES is the issue?
>>
>> I know it works with smaller data, but that also means Pig works with
>> smaller data, not just ES.
>>
>> Allan
>>
>> On 21 May 2015 at 01:34, Sudhir Rao <ysu...@gmail.com> wrote:
>>> Hi all,
>>>
>>> I have a 4-node ES cluster running:
>>>
>>> Elasticsearch : 1.5.2
>>> OS : RHEL 6.x
>>> Java : 1.7
>>> CPU : 16 cores
>>> 2 machines : 60 GB RAM, 10 TB disk
>>> 2 machines : 120 GB RAM, 5 TB disk
>>>
>>> I also have a 500-node Hadoop cluster and am trying to index data from
>>> Hadoop which is in Avro format:
>>>
>>> Daily size : 1.2 TB
>>> Hourly size : 40-60 GB
>>>
>>> elasticsearch.yml config
>>> ========================
>>>
>>> cluster.name: zebra
>>> index.mapping.ignore_malformed: true
>>> index.merge.scheduler.max_thread_count: 1
>>> index.store.throttle.type: none
>>> index.refresh_interval: -1
>>> index.translog.flush_threshold_size: 1024000000
>>> discovery.zen.ping.unicast.hosts: ["node1","node2","node3","node4"]
>>> path.data: /hadoop01/es,/hadoop02/es,/hadoop03/es,/hadoop04/es,/hadoop05/es,/hadoop06/es,/hadoop07/es,/hadoop08/es,/hadoop09/es,/hadoop10/es,/hadoop11/es,/hadoop12/es
>>> bootstrap.mlockall: true
>>> indices.memory.index_buffer_size: 30%
>>> index.translog.flush_threshold_ops: 50000
>>> index.store.type: mmapfs
>>>
>>> Cluster health
>>> ==============
>>>
>>> $ curl -XGET 'http://localhost:9200/_cluster/health?pretty=true'
>>> {
>>>   "cluster_name" : "zebra",
>>>   "status" : "green",
>>>   "timed_out" : false,
>>>   "number_of_nodes" : 4,
>>>   "number_of_data_nodes" : 4,
>>>   "active_primary_shards" : 21,
>>>   "active_shards" : 22,
>>>   "relocating_shards" : 0,
>>>   "initializing_shards" : 0,
>>>   "unassigned_shards" : 0,
>>>   "number_of_pending_tasks" : 0
>>> }
>>>
>>> Pig script
>>> ==========
>>>
>>> avro_data = LOAD '$INPUT_PATH' USING AvroStorage();
>>>
>>> temp_projection = FOREACH avro_data GENERATE
>>>     our.own.udf.ToJsonString(headers, data) AS data;
>>>
>>> STORE temp_projection INTO 'fpti/raw_data' USING
>>>     org.elasticsearch.hadoop.pig.EsStorage (
>>>         'es.resource = fpti/raw_data',
>>>         'es.input.json = true',
>>>         'es.nodes = node1,node2,node3,node4',
>>>         'mapreduce.map.speculative = false',
>>>         'mapreduce.reduce.speculative = false',
>>>         'es.batch.size.bytes = 512mb',
>>>         'es.batch.size.entries = 1');
>>>
>>> When I run the above, there are around 300 mappers; none of them
>>> complete, and every time the job fails with the error below. Some
>>> documents do get indexed, though.
>>>
>>> Error:
>>>
>>> 2015-05-20 15:40:20,618 [main] ERROR org.apache.pig.tools.grunt.GruntParser -
>>> ERROR 2999: Unexpected internal error. Could not write all entries
>>> [1/8448] (maybe ES was overloaded?). Bailing out...
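That "maybe ES was overloaded?" error typically means some documents in a bulk request were still rejected after es-hadoop's configured retries. One thing that stands out in the script above: es.batch.size.entries = 1 turns every document into its own bulk request, and ~300 concurrent mappers issuing those against 4 nodes can saturate the bulk queues. A hypothetical reworking (not a verified fix; the retry options are standard es-hadoop settings, the values are guesses) would batch documents together and back off on rejection:

    STORE temp_projection INTO 'fpti/raw_data' USING
        org.elasticsearch.hadoop.pig.EsStorage (
            'es.resource = fpti/raw_data',
            'es.input.json = true',
            'es.nodes = node1,node2,node3,node4',
            'mapreduce.map.speculative = false',
            'mapreduce.reduce.speculative = false',
            'es.batch.size.bytes = 10mb',       -- guess; down from 512mb
            'es.batch.size.entries = 1000',     -- guess; up from 1
            'es.batch.write.retry.count = 5',   -- retry rejected bulk batches
            'es.batch.write.retry.wait = 30s'); -- back off between retries

Running fewer simultaneous writers (e.g. fewer, larger input splits) is the other obvious lever, since each mapper opens its own connections to the cluster.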
>>> The job does finish, however, when the data size is only a few
>>> thousand records.
>>>
>>> Please let me know what else I can do to increase my indexing
>>> throughput.
>>>
>>> regards
>>>
>>> #sudhir
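One way to answer Allan's question about whether ES or Pig is the bottleneck is to watch the bulk thread pools while the job runs; a diagnostic sketch, assuming the 1.x _cat API column names:

    $ curl 'http://localhost:9200/_cat/thread_pool?v&h=host,bulk.active,bulk.queue,bulk.rejected'

Climbing bulk.rejected counts during the job would support the "ES was overloaded" theory; if they stay at zero, the problem is more likely on the Pig side.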