On the index settings side, you can dynamically disable refresh (set index.refresh_interval to -1) and reduce the number of replicas (index.number_of_replicas, down to 0) for the duration of the bulk import, then restore both once the import finishes.
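A minimal sketch of that settings change, using only the Python standard library against the index update-settings endpoint. The host `http://localhost:9200` and the index name `addresses` are placeholders, not values from this thread, and the "restore" values assume the stock defaults (1s refresh, 1 replica); substitute whatever your index used before the import.

```python
import json
import urllib.request


def bulk_load_settings():
    """Settings to apply before the import: refresh disabled, no replicas."""
    return {"index": {"refresh_interval": "-1", "number_of_replicas": 0}}


def restored_settings(refresh_interval="1s", replicas=1):
    """Settings to apply after the import (assumed defaults; adjust as needed)."""
    return {
        "index": {
            "refresh_interval": refresh_interval,
            "number_of_replicas": replicas,
        }
    }


def build_settings_request(base_url, index, settings):
    """Build a PUT <base_url>/<index>/_settings request without sending it."""
    url = "{}/{}/_settings".format(base_url.rstrip("/"), index)
    body = json.dumps(settings).encode("utf-8")
    request = urllib.request.Request(url, data=body, method="PUT")
    request.add_header("Content-Type", "application/json")
    return request


def put_settings(base_url, index, settings):
    """Send the settings update and return the parsed JSON response."""
    request = build_settings_request(base_url, index, settings)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# Intended usage around the bulk job (hypothetical host and index name):
#   put_settings("http://localhost:9200", "addresses", bulk_load_settings())
#   ... run the MapReduce bulk load ...
#   put_settings("http://localhost:9200", "addresses", restored_settings())
```

Restoring the replica count after the load means the replicas are built by copying the finished shards rather than by indexing every document a second time during the import.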
Described here: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/indices-update-settings.html#bulk

On Wed, Nov 19, 2014 at 2:53 AM, <[email protected]> wrote:

> Hello,
>
> I'm trying to do a bulk load of ~10M JSON docs (12.8Gb) with some
> geographical information into an elasticsearch index. With our current
> params, the loading is taking around 20-25 minutes to run, but we think it
> should be faster. Are these numbers similar to what other users are
> getting? Do you have any hints on how to get better performance? Any help
> will be appreciated. Please find the details below.
>
> Our ES cluster is version 1.1.1 with 11 nodes, and we are using
> Elasticsearch-MapReduce libraries 2.0.2 to do the bulk-load, setting the
> numbers of reducers to 11. Other params we use are:
>
> es.input.json=true
> es.mapping.id=id
> es.batch.size.bytes=10M
> es.batch.size.entries=10000
>
> The average doc size is 1.3Kb, and each doc contains a "bbox" field with
> the shape definition like this:
>
> "bbox": {
>   "type": "envelope",
>   "coordinates": [
>     [
>       -77.08488844489459,
>       38.9502995339637
>     ],
>     [
>       -77.0844224567727,
>       38.9502305534064
>     ]
>   ]
> }
>
> We are using the following mapping for this index, because these are the 3
> fields of our docs we are more interested in:
>
> {
>   "properties": {
>     "bbox": {
>       "precision": "10m",
>       "tree": "quadtree",
>       "type": "geo_shape"
>     },
>     "id": {
>       "type": "string",
>       "index": "not_analyzed"
>     },
>     "streets": {
>       "type": "string"
>     }
>   }
> }
>
> This is a typical output of the MapReduce job:
>
> 14/11/17 09:05:44 INFO mapred.JobClient: Elasticsearch Hadoop Counters
> 14/11/17 09:05:44 INFO mapred.JobClient: Bulk Retries=0
> 14/11/17 09:05:44 INFO mapred.JobClient: Bulk Retries Total Time(ms)=0
> 14/11/17 09:05:44 INFO mapred.JobClient: Bulk Total=1375
> 14/11/17 09:05:44 INFO mapred.JobClient: Bulk Total Time(ms)=11714959
> 14/11/17 09:05:44 INFO mapred.JobClient: Bytes Accepted=14351811146
> 14/11/17 09:05:44 INFO mapred.JobClient: Bytes Received=5498829
> 14/11/17 09:05:44 INFO mapred.JobClient: Bytes Retried=0
> 14/11/17 09:05:44 INFO mapred.JobClient: Bytes Sent=14351811146
> 14/11/17 09:05:44 INFO mapred.JobClient: Documents Accepted=10129699
> 14/11/17 09:05:44 INFO mapred.JobClient: Documents Received=0
> 14/11/17 09:05:44 INFO mapred.JobClient: Documents Retried=0
> 14/11/17 09:05:44 INFO mapred.JobClient: Documents Sent=10129699
> 14/11/17 09:05:44 INFO mapred.JobClient: Network Retries=0
> 14/11/17 09:05:44 INFO mapred.JobClient: Network Total Time(ms)=11732552
> 14/11/17 09:05:44 INFO mapred.JobClient: Node Retries=0
> 14/11/17 09:05:44 INFO mapred.JobClient: Scroll Total=0
> 14/11/17 09:05:44 INFO mapred.JobClient: Scroll Total Time(ms)=0
>
> Thanks,
> Xavier.

--
Nick Canzoneri
Developer, Wildbit <http://wildbit.com/>
Beanstalk <http://beanstalkapp.com/>, Postmark <http://postmarkapp.com/>, dploy.io
