Hello,

I'm trying to bulk load ~10M JSON docs (12.8 GB) containing some
geographical information into an Elasticsearch index. With our current
params, the load takes around 20-25 minutes, but we think it
should be faster. Are these numbers similar to what other users are
getting? Do you have any hints on how to get better performance? Any help
would be appreciated. Please find the details below.

Our ES cluster is version 1.1.1 with 11 nodes, and we are using the
Elasticsearch-MapReduce libraries 2.0.2 to do the bulk load, setting the
number of reducers to 11. The other params we use are:

es.input.json=true
es.mapping.id=id
es.batch.size.bytes=10M
es.batch.size.entries=10000
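
As far as we understand the es-hadoop documentation, a bulk request is sent as soon as either of the two batch limits is reached. The sketch below estimates which limit is hit first for our data. It assumes that "10M" is read as 10 MiB and uses the average doc size of about 1.3 KB mentioned below; the function name is made up for illustration and is not part of any library.

```python
# Sketch: which es-hadoop batch limit is reached first for a given doc size.
# Assumptions: "10M" means 10 MiB; 1.3 KB is taken as 1.3 * 1024 bytes.

def first_limit_hit(avg_doc_bytes, max_bytes, max_entries):
    """Return (limit_name, docs_per_bulk) for a stream of equally sized docs."""
    docs_to_fill_bytes = max_bytes // avg_doc_bytes
    if docs_to_fill_bytes < max_entries:
        return "bytes", docs_to_fill_bytes
    return "entries", max_entries

BATCH_BYTES = 10 * 1024 * 1024   # es.batch.size.bytes=10M (assumed MiB)
BATCH_ENTRIES = 10000            # es.batch.size.entries=10000
AVG_DOC_BYTES = int(1.3 * 1024)  # ~1.3 KB per document

limit, docs = first_limit_hit(AVG_DOC_BYTES, BATCH_BYTES, BATCH_ENTRIES)
print(limit, docs)
```

Under these assumptions the byte limit is reached before the entry limit, so raising es.batch.size.entries alone would probably not change the size of our bulk requests.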

The average doc size is 1.3 KB, and each doc contains a "bbox" field with
a shape definition like this:

"bbox": {
    "type": "envelope",
    "coordinates": [
        [
            -77.08488844489459,
            38.9502995339637
        ],
        [
            -77.0844224567727,
            38.9502305534064
        ]
    ]
}

We are using the following mapping for this index, because these are the 3
fields of our docs we are most interested in:

{
    "properties": {
        "bbox": {
            "precision": "10m",
            "tree": "quadtree",
            "type": "geo_shape"
        },
        "id": {
            "type": "string",
            "index": "not_analyzed"
        },
        "streets": {
            "type": "string"
        }
    }
}
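
To give an idea of how the "10m" precision relates to our shapes, here is a rough sketch that estimates the quadtree depth needed for 10 m cells and the size of the example envelope in meters. It assumes that a cell at a given depth is the equator length divided by 2^depth wide, and it uses a simple flat approximation for the envelope. Elasticsearch's own calculation may well differ, and the function names are ours.

```python
import math

EARTH_EQUATOR_M = 40075016.686  # equatorial circumference in meters (WGS84)

def quadtree_levels(precision_m):
    """Smallest depth at which a cell is no wider than precision_m at the equator."""
    return math.ceil(math.log2(EARTH_EQUATOR_M / precision_m))

def envelope_size_m(upper_left, lower_right):
    """Approximate (width, height) in meters of a small lon/lat envelope."""
    (lon1, lat1), (lon2, lat2) = upper_left, lower_right
    m_per_deg = EARTH_EQUATOR_M / 360.0
    mid_lat = math.radians((lat1 + lat2) / 2.0)
    width = abs(lon2 - lon1) * m_per_deg * math.cos(mid_lat)
    height = abs(lat1 - lat2) * m_per_deg
    return width, height

levels = quadtree_levels(10.0)
width, height = envelope_size_m((-77.08488844489459, 38.9502995339637),
                                (-77.0844224567727, 38.9502305534064))
print(levels, round(width, 1), round(height, 1))
```

With this approximation the example envelope is roughly 40 m wide and under 10 m tall, so it covers only a few cells at the deepest level. We do not know how representative this shape is of the whole data set.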

This is a typical output of the MapReduce job:

14/11/17 09:05:44 INFO mapred.JobClient:   Elasticsearch Hadoop Counters
14/11/17 09:05:44 INFO mapred.JobClient:     Bulk Retries=0
14/11/17 09:05:44 INFO mapred.JobClient:     Bulk Retries Total Time(ms)=0
14/11/17 09:05:44 INFO mapred.JobClient:     Bulk Total=1375
14/11/17 09:05:44 INFO mapred.JobClient:     Bulk Total Time(ms)=11714959
14/11/17 09:05:44 INFO mapred.JobClient:     Bytes Accepted=14351811146
14/11/17 09:05:44 INFO mapred.JobClient:     Bytes Received=5498829
14/11/17 09:05:44 INFO mapred.JobClient:     Bytes Retried=0
14/11/17 09:05:44 INFO mapred.JobClient:     Bytes Sent=14351811146
14/11/17 09:05:44 INFO mapred.JobClient:     Documents Accepted=10129699
14/11/17 09:05:44 INFO mapred.JobClient:     Documents Received=0
14/11/17 09:05:44 INFO mapred.JobClient:     Documents Retried=0
14/11/17 09:05:44 INFO mapred.JobClient:     Documents Sent=10129699
14/11/17 09:05:44 INFO mapred.JobClient:     Network Retries=0
14/11/17 09:05:44 INFO mapred.JobClient:     Network Total Time(ms)=11732552
14/11/17 09:05:44 INFO mapred.JobClient:     Node Retries=0
14/11/17 09:05:44 INFO mapred.JobClient:     Scroll Total=0
14/11/17 09:05:44 INFO mapred.JobClient:     Scroll Total Time(ms)=0
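
For reference, these are the averages we derive from the counters above. The sketch assumes that the counters are summed over all 11 reducer tasks, which we have not verified; the key names in the dictionaries are ours.

```python
# Counter values copied from the job output above.
counters = {
    "bulk_total": 1375,
    "bulk_total_time_ms": 11714959,
    "bytes_sent": 14351811146,
    "docs_sent": 10129699,
}
REDUCERS = 11  # number of reducers used for the load

def summarize(c, reducers):
    """Derive per-bulk and per-reducer averages from the job counters."""
    return {
        "docs_per_bulk": c["docs_sent"] / c["bulk_total"],
        "mib_per_bulk": c["bytes_sent"] / c["bulk_total"] / 2**20,
        "bytes_per_doc": c["bytes_sent"] / c["docs_sent"],
        "ms_per_bulk": c["bulk_total_time_ms"] / c["bulk_total"],
        # Assumption: the counter is the sum over all reducer tasks.
        "bulk_minutes_per_reducer": c["bulk_total_time_ms"] / reducers / 60000,
    }

stats = summarize(counters, REDUCERS)
for name, value in stats.items():
    print(f"{name}: {value:.2f}")
```

If the assumption holds, an average bulk request carries about 7,400 docs (just under 10 MiB) and takes about 8.5 seconds, and each reducer spends close to 18 minutes of the job waiting on bulk requests.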

Thanks,
Xavier.
