Thanks to both of you, I'll look at this immediately!
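(For anyone finding this thread later: here is a minimal sketch of the BulkProcessor approach Jörg suggests, written against the 1.x Java client. The batch size, concurrency level, and listener body are illustrative assumptions, not tested settings; `client`, `datasetName`, `RECORD`, and `data` refer to the variables from the code below.)

```java
import org.elasticsearch.action.bulk.BulkProcessor;
import org.elasticsearch.action.bulk.BulkRequest;
import org.elasticsearch.action.bulk.BulkResponse;

BulkProcessor bulkProcessor = BulkProcessor.builder(client,
        new BulkProcessor.Listener() {
            @Override
            public void beforeBulk(long executionId, BulkRequest request) {}

            @Override
            public void afterBulk(long executionId, BulkRequest request,
                                  BulkResponse response) {
                if (response.hasFailures()) {
                    logger.log(Level.SEVERE, response.buildFailureMessage());
                }
            }

            @Override
            public void afterBulk(long executionId, BulkRequest request,
                                  Throwable failure) {
                logger.log(Level.SEVERE, "Bulk failed", failure);
            }
        })
        .setBulkActions(10000)        // flush every 10,000 actions
        .setConcurrentRequests(4)     // allow up to 4 bulks in flight
        .build();

// per record, instead of building a BulkRequest by hand:
bulkProcessor.add(client.prepareIndex(datasetName, RECORD).setSource(data).request());

// when the iterator is exhausted:
bulkProcessor.close();
```

The processor batches and parallelizes internally, so the manual batch loop and the hand-rolled thread management both go away.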

On Tuesday, June 24, 2014 at 17:51:04 UTC+2, Jörg Prante wrote:
>
> You should use the org.elasticsearch.action.bulk.BulkProcessor helper 
> class for concurrent bulk indexing.
>
> Jörg
>
>
> On Tue, Jun 24, 2014 at 5:34 PM, Frederic Esnault <[email protected]> 
> wrote:
>
>> Hi again,
>>
>> any idea how to parallelize the bulk insert process?
>> I tried creating 4 BulkInserters extending RecursiveAction and executed 
>> them all, but the result was awful: 3 of them finished very slowly, one 
>> did not finish (I don't know why), and I got only 70,000 docs in ES 
>> instead of 265,000...
>>
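For what it's worth, one minimal way to fan fixed-size batches out over a plain thread pool looks like this. It is a sketch, not the BulkInserter code above: the `Consumer<List<String>>` stands in for the actual bulk call, and draining the iterator on the calling thread avoids sharing it between workers (a likely cause of lost records with 4 concurrent RecursiveActions):

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

public class ParallelBatches {

    /**
     * Drain the iterator into fixed-size batches and hand each batch to a
     * fixed thread pool. The iterator is consumed on the calling thread, so
     * it needs no synchronization; only the bulk calls run in parallel.
     * Returns the number of batches submitted.
     */
    public static int submitInBatches(Iterator<String> records, int batchSize,
                                      int threads, Consumer<List<String>> bulkCall) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        int batches = 0;
        while (records.hasNext()) {
            final List<String> batch = new ArrayList<>(batchSize);
            while (records.hasNext() && batch.size() < batchSize) {
                batch.add(records.next());
            }
            batches++;
            pool.submit(() -> bulkCall.accept(batch));
        }
        pool.shutdown();
        try {
            // wait for all in-flight bulk calls before returning
            pool.awaitTermination(5, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return batches;
    }
}
```

BulkProcessor (suggested above) does essentially this for you, plus flushing by size and by time.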
>> Downsizing the batches to 10,000 did not change much; the total process 
>> took about 1 second less. (These times are much lower than in the 
>> previous post because I moved the importing UI to my server, close to 
>> one of the ES nodes.) It was more than 29 seconds before; now it is 28.
>>
>>
>> *Import CSV file took 28.069 seconds*
>>
>> Here is the insertion code. The Iterator is a CSV-reading iterator that 
>> parses lines and returns Record instances (objects with generic values, 
>> indexed as strings). MAX_RECORDS is my batch size, set to 10,000.
>>
>>     public void insert(Iterator<Record> recordsIterator) {
>>         while (recordsIterator.hasNext()) {
>>             batchInsert(recordsIterator, MAX_RECORDS);
>>         }
>>     }
>>
>>     private void batchInsert(Iterator<Record> recordsIterator, int limit) {
>>         BulkRequestBuilder bulkRequest = client.prepareBulk();
>>         int processed = 0;
>>         try {
>>             logger.log(Level.INFO, "Adding records to bulk insert batch");
>>             while (recordsIterator.hasNext() && processed < limit) {
>>                 processed++;
>>                 Record record = recordsIterator.next();
>>                 IndexRequestBuilder builder = client.prepareIndex(datasetName, RECORD);
>>                 XContentBuilder data = jsonBuilder();
>>                 data.startObject();
>>                 for (ColumnMetadata column : dataset.getMetadata().getColumns()) {
>>                     Object value = record.getCell(column.getName()).getValue();
>>                     // treat the literal string "NULL" from the CSV as a real null
>>                     if (value == null || (value instanceof String && value.equals("NULL"))) {
>>                         value = null;
>>                     }
>>                     data.field(column.getNormalizedName(), value);
>>                 }
>>                 data.endObject();
>>                 builder.setSource(data);
>>                 bulkRequest.add(builder);
>>             }
>>             logger.log(Level.INFO, "Added " + bulkRequest.numberOfActions()
>>                     + " records to bulk insert batch. Inserting batch...");
>>             long current = System.currentTimeMillis();
>>             BulkResponse bulkResponse = bulkRequest
>>                     .setConsistencyLevel(WriteConsistencyLevel.ONE)
>>                     .execute().actionGet();
>>             if (bulkResponse.hasFailures()) {
>>                 logger.log(Level.SEVERE, "Could not index: "
>>                         + bulkResponse.buildFailureMessage());
>>             }
>>             System.out.println(String.format("Bulk insert took %s seconds",
>>                     NumberUtils.formatSeconds(
>>                             ((double) (System.currentTimeMillis() - current)) / 1000.0)));
>>         } catch (Exception e) {
>>             e.printStackTrace();
>>         }
>>     }
>>
>> On Tuesday, June 24, 2014 at 13:44:03 UTC+2, Frederic Esnault wrote:
>>
>>> Thanks for all this.
>>>
>>> I changed my conf: removed all the thread pool config, reduced the 
>>> refresh time to 5s per Michael's advice, and limited my batches to 10,000.
>>> I'll see how that works, then I'll parallelize the bulk insert.
>>> I'll tell you how it ends up.
>>>
>>> Thanks again !
>>>
>>> On Monday, June 23, 2014 at 12:56:14 UTC+2, Jörg Prante wrote:
>>>>
>>>> Your bulk insert size is too large. It makes no sense to insert 
>>>> 100,000 docs with one request. Use 1,000-10,000 instead.
>>>>
>>>> Also, you should submit bulk requests in parallel, not sequentially 
>>>> like you do. Sequential bulk is slow if the client CPU/network is not 
>>>> saturated.
>>>>
>>>> Check whether you have changed the index refresh interval from the 
>>>> default of 1s to -1 while bulk indexing is active. 30s does not make 
>>>> much sense if you can execute the bulk within that time.
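Disabling refresh for the duration of the bulk and restoring it afterwards can be done through the 1.x Java client roughly as follows; the index name "dataset" and the `client` variable are assumptions for illustration:

```java
import org.elasticsearch.common.settings.ImmutableSettings;

// turn refresh off before bulk indexing
client.admin().indices().prepareUpdateSettings("dataset")
        .setSettings(ImmutableSettings.settingsBuilder()
                .put("index.refresh_interval", "-1")
                .build())
        .execute().actionGet();

// ... run the bulk indexing ...

// restore the default refresh interval afterwards
client.admin().indices().prepareUpdateSettings("dataset")
        .setSettings(ImmutableSettings.settingsBuilder()
                .put("index.refresh_interval", "1s")
                .build())
        .execute().actionGet();
```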
>>>>
>>>> Do not limit indexing memory to 50%.
>>>>
>>>> It makes no sense to increase the queue_size for the bulk thread pool 
>>>> to 1000. This means you want a single ES node to accept 1000 x 100,000 
>>>> = 100,000,000 = 100m docs at once. This will simply exceed all 
>>>> reasonable limits and bring the node down with an OOM (if you really 
>>>> have 100m docs).
>>>>
>>>> More advice is possible if you can show the client code you use to 
>>>> push docs to ES.
>>>>
>>>> Jörg
>>>>
>>>>
>>>>
>>>> On Mon, Jun 23, 2014 at 12:30 PM, Frederic Esnault <
>>>> [email protected]> wrote:
>>>>
>>>>> Hi everyone,
>>>>>
>>>>> I'm inserting around 265,000 documents into an Elasticsearch cluster 
>>>>> composed of 3 nodes (real servers).
>>>>> On two servers I give Elasticsearch 20g of heap; on the third, which 
>>>>> has 64g of RAM, I set 30g of heap for Elasticsearch.
>>>>>
>>>>> I set the Elasticsearch configuration to:
>>>>>
>>>>> - 3 shards (1 per server)
>>>>> - 0 replicas
>>>>> - discovery.zen.ping.multicast.enabled: false (giving each node the 
>>>>> unicast hostnames of the two other nodes);
>>>>> - and this:
>>>>>
>>>>> indices.memory.index_buffer_size: 50%
>>>>> index.refresh_interval: 30s
>>>>> threadpool:
>>>>>   index:
>>>>>     type: fixed
>>>>>     size: 30
>>>>>     queue_size: 1000
>>>>>   bulk:
>>>>>     type: fixed
>>>>>     size: 30
>>>>>     queue_size: 1000
>>>>>   search:
>>>>>     type: fixed
>>>>>     size: 100
>>>>>     queue_size: 200
>>>>>   get:
>>>>>     type: fixed
>>>>>     size: 100
>>>>>     queue_size: 200
>>>>>
>>>>> Indexing is done in groups of 100,000 docs, and here is my 
>>>>> application log:
>>>>> INFO: Adding records to bulk insert batch
>>>>> INFO: Added 100000 records to bulk insert batch. Inserting batch...
>>>>> -- Bulk insert took 38.724 seconds
>>>>> INFO: Adding records to bulk insert batch
>>>>> INFO: Added 100000 records to bulk insert batch. Inserting batch...
>>>>> -- Bulk insert took 31.134 seconds
>>>>> INFO: Adding records to bulk insert batch
>>>>> INFO: Added 64201 records to bulk insert batch. Inserting batch...
>>>>> -- Bulk insert took 17.366 seconds
>>>>>
>>>>> --- Import CSV file took 108.905 seconds ---
>>>>>
>>>>> I'm wondering if this time is correct or not, or if there is 
>>>>> anything I can do to improve performance?
>>>>>  
>>>>> -- 
>>>>> You received this message because you are subscribed to the Google 
>>>>> Groups "elasticsearch" group.
>>>>> To unsubscribe from this group and stop receiving emails from it, send 
>>>>> an email to [email protected].
>>>>> To view this discussion on the web visit https://groups.google.com/d/
>>>>> msgid/elasticsearch/3a38e79e-9afb-4146-a7e1-7984ec082e22%
>>>>> 40googlegroups.com 
>>>>> <https://groups.google.com/d/msgid/elasticsearch/3a38e79e-9afb-4146-a7e1-7984ec082e22%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>> .
>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>
>>>>
>>
>
>
