Wow, thanks a LOT, Cédric and Jörg! Got down to 15.8 seconds for 264,000 documents.
Bulk processing took 15.863 seconds
Import CSV file took 15.874 seconds

If you have any tips to tune it further, I'll take those too :) For example, I didn't use the MultiGetRequestBuilder, just a new IndexRequest for each doc. Would it help to use the MultiGet? I can't really figure out how to use it.

On Tuesday, June 24, 2014 at 5:48:43 PM UTC+2, Cédric Hourcade wrote:
>
> Hello,
>
> You can use the BulkProcessor class to do the work for you:
> https://github.com/elasticsearch/elasticsearch/blob/master/src/main/java/org/elasticsearch/action/bulk/BulkProcessor.java
>
> Just configure/instantiate the class and .add() your index requests. See:
> https://github.com/elasticsearch/elasticsearch/blob/master/src/test/java/org/elasticsearch/action/bulk/BulkProcessorTests.java
>
> Cédric Hourcade
> [email protected]
>
> On Tue, Jun 24, 2014 at 5:34 PM, Frederic Esnault
> <[email protected]> wrote:
> > Hi again,
> >
> > Any idea about how to parallelize the bulk insert process?
> > I tried creating 4 BulkInserters extending RecursiveAction and executed
> > them all, but the result was awful: 3 of them finished very slowly, one
> > did not finish (don't know why), and I got only 70K docs in ES instead
> > of 265,000...
> >
> > Downsizing the batches to 10,000 did not change much; the total process
> > took approx. 1 second less. (This is much lower than in the previous
> > post because I moved the importing UI to my server, close to one of the
> > ES nodes.) Was more than 29 seconds, now 28.
> >
> > Import CSV file took 28.069 seconds
> >
> > Here is the insertion code. The Iterator is a CSV-reading iterator that
> > parses lines and returns Record instances (objects with generic object
> > values, indexed as strings). MAX_RECORDS is my batch size, set to 10,000.
> >
> > public void insert(Iterator<Record> recordsIterator) {
> >     while (recordsIterator.hasNext()) {
> >         batchInsert(recordsIterator, MAX_RECORDS);
> >     }
> > }
> >
> > private void batchInsert(Iterator<Record> recordsIterator, int limit) {
> >     BulkRequestBuilder bulkRequest = client.prepareBulk();
> >     int processed = 0;
> >     try {
> >         logger.log(Level.INFO, "Adding records to bulk insert batch");
> >         while (recordsIterator.hasNext() && processed < limit) {
> >             processed++;
> >             Record record = recordsIterator.next();
> >             IndexRequestBuilder builder = client.prepareIndex(datasetName, RECORD);
> >             XContentBuilder data = jsonBuilder();
> >             data.startObject();
> >             for (ColumnMetadata column : dataset.getMetadata().getColumns()) {
> >                 Object value = record.getCell(column.getName()).getValue();
> >                 if (value == null || (value instanceof String && value.equals("NULL"))) {
> >                     value = null;
> >                 }
> >                 data.field(column.getNormalizedName(), value);
> >             }
> >             data.endObject();
> >             builder.setSource(data);
> >             bulkRequest.add(builder);
> >         }
> >         logger.log(Level.INFO, "Added " + bulkRequest.numberOfActions()
> >                 + " records to bulk insert batch. Inserting batch...");
> >         long current = System.currentTimeMillis();
> >         BulkResponse bulkResponse = bulkRequest
> >                 .setConsistencyLevel(WriteConsistencyLevel.ONE)
> >                 .execute().actionGet();
> >         if (bulkResponse.hasFailures()) {
> >             logger.log(Level.SEVERE, "Could not index: " + bulkResponse.buildFailureMessage());
> >         }
> >         System.out.println(String.format("Bulk insert took %s seconds",
> >                 NumberUtils.formatSeconds(((double) (System.currentTimeMillis() - current)) / 1000.0)));
> >     } catch (Exception e) {
> >         e.printStackTrace();
> >     }
> > }
> >
> > On Tuesday, June 24, 2014 at 1:44:03 PM UTC+2, Frederic Esnault wrote:
> >>
> >> Thanks for all this.
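The manual batching in the quoted code above is essentially what the BulkProcessor class Cédric linked automates. A rough sketch against the 1.x Java client (the class name, batch size, concurrency, and flush interval here are illustrative choices, not recommendations from the thread):

```java
import org.elasticsearch.action.bulk.BulkProcessor;
import org.elasticsearch.action.bulk.BulkRequest;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.common.unit.TimeValue;

public class BulkProcessorSketch {

    // Builds a processor that batches .add()ed requests and flushes them
    // automatically, with up to 4 concurrent bulk requests in flight.
    public static BulkProcessor build(Client client) {
        return BulkProcessor.builder(client, new BulkProcessor.Listener() {
            @Override
            public void beforeBulk(long executionId, BulkRequest request) {
                // e.g. log request.numberOfActions()
            }

            @Override
            public void afterBulk(long executionId, BulkRequest request, BulkResponse response) {
                if (response.hasFailures()) {
                    // e.g. log response.buildFailureMessage()
                }
            }

            @Override
            public void afterBulk(long executionId, BulkRequest request, Throwable failure) {
                // e.g. log the failure
            }
        })
                .setBulkActions(10000)                           // flush every 10,000 requests
                .setConcurrentRequests(4)                        // parallel bulks, no RecursiveAction needed
                .setFlushInterval(TimeValue.timeValueSeconds(5)) // flush stragglers on a timer
                .build();
    }
}
```

You would then call bulkProcessor.add(...) once per IndexRequest and bulkProcessor.close() when the iterator is exhausted; with setConcurrentRequests > 0 it also takes care of the parallel submission discussed in this thread. (MultiGetRequestBuilder wouldn't help here: multi-get is for fetching documents by id, not for indexing.)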
> >>
> >> I changed my conf, removed all the thread pool config, reduced the
> >> refresh interval to 5s per Michael's advice, and limited my batches to
> >> 10,000. I'll see how it works, then I'll parallelize the bulk insert.
> >> I'll tell you how it ends up.
> >>
> >> Thanks again!
> >>
> >> On Monday, June 23, 2014 at 12:56:14 PM UTC+2, Jörg Prante wrote:
> >>>
> >>> Your bulk insert size is too large. It makes no sense to insert
> >>> 100,000 docs with one request. Use 1,000-10,000 instead.
> >>>
> >>> Also, you should submit bulk requests in parallel, not sequentially
> >>> like you do. Sequential bulk is slow if the client CPU/network is not
> >>> saturated.
> >>>
> >>> Check that you have changed the index refresh interval from 1s to -1
> >>> while bulk indexing is active. 30s does not make much sense if you can
> >>> execute the bulk in that time.
> >>>
> >>> Do not limit indexing memory to 50%.
> >>>
> >>> It makes no sense to increase queue_size for the bulk thread pool to
> >>> 1000. This means you want a single ES node to accept 1000 x 100,000 =
> >>> 100,000,000 = 100M docs at once. This simply exceeds all reasonable
> >>> limits and will bring the node down with an OOM (if you really have
> >>> 100M docs).
> >>>
> >>> More advice is possible if you can show the client code you use to
> >>> push docs to ES.
> >>>
> >>> Jörg
> >>>
> >>> On Mon, Jun 23, 2014 at 12:30 PM, Frederic Esnault
> >>> <[email protected]> wrote:
> >>>> Hi everyone,
> >>>>
> >>>> I'm inserting around 265,000 documents into an Elasticsearch cluster
> >>>> composed of 3 nodes (real servers).
> >>>> On two servers I give Elasticsearch 20g of heap; on the third, which
> >>>> has 64g of RAM, I set 30g of heap for Elasticsearch.
> >>>>
> >>>> I set the Elasticsearch configuration to:
> >>>>
> >>>> - 3 shards (1 per server)
> >>>> - 0 replicas
> >>>> - discovery.zen.ping.multicast.enabled: false (giving each node the
> >>>>   unicast hostnames of the two other nodes)
> >>>> - and this:
> >>>>
> >>>> indices.memory.index_buffer_size: 50%
> >>>> index.refresh_interval: 30s
> >>>> threadpool:
> >>>>     index:
> >>>>         type: fixed
> >>>>         size: 30
> >>>>         queue_size: 1000
> >>>>     bulk:
> >>>>         type: fixed
> >>>>         size: 30
> >>>>         queue_size: 1000
> >>>>     search:
> >>>>         type: fixed
> >>>>         size: 100
> >>>>         queue_size: 200
> >>>>     get:
> >>>>         type: fixed
> >>>>         size: 100
> >>>>         queue_size: 200
> >>>>
> >>>> Indexing is done in groups of 100,000 docs, and here is my application log:
> >>>>
> >>>> INFO: Adding records to bulk insert batch
> >>>> INFO: Added 100000 records to bulk insert batch. Inserting batch...
> >>>> -- Bulk insert took 38.724 seconds
> >>>> INFO: Adding records to bulk insert batch
> >>>> INFO: Added 100000 records to bulk insert batch. Inserting batch...
> >>>> -- Bulk insert took 31.134 seconds
> >>>> INFO: Adding records to bulk insert batch
> >>>> INFO: Added 64201 records to bulk insert batch. Inserting batch...
> >>>> -- Bulk insert took 17.366 seconds
> >>>>
> >>>> --- Import CSV file took 108.905 seconds ---
> >>>>
> >>>> Is this time reasonable, or is there something I can do to improve
> >>>> performance?
> >>>>
> >>>> --
> >>>> You received this message because you are subscribed to the Google
> >>>> Groups "elasticsearch" group.
> >>>> To unsubscribe from this group and stop receiving emails from it,
> >>>> send an email to [email protected].
> >>>> To view this discussion on the web visit
> >>>> https://groups.google.com/d/msgid/elasticsearch/3a38e79e-9afb-4146-a7e1-7984ec082e22%40googlegroups.com.
> >>>> For more options, visit https://groups.google.com/d/optout.
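The refresh-interval advice in this thread can also be applied at runtime through the index settings API rather than in elasticsearch.yml. A sketch, assuming a node reachable on localhost:9200 and an index named `dataset` (both assumptions, not from the thread):

```shell
# Disable refresh entirely while the bulk load runs
curl -XPUT 'http://localhost:9200/dataset/_settings' -d '{
  "index": { "refresh_interval": "-1" }
}'

# ... run the bulk import ...

# Restore a normal refresh interval afterwards
curl -XPUT 'http://localhost:9200/dataset/_settings' -d '{
  "index": { "refresh_interval": "1s" }
}'
```

Doing it per import means newly indexed docs only become searchable once refresh is re-enabled, instead of paying for a refresh every 30s mid-load.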
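Jörg's advice to submit bulk requests in parallel, in batches of 1,000-10,000, can be sketched in plain Java. Everything below is hypothetical scaffolding (class and method names are mine); the `indexBatch` consumer stands in for the actual prepareBulk()/execute() call:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

public class ParallelBulkSketch {

    // Cuts the record stream into batches of at most batchSize and hands
    // each batch to a fixed thread pool, so several bulk requests are in
    // flight at once. Returns the number of batches submitted.
    public static <T> int submitInBatches(Iterator<T> records, int batchSize,
                                          int threads, Consumer<List<T>> indexBatch) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        int batches = 0;
        while (records.hasNext()) {
            List<T> batch = new ArrayList<>(batchSize);
            while (records.hasNext() && batch.size() < batchSize) {
                batch.add(records.next());
            }
            pool.submit(() -> indexBatch.accept(batch)); // one bulk request per batch
            batches++;
        }
        pool.shutdown();
        try {
            // Block until every in-flight bulk has completed.
            pool.awaitTermination(10, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return batches;
    }
}
```

Note that the iterator is consumed only from the calling thread, so the CSV reader itself needs no synchronization; only the filled batch lists are handed off to the pool.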
