Re: Indexing large number of documents

Jilles van Gurp Mon, 17 Feb 2014 04:21:14 -0800

You'll want to use the batch (bulk) API instead of indexing one document at a 
time. That scales a lot better. I've done tens of millions of documents 
like that in minutes. You can use multithreading with batches as 
well, but you probably don't want more indexing threads than the number of 
CPUs you can dedicate to indexing. Keep the batch size limited to a few 
hundred to a few thousand documents at most. Basically, go for a size that 
ES still handles in around a second or so.
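
Roughly, the batching plus a bounded thread pool could look like the sketch 
below. This is only an illustration under assumptions: BatchIndexer, 
BulkSender, partition and indexAll are made-up names, and BulkSender is a 
stand-in for whatever actually sends one bulk request. It is not part of the 
Elasticsearch API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class BatchIndexer {

    /** Hypothetical stand-in for one bulk request; not an Elasticsearch type. */
    interface BulkSender<T> {
        void send(List<T> batch) throws Exception;
    }

    /** Splits docs into consecutive batches of at most batchSize elements. */
    static <T> List<List<T>> partition(List<T> docs, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int from = 0; from < docs.size(); from += batchSize) {
            int to = Math.min(from + batchSize, docs.size());
            batches.add(new ArrayList<>(docs.subList(from, to)));
        }
        return batches;
    }

    /**
     * Sends every batch through a fixed pool of `threads` workers and waits
     * for all of them. Returns the number of batches that were sent.
     */
    static <T> int indexAll(List<T> docs, int batchSize, int threads,
                            BulkSender<T> sender) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<?>> pending = new ArrayList<>();
            for (List<T> batch : partition(docs, batchSize)) {
                // A Callable (it returns a value), so send() may throw.
                pending.add(pool.submit(() -> {
                    sender.send(batch);
                    return null;
                }));
            }
            for (Future<?> f : pending) {
                f.get(); // rethrows the first failure as an ExecutionException
            }
            return pending.size();
        } finally {
            pool.shutdown();
        }
    }
}
```

For the thread count, something like Runtime.getRuntime().availableProcessors() 
(or fewer) is a reasonable starting point. With the Java client, the sender 
would presumably build one request via client.prepareBulk(), add a 
client.prepareIndex(...) request per document, and check the response for 
failures; the batch size is then something to tune until a batch takes around 
a second.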


If you really need to index one document at a time, you'll probably want to 
scale back the number of threads. The 
error you are getting means that all nodes are busy with your previous 
requests. Increasing the timeout won't fix your problem; these requests 
should normally take a few ms, and the fact that they don't 
means you are hitting a bottleneck somewhere.
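
If you stay with single-document requests, one way to cap the load and see 
how slow the requests really are is sketched below. Again this is only an 
assumption-laden illustration: ThrottledIndexer and DocSender are made-up 
names, and DocSender stands in for your own single index call.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicLong;

final class ThrottledIndexer<T> {

    /** Hypothetical stand-in for one index request; not an Elasticsearch type. */
    interface DocSender<T> {
        void send(T doc) throws Exception;
    }

    private final Semaphore permits;
    private final DocSender<T> sender;
    private final long slowThresholdNanos;
    private final AtomicLong total = new AtomicLong();
    private final AtomicLong slow = new AtomicLong();

    ThrottledIndexer(int maxInFlight, long slowThresholdMillis, DocSender<T> sender) {
        this.permits = new Semaphore(maxInFlight);
        this.slowThresholdNanos = slowThresholdMillis * 1_000_000L;
        this.sender = sender;
    }

    /**
     * Blocks while maxInFlight requests are already running, then sends the
     * document and records how long the request took (failed ones included).
     */
    void index(T doc) throws Exception {
        permits.acquire();
        long start = System.nanoTime();
        try {
            sender.send(doc);
        } finally {
            long elapsed = System.nanoTime() - start;
            permits.release();
            total.incrementAndGet();
            if (elapsed > slowThresholdNanos) {
                slow.incrementAndGet();
            }
        }
    }

    long totalRequests() { return total.get(); }

    long slowRequests() { return slow.get(); }
}
```

All your worker threads can share one instance. If slowRequests() keeps 
climbing even with a small maxInFlight, the bottleneck is on the indexing 
side (for example a slow analysis chain) and more client threads or longer 
timeouts won't help.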

Jilles



On Monday, February 17, 2014 10:04:19 AM UTC+1, Petr Janský wrote:
>
> Hello,
>
> I'm trying to index >300k docs using the Java API.
>
> public class Fetcher {
>     public static String server = "localhost";
>     public static Integer port = 9300;
>     public static String index = "default";
>     public static String type = "default";
>     public static String typeAttributename = null;
>     static Client client = null;
>     private static Fetcher inst;
>     Settings settings = ImmutableSettings.settingsBuilder()
>             .put("cluster.name", "elasticsearch")
>             .put("node.name", "Killer")
>             .build();
>
>     public synchronized static Fetcher getInstace() {
>         if (inst == null) {
>             inst = new Fetcher();
>         }
>         return inst;
>     }
>
>     public Fetcher() {
>         client = new TransportClient(settings).addTransportAddress(
>                 new InetSocketTransportAddress(server, port));
>     }
>
>     public void index(DocumentVo document) {
>         try {
>             String type = Fetcher.type;
>             if (typeAttributename != null
>                     && document.getData().get(typeAttributename) != null) {
>                 type = document.getData().get(typeAttributename).toString();
>                 type = type.toLowerCase();
>             }
>             IndexRequestBuilder rs =
>                     client.prepareIndex().setIndex(index).setType(type);
>             rs.setTimeout(new TimeValue(10000));
>             rs.setSource(document.getData());
>             rs.execute().actionGet();
>         } catch (Exception e) {
>             e.printStackTrace();
>             client.close();
>             client = new TransportClient(settings).addTransportAddress(
>                     new InetSocketTransportAddress(server, port));
>             index(document);
>         }
>     }
>
>     public void close() {
>         client.close();
>     }
> }
>
> in ~20 threads I run
>
>     Fetcher.getInstace().index(document);
>
> I've created my own tokenizer filter that is quite slow so I'm getting
>
> Feb 17, 2014 9:53:51 AM org.elasticsearch.client.transport
> INFO: [Killer] failed to get node info for [#transport#-1][inet[localhost/127.0.0.1:9300]], disconnecting...
> org.elasticsearch.transport.ReceiveTimeoutTransportException: [][inet[localhost/127.0.0.1:9300]][cluster/nodes/info] request_id [2899] timed out after [5001ms]
>     at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:351)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
>     at java.lang.Thread.run(Unknown Source)
>
> org.elasticsearch.client.transport.NoNodeAvailableException: No node available
>     at org.elasticsearch.client.transport.TransportClientNodesService$RetryListener.onFailure(TransportClientNodesService.java:249)
>     at org.elasticsearch.action.TransportActionNodeProxy$1.handleException(TransportActionNodeProxy.java:84)
>     at org.elasticsearch.transport.TransportService$Adapter$2$1.run(TransportService.java:311)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
>     at java.lang.Thread.run(Unknown Source)
>
>
> It seems that
>     rs.setTimeout(new TimeValue(10000));
> in my index method doesn't work.
>
> How can I set up a timeout for indexing using the API?
>
> Is it correct to use one TransportClient for multiple (10-60) threads?
>
> Thanks 
> Petr
>
>
>

