There are a few things going on here.

When you say 200 entries, is that per second? The helper may be chunking them 
into batches of 200 docs, but you could really be hitting the cluster with 
more than you think. Have a look at:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/cat-thread-pool.html
 
That page shows you what the different thread pools are doing. If you notice 
it's rejecting large numbers of documents, you may find your bulk queue is 
too small. Increasing it might help, but be a little careful: if you raise it 
to something huge you can easily break ES. 
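
A quick sketch with the same Python client you're already using (point it at 
any node; the column names are the bulk thread pool headers from the cat API):

from elasticsearch import Elasticsearch

es = Elasticsearch(['10.93.80.216:9200'])

# Per-node bulk thread pool: active threads, queue depth, rejected requests
print(es.cat.thread_pool(h='host,bulk.active,bulk.queue,bulk.rejected', v=True))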

It's also always a good idea not to go above 32GB of heap. Beyond that the 
JVM loses compressed object pointers and memory usage can run away, and 60% 
of 128GB puts you well over that line. The filesystem cache will happily make 
good use of whatever memory is left over. 
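
On 1.x the heap is normally set with the ES_HEAP_SIZE environment variable 
before starting each node. You can sanity-check what each node actually got 
with the cat API (same sort of snippet as above):

from elasticsearch import Elasticsearch

es = Elasticsearch(['10.93.80.216:9200'])

# Heap and RAM per node -- heap.max should stay below ~32GB
print(es.cat.nodes(h='host,heap.percent,heap.max,ram.percent', v=True))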

You could also drop to zero replicas while doing the bulk import and then add 
them back after the import has completed. That way you're not writing out 
replicas while trying to bulk index; replicas only help reads, not indexing. 
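
number_of_replicas is a dynamic setting, so you can flip it without 
recreating the index. Something like this ('my_index' is just a placeholder 
name):

from elasticsearch import Elasticsearch

es = Elasticsearch(['10.93.80.216:9200'])

# Drop replicas before the bulk import...
es.indices.put_settings(index='my_index',
                        body={'index': {'number_of_replicas': 0}})

# ... run the bulk load, then add them back:
es.indices.put_settings(index='my_index',
                        body={'index': {'number_of_replicas': 1}})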

Another important thing is working out the mappings for your index. Have you 
mapped every field explicitly, and are there fields that don't actually need 
full-text searching?
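
For example, on 1.x a string field you only filter or aggregate on can be 
mapped not_analyzed so it skips full-text analysis entirely. A sketch with 
placeholder index/type/field names:

from elasticsearch import Elasticsearch

es = Elasticsearch(['10.93.80.216:9200'])

# Explicit mapping: 'category' is exact-match only, no analysis at index time
es.indices.put_mapping(index='my_index', doc_type='my_type', body={
    'my_type': {
        'properties': {
            'category': {'type': 'string', 'index': 'not_analyzed'},
            'created_at': {'type': 'date'},
        }
    }
})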

What is the refresh interval on this index? It could be spending a lot of its 
time refreshing the index while you index into it, although I wouldn't expect 
a cluster of this size to have trouble with 200 docs (even if it is 200 per 
second). 
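
refresh_interval is also dynamic, so one option is to relax or disable it for 
the duration of the load and put it back afterwards. A rough sketch:

from elasticsearch import Elasticsearch

es = Elasticsearch(['10.93.80.216:9200'])

# Disable refresh while bulk loading (new docs won't be searchable until it's back on)
es.indices.put_settings(index='my_index',
                        body={'index': {'refresh_interval': '-1'}})

# ... bulk load, then restore the default 1s refresh:
es.indices.put_settings(index='my_index',
                        body={'index': {'refresh_interval': '1s'}})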

I've also found that running the same number of shards as nodes can have a 
bad impact on the cluster: every node is busy indexing and then can't keep up 
with other cluster functions. As a rough guide, aim for shards of around 
20-30GB each. Try reducing your shard count to 3 or maybe even 2 and then 
increase the replicas. 
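
Bear in mind the shard count can't be changed on an existing index, so this 
means creating a new index (and reindexing into it). A sketch:

from elasticsearch import Elasticsearch

es = Elasticsearch(['10.93.80.216:9200'])

# New index with 3 primaries and 1 replica instead of 7 and 1
es.indices.create(index='my_index_v2', body={
    'settings': {
        'number_of_shards': 3,
        'number_of_replicas': 1,
    }
})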

I hope this helps. 

Cheers, 
Rob


On Monday, November 17, 2014 8:04:28 PM UTC+1, Chen Wang wrote:
>
> Hey, Guys,
> I am loading a hive table of around 10million records into ES regularly. 
> Each document is small with 5-6 attributes. My Es cluster has 7 nodes, each 
> has 4 core and 128G. ES was allocated with 60% of the memory, and I am 
> bulk inserting (using the python client) every 200 entries. My cluster is in 
> Green status, running version  1.2.1. The index "number_of_shards" : 7, 
> "number_of_replicas" : 1
> But I keep getting read time out exception:
>
> Traceback (most recent call last):
>   File "reduce_dotcom_browse.test.py", line 95, in <module>
>     helpers.bulk(es, actions)
>   File "/usr/lib/python2.6/site-packages/elasticsearch/helpers.py", line 
> 148, in bulk
>     for ok, item in streaming_bulk(client, actions, **kwargs):
>   File "/usr/lib/python2.6/site-packages/elasticsearch/helpers.py", line 
> 107, in streaming_bulk
>     resp = client.bulk(bulk_actions, **kwargs)
>   File "/usr/lib/python2.6/site-packages/elasticsearch/client/utils.py", 
> line 70, in _wrapped
>     return func(*args, params=params, **kwargs)
>   File 
> "/usr/lib/python2.6/site-packages/elasticsearch/client/__init__.py", line 
> 568, in bulk
>     params=params, body=self._bulk_body(body))
>   File "/usr/lib/python2.6/site-packages/elasticsearch/transport.py", line 
> 274, in perform_request
>     status, headers, data = connection.perform_request(method, url, 
> params, body, ignore=ignore)
>   File 
> "/usr/lib/python2.6/site-packages/elasticsearch/connection/http_urllib3.py", 
> line 51, in perform_request
>     raise ConnectionError('N/A', str(e), e)
> elasticsearch.exceptions.ConnectionError: 
> ConnectionError(HTTPConnectionPool(host=u'10.93.80.216', port=9200): Read 
> timed out. (read timeout=10)) caused by: 
> ReadTimeoutError(HTTPConnectionPool(host=u'10.93.80.216', port=9200): Read 
> timed out. (read timeout=10))
>
> How can I troubleshoot this? In my opinion, bulk inserting 200 entries 
> should be fairly easy.
> Thanks for any pointers.
> Chen
>
>
>
