Rob,
Even with 7 shards, each shard already holds around 100 GB of data, so I
don't think I can achieve "each shard should be around 20 - 30 gb in size".
I am using a file for testing, so it's actually indexing sequentially, 200
entries at a time. When I run the cat thread_pool API:
curl
'localhost:9200/_cat/thread_pool?v&h=id,host,bulk.active,bulk.rejected,bulk.completed,bulk.queue,bulk.queueSize'
id   host      bulk.active bulk.rejected bulk.completed bulk.queue bulk.queueSize
-fmG es-trgt01 0           15901         13024036       0          50
Bp9R es-trgt04 0           41            10806286       0          50
lB0j es-trgt02 0           0             6412           0          50
tW2Z es-trgt05 0           4             11000638       0          50
_qPw es-trgt06 4           0             8286           25         50
csxB es-trgt03 0           0             8314           0          50
ah7F es-trgt00 0           2200          9978972        0          50
It does show a large number of rejections, but none of the queues reaches
its queue size (50). Why would indexing fail in that case?
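
For reference, this is roughly how I would surface the rejections on the
client side. A minimal sketch, assuming this version of elasticsearch-py's
streaming_bulk yields (ok, item) pairs and accepts raise_on_error; my
understanding is that rejected documents come back as per-item errors in the
bulk response rather than failing the whole request:

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(['localhost:9200'])
actions = []  # placeholder: the same bulk actions I pass to helpers.bulk

failed = []
for ok, item in helpers.streaming_bulk(es, actions, raise_on_error=False):
    if not ok:
        # e.g. {'index': {'error': 'EsRejectedExecutionException[...]', ...}}
        failed.append(item)
print('%d documents were rejected or failed' % len(failed))
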
Another thing worth mentioning is that the documents I am indexing are
child documents. Does this affect the bulk behavior at all?
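
For context, the actions I send look roughly like this (a sketch with
placeholder index, type, and field names); the _parent value means each
child is routed to the same shard as its parent:

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(['localhost:9200'])
batch = []  # placeholder: one chunk of ~200 entries read from the file

actions = [
    {
        '_index': 'myindex',            # placeholder index name
        '_type': 'child_doc',           # placeholder child type
        '_id': entry['id'],
        '_parent': entry['parent_id'],  # routes the child to its parent's shard
        '_source': entry['fields'],
    }
    for entry in batch
]
helpers.bulk(es, actions)
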
I am going to lower the heap size to see whether it helps.
Thanks,
Chen
On Monday, November 17, 2014 10:55:03 PM UTC-8, Robert Gardam wrote:
>
> There are a few things going on here.
>
> When you say 200 entries, is this per second? It might be chunking them
> into batches of 200 docs, but you're really just hitting it with more than
> you think. -
> http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/cat-thread-pool.html
>
> This doc can show you what the different thread pools are doing. If you
> notice that it's rejecting large numbers of documents, you might find your
> bulk queue is too low. Increasing this might help, but be a little bit
> careful: if you increase it to something too large you can easily break ES.
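>
> If you do bump it, a rough sketch with the Python client (the
> threadpool.bulk.queue_size name is the 1.x setting, which could still be
> updated dynamically at that point; treat the value as an example, not a
> recommendation):
>
> from elasticsearch import Elasticsearch
>
> es = Elasticsearch(['localhost:9200'])
> # raise the bulk queue a little; a very large queue only hides the
> # back-pressure in heap instead of fixing it
> es.cluster.put_settings(body={
>     'transient': {'threadpool.bulk.queue_size': 200}
> })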
>
> First thing: it's always a good idea not to go above 32 GB of heap. Beyond
> that you lose compressed object pointers and memory usage can run away. The
> file system cache will happily consume the rest of your memory.
>
> You could also drop to zero replicas while doing the bulk import and then
> increase the replica count after the import has completed. That way you're
> not writing out replicas while trying to bulk import. Replicas only help
> reads, not indexing.
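>
> With the Python client that's roughly (index name is a placeholder):
>
> from elasticsearch import Elasticsearch
>
> es = Elasticsearch(['localhost:9200'])
> # drop replicas for the duration of the import...
> es.indices.put_settings(index='myindex',
>                         body={'index': {'number_of_replicas': 0}})
> # ... run the bulk import, then restore them afterwards
> es.indices.put_settings(index='myindex',
>                         body={'index': {'number_of_replicas': 1}})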
>
> Another important thing is working out the mappings for your index. Are you
> analyzing every field, or are there fields that don't require full-text
> searching, etc.?
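>
> As a rough sketch (field and type names here are made up), fields that only
> need exact matches, or no search at all, can be mapped up front like this:
>
> from elasticsearch import Elasticsearch
>
> es = Elasticsearch(['localhost:9200'])
> # mappings need to be in place before the data is indexed; an existing
> # field can't be switched later without reindexing
> es.indices.put_mapping(index='myindex', doc_type='child_doc', body={
>     'child_doc': {
>         'properties': {
>             'user_id': {'type': 'string', 'index': 'not_analyzed'},  # exact match only
>             'payload': {'type': 'string', 'index': 'no'},            # not searchable
>         }
>     }
> })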
>
> What is the refresh interval on this index? It could be getting busy trying
> to refresh the index, although I wouldn't expect a cluster of this size to
> have trouble indexing 200 docs with this (if it's 200 per second).
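>
> If refresh is part of the problem, a common pattern (again just a sketch,
> placeholder index name) is to pause it during the load:
>
> from elasticsearch import Elasticsearch
>
> es = Elasticsearch(['localhost:9200'])
> # disable refresh while bulk loading...
> es.indices.put_settings(index='myindex',
>                         body={'index': {'refresh_interval': '-1'}})
> # ... bulk import, then restore the interval and refresh once at the end
> es.indices.put_settings(index='myindex',
>                         body={'index': {'refresh_interval': '1s'}})
> es.indices.refresh(index='myindex')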
>
> I've also found that running the same number of shards as nodes can have a
> bad impact on the cluster, as all nodes are busy indexing and then can't
> perform other cluster functions. To give you an idea, each shard should be
> around 20 - 30 GB in size.
> Try reducing your shard count to 3, or maybe even 2, and then increase
> replicas.
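>
> Since the shard count of an existing index can't be changed in place, that
> means reindexing into a new index created with fewer primaries, roughly
> (names are placeholders):
>
> from elasticsearch import Elasticsearch
>
> es = Elasticsearch(['localhost:9200'])
> es.indices.create(index='myindex_v2', body={
>     'settings': {'number_of_shards': 3, 'number_of_replicas': 1}
> })
> # ...then bulk the data into myindex_v2 and point readers at it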
>
> I hope this helps.
>
> Cheers,
> Rob
>
>
> On Monday, November 17, 2014 8:04:28 PM UTC+1, Chen Wang wrote:
>>
>> Hey, guys,
>> I am regularly loading a Hive table of around 10 million records into ES.
>> Each document is small, with 5-6 attributes. My ES cluster has 7 nodes,
>> each with 4 cores and 128 GB of RAM. ES was allocated 60% of the memory,
>> and I am bulk inserting (using the Python client) 200 entries at a time.
>> My cluster is in green status, running version 1.2.1. The index has
>> "number_of_shards" : 7, "number_of_replicas" : 1.
>> But I keep getting a read timeout exception:
>>
>> Traceback (most recent call last):
>>   File "reduce_dotcom_browse.test.py", line 95, in <module>
>>     helpers.bulk(es, actions)
>>   File "/usr/lib/python2.6/site-packages/elasticsearch/helpers.py", line 148, in bulk
>>     for ok, item in streaming_bulk(client, actions, **kwargs):
>>   File "/usr/lib/python2.6/site-packages/elasticsearch/helpers.py", line 107, in streaming_bulk
>>     resp = client.bulk(bulk_actions, **kwargs)
>>   File "/usr/lib/python2.6/site-packages/elasticsearch/client/utils.py", line 70, in _wrapped
>>     return func(*args, params=params, **kwargs)
>>   File "/usr/lib/python2.6/site-packages/elasticsearch/client/__init__.py", line 568, in bulk
>>     params=params, body=self._bulk_body(body))
>>   File "/usr/lib/python2.6/site-packages/elasticsearch/transport.py", line 274, in perform_request
>>     status, headers, data = connection.perform_request(method, url, params, body, ignore=ignore)
>>   File "/usr/lib/python2.6/site-packages/elasticsearch/connection/http_urllib3.py", line 51, in perform_request
>>     raise ConnectionError('N/A', str(e), e)
>> elasticsearch.exceptions.ConnectionError: ConnectionError(HTTPConnectionPool(host=u'10.93.80.216', port=9200): Read timed out. (read timeout=10)) caused by: ReadTimeoutError(HTTPConnectionPool(host=u'10.93.80.216', port=9200): Read timed out. (read timeout=10))
>>
>> How can I troubleshoot this? In my opinion, bulk inserting 200 entries
>> should be fairly easy.
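>>
>> The only client-side setting I can see that matches the error is the 10
>> second read timeout. As a sketch (assuming the client-level timeout
>> argument in elasticsearch-py), giving the bulk calls more room would look
>> like:
>>
>> from elasticsearch import Elasticsearch, helpers
>>
>> # raise the read timeout from the 10s default shown in the error above
>> es = Elasticsearch(['10.93.80.216:9200'], timeout=60)
>> actions = []  # placeholder: the bulk actions built from the Hive rows
>> helpers.bulk(es, actions)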
>> Thanks for any pointers.
>> Chen
>>
>>
>>