Hi everyone, I am using the following configuration:
2 nodes, 4 shards, 0 replicas.

I am currently indexing 50,000 (50K) files, about 6 GB in total, using pyelasticsearch. I vary the number of indexing threads from 1 to 8, and each run produces an index of a different size.

Num threads   Indexing time (s)   Index size on node 1   Index size on node 2
1             4069.559            3.50 GB                3.22 GB
2             2236.544            4.61 GB                4.54 GB
4             1990.098            5.45 GB                5.31 GB
8             1965.987            2.94 GB                2.96 GB

The mapping I am using is:

    dtype: {
        "_source": {"enabled": False},
        "_all": {"enabled": False},
        "properties": {
            "filecontent": {"type": "string", "store": False},
            "filename": {"type": "string", "index": "not_analyzed", "store": True},
            "filepath": {"type": "string", "index": "not_analyzed", "store": True},
            "filetype": {"type": "string", "index": "not_analyzed", "store": True},
            "tokens": {"type": "string", "store": True},
            "rules": {"type": "string", "store": True}
        }
    }

The "filecontent" field holds the text extracted from each file with Tika. The "tokens" field stores values that I pull out of that text with my regexes, and I populate the "rules" field based on those values.

My question: why does the index size differ when the only thing I change is the number of threads sending indexing requests?

Please note: after indexing completes, I let ES cool down so that segment merging can finish.

Please let me know what could cause this discrepancy in index size.

Thanks,
Lavesh
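P.S. To make the discrepancy easier to see, here is a minimal Python sketch that sums the per-node sizes and compares indexing times relative to the single-thread run. The numbers are copied from the measurements in this post; the names `runs`, `total_sizes`, and `speedups` are only illustrative and not part of pyelasticsearch or Elasticsearch.

```python
# Measurements copied from the table in this post:
# (threads, indexing time in s, size on node 1 in GB, size on node 2 in GB)
runs = [
    (1, 4069.559, 3.50, 3.22),
    (2, 2236.544, 4.61, 4.54),
    (4, 1990.098, 5.45, 5.31),
    (8, 1965.987, 2.94, 2.96),
]


def total_sizes(rows):
    """Return {threads: combined index size in GB across both nodes}."""
    return {threads: round(n1 + n2, 2) for threads, _, n1, n2 in rows}


def speedups(rows):
    """Return {threads: first row's indexing time divided by this row's time}."""
    base = rows[0][1]
    return {threads: base / secs for threads, secs, _, _ in rows}


totals = total_sizes(runs)
speed = speedups(runs)

for threads in totals:
    print(f"{threads} thread(s): {totals[threads]:.2f} GB total, "
          f"{speed[threads]:.2f}x vs. 1 thread")
```

Summed over both nodes, the index is 6.72 GB with 1 thread, 9.15 GB with 2, 10.76 GB with 4, and 5.90 GB with 8, so the size does not simply grow or shrink with the thread count.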