JeffBolle opened a new issue, #24117:
URL: https://github.com/apache/beam/issues/24117

   ### What happened?
   
   In the process of investigating the issue reported here:
   
https://stackoverflow.com/questions/74390325/how-to-enable-elasticsearchio-parallel-reads-in-apache-beam
   
   it appears that the method the ElasticsearchIO connector uses to estimate the 
size of the data in the response does not account for the case where the 
configured index is an alias, a data stream, or an index pattern, any of which 
can point to multiple indices.
   
   The original issue was that a query returning over 100 million documents for 
processing in the pipeline was unable to scale and was only processing at a 
rate of 40 per second.
   
   As discussed in the Stack Overflow thread, the code here: 
https://github.com/apache/beam/blob/c7f2cab6ea30a63e04847dc45047a8193abc9552/sdks/java/io/elasticsearch/src/main/java/org/apache/beam/sdk/io/elasticsearch/ElasticsearchIO.java#L871
   
   does not properly account for a number of scenarios where the index name 
returned by Elasticsearch differs from 
`connectionConfiguration.getIndex()`. 
   
   Elasticsearch should be relied upon to return the proper indices for a given 
stats query, and as such the `_all` object should be used instead of the 
`indices` top-level object.  If there are other cases where the `_all` object 
isn't appropriate, then the code should iterate through all of the indices 
returned under the `indices` field and sum the total store size, rather than 
simply trying to match the configured index.
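   
   A minimal sketch of the two approaches, to make the proposal concrete. This 
is not the connector's actual code: the class and method names are hypothetical, 
the parsed `_stats` response is modeled as nested `java.util.Map` objects rather 
than the JSON tree the connector uses, and it assumes the primaries' 
`store.size_in_bytes` value is the figure of interest.

```java
import java.util.Map;

// Hypothetical helper, not part of ElasticsearchIO.
final class StoreSizeSketch {
  private StoreSizeSketch() {}

  // Walks nested maps along the given keys; returns 0 if any key is missing
  // or the final value is not a number.
  static long longAt(Object node, String... path) {
    for (String key : path) {
      if (!(node instanceof Map)) {
        return 0L;
      }
      node = ((Map<?, ?>) node).get(key);
    }
    return node instanceof Number ? ((Number) node).longValue() : 0L;
  }

  // Option 1: trust the aggregate that Elasticsearch reports under "_all".
  static long sizeFromAll(Map<String, ?> stats) {
    return longAt(stats, "_all", "primaries", "store", "size_in_bytes");
  }

  // Option 2: sum every entry under "indices", whatever its name, instead of
  // looking up only the configured index name.
  static long sizeFromIndices(Map<String, ?> stats) {
    Object indices = stats.get("indices");
    if (!(indices instanceof Map)) {
      return 0L;
    }
    long total = 0L;
    for (Object perIndex : ((Map<?, ?>) indices).values()) {
      total += longAt(perIndex, "primaries", "store", "size_in_bytes");
    }
    return total;
  }

  public static void main(String[] args) {
    // Stats as they might look for an alias backed by two indices
    // (index names and sizes are made up for illustration).
    Map<String, Object> stats = Map.of(
        "_all", Map.of("primaries", Map.of("store", Map.of("size_in_bytes", 300L))),
        "indices", Map.of(
            "logs-000001", Map.of("primaries", Map.of("store", Map.of("size_in_bytes", 100L))),
            "logs-000002", Map.of("primaries", Map.of("store", Map.of("size_in_bytes", 200L)))));

    // Looking up the alias name directly finds nothing, which is the reported problem.
    System.out.println(longAt(stats, "indices", "logs-alias", "primaries", "store", "size_in_bytes"));
    System.out.println(sizeFromAll(stats));
    System.out.println(sizeFromIndices(stats));
  }
}
```

   In this sketch a lookup by the configured alias name yields an estimated 
size of zero, while both `_all` and the per-index sum yield the combined size 
of the backing indices.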
   
   ### Issue Priority
   
   Priority: 2
   
   ### Issue Component
   
   Component: io-java-elasticsearch

