[GitHub] [beam] damccorm opened a new issue, #20679: ElasticsearchIO performs 0 division on DirectRunner

GitBox Sat, 04 Jun 2022 11:57:44 -0700


damccorm opened a new issue, #20679:
URL: https://github.com/apache/beam/issues/20679


   - Environment configuration
   
   In my company we use [Elasticsearch cross 
cluster](https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-cross-cluster-search.html#ccs-supported-apis)
 setup for search. Cluster version is 7.9.1.
   
   I intended to use ElasticsearchIO for reading application logs and 
subsequently producing some aggregated data.
   - Problem description
    - In cross cluster ES setup, there is no `/<index>/_stats` API available, 
so it is not possible to compute 
[ElasticsearchIO#getEstimatedSizeBytes](https://github.com/apache/beam/blob/master/sdks/java/io/elasticsearch/src/main/java/org/apache/beam/sdk/io/elasticsearch/ElasticsearchIO.java#L692)
 properly.
    - `statsJson` returned by the cluster looks like this:
   > Unknown macro: \{ "_shards" }
   > 
   > _all" :
   > nknown macro: \{ "primaries" }
   > 
   > total" : \{ }
   > ,
   > indices" : \{ }
   > 
   >  - That means that `totalCount` value cannot be parsed from the json and 
is thus set to `0`.
    - Which means that `estimatedByteSize` value [is set to 
1](https://github.com/apache/beam/blob/master/sdks/java/io/elasticsearch/src/main/java/org/apache/beam/sdk/io/elasticsearch/ElasticsearchIO.java#L707)
 (Which itself is a workaround for similar issue.)
    - `ElasticsearchIO#getEstimatedSizeBytes` is used in 
[BoundedReadEvaluatorFactory#getInitialInputs](https://github.com/apache/beam/blob/master/runners/direct-java/src/main/java/org/apache/beam/runners/direct/BoundedReadEvaluatorFactory.java#L212)
 which does not check the value and performs division of two `long` values, 
which of course results in `0` for any `targetParallelism > 1`.
    - Then 
[ElasticsearchIO#split](https://github.com/apache/beam/blob/master/sdks/java/io/elasticsearch/src/main/java/org/apache/beam/sdk/io/elasticsearch/ElasticsearchIO.java#L665)
 is called with `indexSize = 1` and `desiredBundleSizeBytes = 1`. Which sets 
`nbBundlesFloat` value to infinity.
    - Even though the number of bundles is ceiled at `1024`, reading from 1024 
BoundedElasticsearchSources concurrently makes the ElasticsearchIO virtually 
impossible to use on direct runner.
   
   - Resolution suggestion
   
   I still haven't tested reading from ElasticsearchIO on proper runner (we use 
flink 1.10.2), so I cannot either confirm or deny its functionality on our 
elastic setup. At the moment I'm just suggesting few checks of input values so 
the zero division and unnecessary parallelism problems are eliminated on direct 
runner.
   
   Imported from Jira 
[BEAM-10945](https://issues.apache.org/jira/browse/BEAM-10945). Original Jira 
may contain additional context.
   Reported by: DraCzech.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [beam] damccorm opened a new issue, #20679: ElasticsearchIO performs 0 division on DirectRunner

Reply via email to