EnricoMi opened a new pull request #27377: [WIP][SPARK-30666][Core] Reliable single-stage accumulators URL: https://github.com/apache/spark/pull/27377 ### What changes were proposed in this pull request? This PR introduces the notion of an accumulator mode which allows to select how accumulators behave when they merge accumulator values from partitions. The current behaviour is to merge in `ALL` value from partitions. When partitions are executed multiple times (e.g. re-run on failure or cache eviction, rerunning actions, usage of a stage in multiple actions), accumulators count all increments and cannot be used to count while ignoring re-computations of partitions. The added `ALL` mode is the default mode. The `MAX` mode keeps the most-comprehensive accumulator value for each partition. In case of a `LongAccumulator`, this reflects the largest seen value. A `LongAccumulator` that counts the number of rows or nulls in a dataset provides accurate results in this mode. The `LAST` mode keeps the latest accumulator per partition only. This provides the latest value of the accumulator. The accumulator on the driver uses additional memory in the order of the number of partitions. ### Why are the changes needed? With the current behaviour, only a very limited class of accumulator use cases can be implemented, those that count across partition executions. Counting read and written bytes is a good example where this behaviour is desired. Counting the number of rows of the dataset, or rows that meet a certain condition cannot be implemented with this behaviour. The accumulator over-counts and thus only provides a pessimistic upper bound of the true value. ### Does this PR introduce any user-facing change? Yes, it introduces an `AccumulatorMode` enum that can be used when accumulators get registered to pick the merge behaviour. ### How was this patch tested? Unit tests in the `AccumulatorSuite`.
---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected] With regards, Apache Git Services --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
