EnricoMi opened a new pull request #27377: [WIP][SPARK-30666][Core] Reliable 
single-stage accumulators
URL: https://github.com/apache/spark/pull/27377
 
 
   ### What changes were proposed in this pull request?
   This PR introduces the notion of an accumulator mode, which selects how 
accumulators behave when they merge accumulator values from partitions.
   
   The current behaviour is to merge in all values from partitions. When 
partitions are executed multiple times (e.g. re-run on failure or cache 
eviction, rerunning actions, usage of a stage in multiple actions), 
accumulators count all increments, so they cannot be used to count while 
ignoring re-computations of partitions. The added `ALL` mode keeps this 
behaviour and is the default mode.
   
   The `MAX` mode keeps the most comprehensive accumulator value for each 
partition. In the case of a `LongAccumulator`, this is the largest value seen 
for that partition. A `LongAccumulator` that counts the number of rows or nulls 
in a dataset provides accurate results in this mode.
   
   The `LAST` mode keeps only the latest accumulator value for each partition, 
so a re-computation of a partition replaces that partition's earlier value.
   
   The accumulator on the driver uses additional memory on the order of the 
number of partitions.
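
The three modes can be illustrated with a small standalone simulation. This is a sketch under stated assumptions, not the PR's implementation and not Spark's API: the class `DriverLongAccumulator`, its `merge(partition_id, value)` method, and the helper `run` are hypothetical names, and the driver-side value for `MAX` and `LAST` is assumed to be the sum of the values kept per partition. The update sequence is also made up, chosen so that the modes give different results.

```python
from enum import Enum


class AccumulatorMode(Enum):
    """Mode names taken from the PR text; this enum is illustrative only."""
    ALL = "all"
    MAX = "max"
    LAST = "last"


class DriverLongAccumulator:
    """Hypothetical driver-side counter; not Spark's LongAccumulator."""

    def __init__(self, mode):
        self.mode = mode
        self._total = 0           # ALL: a single running sum
        self._per_partition = {}  # MAX and LAST: one entry per partition

    def merge(self, partition_id, value):
        """Merge the value reported by one execution of one partition."""
        if self.mode is AccumulatorMode.ALL:
            # Every execution is added, including re-computations.
            self._total += value
        elif self.mode is AccumulatorMode.MAX:
            # Keep the largest value seen for this partition.
            seen = self._per_partition.get(partition_id, value)
            self._per_partition[partition_id] = max(seen, value)
        else:
            # LAST: the most recent execution replaces earlier ones.
            self._per_partition[partition_id] = value

    @property
    def value(self):
        if self.mode is AccumulatorMode.ALL:
            return self._total
        # Assumption: per-partition values are summed on the driver.
        return sum(self._per_partition.values())


def run(mode, updates):
    """Feed (partition_id, value) updates into a fresh accumulator."""
    acc = DriverLongAccumulator(mode)
    for partition_id, value in updates:
        acc.merge(partition_id, value)
    return acc.value


# Hypothetical row counts: partition 0 has 10 rows and partition 1 has 20.
# Partition 1 reports three times: two full runs and one that saw 5 rows.
updates = [(0, 10), (1, 20), (1, 20), (1, 5)]

print(run(AccumulatorMode.ALL, updates))   # 55: over-counts the 30 rows
print(run(AccumulatorMode.MAX, updates))   # 30: largest value per partition
print(run(AccumulatorMode.LAST, updates))  # 15: latest value per partition
```

In this sketch, `ALL` needs a single counter, while `MAX` and `LAST` keep one entry per partition, which matches the additional driver memory noted above.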
   
   ### Why are the changes needed?
   With the current behaviour, only a very limited class of accumulator use 
cases can be implemented: those that count across partition executions. 
Counting read and written bytes is a good example where this behaviour is 
desired. Counting the number of rows of the dataset, or the rows that meet a 
certain condition, cannot be implemented with this behaviour. The accumulator 
over-counts and thus provides only an upper bound on the true value.
   
   ### Does this PR introduce any user-facing change?
   Yes, it introduces an `AccumulatorMode` enum that can be used when 
accumulators are registered to pick the merge behaviour.
   
   ### How was this patch tested?
   Unit tests in the `AccumulatorSuite`.
