Github user xuanyuanking commented on the issue:
https://github.com/apache/spark/pull/20675
Many thanks for your detailed reply!
> The semantics aren't quite right. Task-level retry can happen a fixed
number of times for the lifetime of the task, which is the lifetime of the
query - even if it runs for days after, the attempt number will never be reset.
- I think the attempt number never being reset is not a problem, as long as
the task restarts with the right epoch and offset. Maybe I don't understand
what you mean by the semantics; could you please explain in more detail?
- As far as I'm concerned, when the degree of parallelism is large, a whole-stage
restart is too heavy an operation and will cause data churn.
- Also, a further thought: after CP supports shuffle and more complex
scenarios, task-level retry will need more work to ensure data correctness.
But it may still be a useful feature? I just want to leave this
patch here and start a discussion about it :)
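To make the first point concrete, here is a minimal sketch of the idea that a retried task can resume safely if it restarts from the last committed epoch and offset. All names here (`EpochStore`, `run_task`) are hypothetical and do not come from Spark's actual continuous-processing API; this is just the invariant I have in mind, not the implementation:

```python
# Hypothetical sketch: a retried task resumes from the last committed
# (epoch, offset) instead of restarting the whole stage.
# These names are NOT Spark APIs; they only illustrate the invariant.

class EpochStore:
    """Tracks the last committed (epoch, offset) per partition."""
    def __init__(self):
        self.committed = {}  # partition -> (epoch, offset)

    def commit(self, partition, epoch, offset):
        self.committed[partition] = (epoch, offset)

    def last(self, partition):
        # A fresh task starts at epoch 0, offset 0.
        return self.committed.get(partition, (0, 0))


def run_task(store, partition, data, fail_at=None):
    """Process records from the committed offset, committing one epoch
    per record; may fail mid-stream to simulate a task failure."""
    epoch, offset = store.last(partition)
    processed = []
    for i, record in enumerate(data[offset:], start=offset):
        if fail_at is not None and i == fail_at:
            raise RuntimeError("simulated task failure")
        processed.append(record)
        epoch += 1
        store.commit(partition, epoch, i + 1)
    return processed


store = EpochStore()
data = list(range(5))
try:
    run_task(store, partition=0, data=data, fail_at=3)
except RuntimeError:
    pass
# Task-level retry: only the failed task re-runs, and because it resumes
# from the committed offset, no record is reprocessed or skipped.
retried = run_task(store, partition=0, data=data)
```

In this toy model the retried task processes only the uncommitted tail of the partition, which is why I think a never-resetting attempt number is harmless by itself: correctness hinges on the resume point, not the attempt count.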