Github user markhamstra commented on the pull request:
https://github.com/apache/spark/pull/4055#issuecomment-122122767
@suyanNone @squito Thanks for all the work you've done on this PR so far --
your diligence has been really helpful. Sorry that I haven't had time to
participate more.
I think that the four alternatives you've boiled this down to are a pretty
good summary of the available options. The only one I'm really opposed to is,
unfortunately, the `hashCode` and `equals` approach, i.e. Option 4). I'm in
agreement with Imran that the `hashCode` and `equals` should serve to identify
a `Task` globally and uniquely. I don't like the idea that they will instead
effectively place a `Task` into an equivalence class with other `Task`s even if
that equivalence relation is useful in at least one important context. There
may be value in formalizing the idea of such a `Task` equivalence class, and
that there be something like `EquivalentTasks` containers whose job is to hold
equivalent `Task`s, but I don't think we need to or should go that far in this
particular PR.
On the other hand, I do like the first alternative, since
`pendingPartitions` is only used within the `DAGScheduler` and (except for one
log message) we never use the `Task`s held in the `Set[Task]` as actual
`Task`s, but only intend to use them as markers for partition results or number
of results needing to be computed. The change to track `pendingPartitions` or
`pendingPartitionResults` as `Set[Int]` seems like a good and simple one to me
-- especially if we do a tiny bit more work to retain/recover all or most of
the useful debugging information we will lose in that one log message.
Option 3) or 4) may be worth doing in addition to 1) even though not
necessary to fix SPARK-5259, but rather just to clarify how the code works --
i.e. something like 3) or 4) may be worth doing if it actually would make the
code clearer and easier to understand and would also clearly not introduce
additional complexity that would risk unintended side-effects.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]