Github user markhamstra commented on the pull request:

    https://github.com/apache/spark/pull/4055#issuecomment-122122767
  
    @suyanNone @squito Thanks for all the work you've done on this PR so far -- 
your diligence has been really helpful.  Sorry that I haven't had time to 
participate more.
    
    I think that the four alternatives you've boiled this down to are a pretty 
good summary of the available options.  The only one I'm really opposed to is, 
unfortunately, the `hashCode` and `equals` approach, i.e. Option 4).  I'm in 
agreement with Imran that the `hashCode` and `equals` should serve to identify 
a `Task` globally and uniquely.  I don't like the idea that they will instead 
effectively place a `Task` into an equivalence class with other `Task`s even if 
that equivalence relation is useful in at least one important context.  There 
may be value in formalizing the idea of such a `Task` equivalence class, and 
that there be something like `EquivalentTasks` containers whose job is to hold 
equivalent `Task`s, but I don't think we need to or should go that far in this 
particular PR.
    
    On the other hand, I do like the first alternative, since 
`pendingPartitions` is only used within the `DAGScheduler` and (except for one 
log message) we never use the `Task`s held in the `Set[Task]` as actual 
`Task`s, but only intend to use them as markers for partition results or number 
of results needing to be computed.  The change to track `pendingPartitions` or 
`pendingPartitionResults` as `Set[Int]` seems like a good and simple one to me 
-- especially if we do a tiny bit more work to retain/recover all or most of 
the useful debugging information we will lose in that one log message.
    
    Option 3) or 4) may be worth doing in addition to 1) even though not 
necessary to fix SPARK-5259, but rather just to clarify how the code works -- 
i.e. something like 3) or 4) may be worth doing if it actually would make the 
code clearer and easier to understand and would also clearly not introduce 
additional complexity that would risk unintended side-effects. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to