Github user mccheah commented on the pull request:
https://github.com/apache/spark/pull/3275#issuecomment-63149863
Please consider the design issues that I think this bug uncovers before
commenting on the PR.
From what I understand, the original design was to catch task serialization
errors up front in the DAGScheduler. I'm not sure that is the right place to
handle these errors. Why shouldn't the TaskSetManager be able to handle
serialization failures correctly on its own? As it stands, we're doing an
extra, unnecessary serialization call in the DAGScheduler, which is wasted
effort because the TaskSetManager is going to serialize the tasks anyway.
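To illustrate the redundancy, here is a minimal plain-JVM sketch (not Spark's actual scheduler code; `trialSerialize` and `submitTasks` are hypothetical names) of an up-front check that trial-serializes only the first task. It adds work to the common successful case while the real serialization still happens later for every task:

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

object SerializationCheckSketch {
  // Trial-serialize one object; throws NotSerializableException if it cannot be serialized.
  def trialSerialize(task: AnyRef): Unit = {
    val out = new ObjectOutputStream(new ByteArrayOutputStream())
    try out.writeObject(task) finally out.close()
  }

  // Hypothetical submission path: probing only tasks.head adds work to the common
  // successful case, and every task still gets serialized again afterwards.
  def submitTasks(tasks: Seq[AnyRef]): Unit = {
    trialSerialize(tasks.head)     // up-front check on a single task
    tasks.foreach(trialSerialize)  // the serialization that actually ships the tasks
  }
}
```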
This also uncovers the flawed assumption that if one task is serializable,
then all tasks in the task set are serializable. I'm not sure we should rely
on that assumption: custom RDDs and custom Partition implementations may not
be uniformly serializable across partitions. We should revisit this
assumption; see the sketch below.
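As a concrete illustration, here is a minimal sketch of a custom RDD whose partitions are not uniformly serializable. The `NonSerializableHandle`, `MixedPartition`, and `MixedSerializabilityRDD` names are made up for this example and do not come from the PR:

```scala
import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// Stand-in for state that cannot be serialized (a socket, a file handle, ...).
class NonSerializableHandle

// Partition whose serializability depends on per-partition state.
class MixedPartition(override val index: Int,
                     val handle: Option[NonSerializableHandle]) extends Partition

// Custom RDD: partition 0 serializes fine, partition 1 does not, so checking
// only the first task of the task set would not catch the failure.
class MixedSerializabilityRDD(sc: SparkContext) extends RDD[Int](sc, Nil) {

  override protected def getPartitions: Array[Partition] = Array(
    new MixedPartition(0, None),                            // serializable
    new MixedPartition(1, Some(new NonSerializableHandle))  // not serializable
  )

  override def compute(split: Partition, context: TaskContext): Iterator[Int] =
    Iterator(split.index)
}
```

With something like `new MixedSerializabilityRDD(sc).collect()`, a check that serializes only the first task would pass, and the NotSerializableException would surface only when the second task is serialized.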