[GitHub] spark pull request #15505: [SPARK-18890][CORE] Move task serialization from ...

squito Mon, 09 Jan 2017 09:14:16 -0800

Github user squito commented on a diff in the pull request:

    https://github.com/apache/spark/pull/15505#discussion_r95198044
  
    --- Diff: 
core/src/main/scala/org/apache/spark/scheduler/TaskDescription.scala ---
    @@ -52,7 +55,36 @@ private[spark] class TaskDescription(
         val addedFiles: Map[String, Long],
         val addedJars: Map[String, Long],
         val properties: Properties,
    -    val serializedTask: ByteBuffer) {
    +    private var serializedTask_ : ByteBuffer,
    +    private var task_ : Task[_] = null) extends  Logging {
    +
    +  def this(
    +      taskId: Long,
    +      attemptNumber: Int,
    +      executorId: String,
    +      name: String,
    +      index: Int, // Index within this task's TaskSet
    +      addedFiles: Map[String, Long],
    +      addedJars: Map[String, Long],
    +      properties: Properties,
    +      task: Task[_]) {
    +    this(taskId, attemptNumber, executorId, name, index,
    +      addedFiles, addedJars, properties, null, task)
    +  }
    +
    +  lazy val serializedTask: ByteBuffer = {
    +    if (serializedTask_ == null) {
    +      serializedTask_ = try {
    +        ByteBuffer.wrap(Utils.serialize(task_))
    +      } catch {
    +        case NonFatal(e) =>
    +          val msg = s"Failed to serialize task $taskId, not attempting to 
retry it."
    +          logError(msg, e)
    +          throw new TaskNotSerializableException(e)
    +      }
    +    }
    +    serializedTask_
    --- End diff --
    
    can you include a comment here about the purpose of this?  Maybe just 
inside the `if (serialized_ == null)`, something like "This is where we 
serialize the task on the driver before sending it to the executor.  This is 
not done when creating the TaskDescription so we can postpone this 
serialization to later in the scheduling process -- particularly, so it can 
happen in another thread by the CoarseGrainedSchedulerBackend.  On the 
executors, this will already be populated by `decode`".
    
    Also, I feel like this would be more clear as a `def`, rather than a `lazy 
val`.  Since its a lazy val I keep thinking this itself is going to be 
serialized, and then I have to remind myself its just used by encode and the 
serialization is manually managed.  The only advantage of `lazy val` is 
automatic thread-safety, but that's irrelevant here.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] spark pull request #15505: [SPARK-18890][CORE] Move task serialization from ...

Reply via email to