Github user vanzin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20601#discussion_r168105632
  
    --- Diff: core/src/main/scala/org/apache/spark/ui/jobs/StagePage.scala ---
    @@ -963,33 +965,60 @@ private[ui] class TaskPagedTable(
     
     private object ApiHelper {
     
    +  val HEADER_ID = "ID"
    +  val HEADER_TASK_INDEX = "Index"
    +  val HEADER_ATTEMPT = "Attempt"
    +  val HEADER_STATUS = "Status"
    +  val HEADER_LOCALITY = "Locality Level"
    +  val HEADER_EXECUTOR = "Executor ID"
    +  val HEADER_HOST = "Host"
    +  val HEADER_LAUNCH_TIME = "Launch Time"
    +  val HEADER_DURATION = "Duration"
    +  val HEADER_SCHEDULER_DELAY = "Scheduler Delay"
    +  val HEADER_DESER_TIME = "Task Deserialization Time"
    +  val HEADER_GC_TIME = "GC Time"
    +  val HEADER_SER_TIME = "Result Serialization Time"
    +  val HEADER_GETTING_RESULT_TIME = "Getting Result Time"
    +  val HEADER_PEAK_MEM = "Peak Execution Memory"
    +  val HEADER_ACCUMULATORS = "Accumulators"
    +  val HEADER_INPUT_SIZE = "Input Size / Records"
    +  val HEADER_OUTPUT_SIZE = "Output Size / Records"
    +  val HEADER_SHUFFLE_READ_TIME = "Shuffle Read Blocked Time"
    +  val HEADER_SHUFFLE_TOTAL_READS = "Shuffle Read Size / Records"
    +  val HEADER_SHUFFLE_REMOTE_READS = "Shuffle Remote Reads"
    +  val HEADER_SHUFFLE_WRITE_TIME = "Write Time"
    +  val HEADER_SHUFFLE_WRITE_SIZE = "Shuffle Write Size / Records"
    +  val HEADER_MEM_SPILL = "Shuffle Spill (Memory)"
    +  val HEADER_DISK_SPILL = "Shuffle Spill (Disk)"
    +  val HEADER_ERROR = "Errors"
     
       private val COLUMN_TO_INDEX = Map(
    -    "ID" -> null.asInstanceOf[String],
    -    "Index" -> TaskIndexNames.TASK_INDEX,
    -    "Attempt" -> TaskIndexNames.ATTEMPT,
    -    "Status" -> TaskIndexNames.STATUS,
    -    "Locality Level" -> TaskIndexNames.LOCALITY,
    -    "Executor ID / Host" -> TaskIndexNames.EXECUTOR,
    -    "Launch Time" -> TaskIndexNames.LAUNCH_TIME,
    -    "Duration" -> TaskIndexNames.DURATION,
    -    "Scheduler Delay" -> TaskIndexNames.SCHEDULER_DELAY,
    -    "Task Deserialization Time" -> TaskIndexNames.DESER_TIME,
    -    "GC Time" -> TaskIndexNames.GC_TIME,
    -    "Result Serialization Time" -> TaskIndexNames.SER_TIME,
    -    "Getting Result Time" -> TaskIndexNames.GETTING_RESULT_TIME,
    -    "Peak Execution Memory" -> TaskIndexNames.PEAK_MEM,
    -    "Accumulators" -> TaskIndexNames.ACCUMULATORS,
    -    "Input Size / Records" -> TaskIndexNames.INPUT_SIZE,
    -    "Output Size / Records" -> TaskIndexNames.OUTPUT_SIZE,
    -    "Shuffle Read Blocked Time" -> TaskIndexNames.SHUFFLE_READ_TIME,
    -    "Shuffle Read Size / Records" -> TaskIndexNames.SHUFFLE_TOTAL_READS,
    -    "Shuffle Remote Reads" -> TaskIndexNames.SHUFFLE_REMOTE_READS,
    -    "Write Time" -> TaskIndexNames.SHUFFLE_WRITE_TIME,
    -    "Shuffle Write Size / Records" -> TaskIndexNames.SHUFFLE_WRITE_SIZE,
    -    "Shuffle Spill (Memory)" -> TaskIndexNames.MEM_SPILL,
    -    "Shuffle Spill (Disk)" -> TaskIndexNames.DISK_SPILL,
    -    "Errors" -> TaskIndexNames.ERROR)
    +    HEADER_ID -> null.asInstanceOf[String],
    +    HEADER_TASK_INDEX -> TaskIndexNames.TASK_INDEX,
    +    HEADER_ATTEMPT -> TaskIndexNames.ATTEMPT,
    +    HEADER_STATUS -> TaskIndexNames.STATUS,
    +    HEADER_LOCALITY -> TaskIndexNames.LOCALITY,
    +    HEADER_EXECUTOR -> TaskIndexNames.EXECUTOR,
    +    HEADER_HOST -> TaskIndexNames.EXECUTOR,
    --- End diff --
    
    Hmmm... I agree that the correct thing would be to sort by host and have an 
index on that. The problem is that we'd be changing the data on disk, breaking 
compatibility with previous versions of the disk store. So unless that change 
goes into 2.3.0, that means revving the disk version number, which would 
require re-parsing all logs. And that kinda sucks. (I hope by the next major 
version I - or someone - get time to better investigate versioning of the disk 
data.)
    
    Given this affects 2.3 we could potentially consider it a blocker. 
@sameeragarwal probably won't be very happy though.
    
    2.2 actually sorts by executor id, and doesn't have a separate host column 
(added in SPARK-21675). That's one of those small changes I missed while 
merging all the SHS stuff.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Reply via email to