Github user MaxGekk commented on a diff in the pull request:
https://github.com/apache/spark/pull/22018#discussion_r208824418
--- Diff:
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala
---
@@ -297,7 +297,7 @@ object InMemoryFileIndex extends Logging {
val missingFiles = mutable.ArrayBuffer.empty[String]
val filteredLeafStatuses = allLeafStatuses.filterNot(
status => shouldFilterOut(status.getPath.getName))
- val resolvedLeafStatuses = filteredLeafStatuses.flatMap {
+ val resolvedLeafStatuses = filteredLeafStatuses.par.flatMap {
--- End diff ---
Scala parallel collections are not interruptible in some cases; as a
consequence, if you use them on executors, tasks cannot be canceled
properly. You can verify this yourself by running the code in a lambda
function:
https://github.com/apache/spark/blob/131ca146ed390cd0109cd6e8c95b61e418507080/core/src/test/scala/org/apache/spark/util/ThreadUtilsSuite.scala#L143-L150
When you cancel the job, the threads will still be blocked on the sleep call.
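The issue can be sketched in a small standalone program (this is an illustrative example, not code from the PR; it assumes Scala 2.12, where `.par` is in the standard library). Interrupting the thread that invoked the parallel operation does not interrupt the ForkJoinPool worker threads actually running the element tasks, so they keep sleeping and the work runs to completion anyway:

```scala
import java.util.concurrent.atomic.AtomicInteger

object ParInterruptDemo {
  private val completed = new AtomicInteger(0)
  private val numTasks = 4

  // Runs a parallel-collection operation on a dedicated thread, interrupts
  // that thread mid-flight, then reports how many element tasks still
  // finished on the pool threads afterwards.
  def run(): Int = {
    val caller = new Thread(new Runnable {
      override def run(): Unit =
        try {
          (1 to numTasks).par.foreach { _ =>
            Thread.sleep(200)            // stands in for blocking work (e.g. file listing)
            completed.incrementAndGet()
          }
        } catch {
          case _: InterruptedException => () // the caller itself may see the interrupt...
        }
    })
    caller.start()
    Thread.sleep(50)
    caller.interrupt()                    // ...but the pool worker threads are never interrupted
    caller.join()
    // Wait (bounded) for the pool threads to drain; they are not canceled.
    var waited = 0
    while (completed.get() < numTasks && waited < 5000) {
      Thread.sleep(50); waited += 50
    }
    completed.get()
  }
}
```

Despite the interrupt, `ParInterruptDemo.run()` typically returns the full task count, because the interrupt only reaches the calling thread, not the ForkJoinPool threads executing the closures. This is why canceling a Spark task cannot stop work that was fanned out through `.par` on an executor.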
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]