GitHub user dhruve opened a pull request:

    https://github.com/apache/spark/pull/22288

    [SPARK-22148] Acquire new executors to avoid hang because of blacklisting

    ## What changes were proposed in this pull request?
    Every time a task is unschedulable because of the condition where no. of
    task failures < no. of executors available, we currently abort the
    taskSet, failing the job. This change tries to acquire new executors if
    dynamicAllocation is turned on, so that the job can complete successfully.
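
    The decision described above can be sketched as a toy model. This is not
    Spark's scheduler code: it assumes "unschedulable" means every live
    executor is blacklisted for the task, and `BlacklistDecision`, `Outcome`,
    and `decide` are hypothetical names used only for illustration.

    ```scala
    // Toy model only: these types are hypothetical and are not part of Spark.
    object BlacklistDecision {
      sealed trait Outcome
      case object Schedule extends Outcome            // some executor can still run the task
      case object AbortTaskSet extends Outcome        // behavior before this change
      case object RequestNewExecutor extends Outcome  // behavior proposed by this change

      // executors: ids of the live executors
      // blacklistedForTask: executors on which this task may no longer run
      def decide(executors: Set[String],
                 blacklistedForTask: Set[String],
                 dynamicAllocationEnabled: Boolean): Outcome = {
        val usable = executors -- blacklistedForTask
        if (usable.nonEmpty) Schedule
        else if (dynamicAllocationEnabled) RequestNewExecutor
        else AbortTaskSet
      }
    }
    ```

    With two executors that are both blacklisted for a task, `decide` yields
    `AbortTaskSet` when dynamic allocation is off and `RequestNewExecutor`
    when it is on.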
    
    ## How was this patch tested?
    
    I performed some manual tests to check and validate the behavior. 
    
    ```scala
    val rdd = sc.parallelize(Seq(1 to 10), 3)
    
    import org.apache.spark.TaskContext
    
    val mapped = rdd.mapPartitionsWithIndex { (index, iterator) =>
      if (index == 2) {
        // Delay partition 2 and fail its first three attempts (0, 1 and 2).
        Thread.sleep(30 * 1000)
        val attemptNum = TaskContext.get.attemptNumber
        if (attemptNum < 3) throw new Exception("Fail for blacklisting")
      }
      iterator.toList.map(x => x + " -> " + index).iterator
    }
    
    mapped.collect
    ```
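
    The scenario above only exercises the new path when blacklisting and
    dynamic allocation are both enabled. The PR does not list the settings
    used for the manual test, so the following is only a sketch of a setup
    that would enable both, using standard Spark 2.x configuration keys with
    illustrative values.

    ```scala
    import org.apache.spark.SparkConf

    // Illustrative values only; none of these are taken from the PR.
    val conf = new SparkConf()
      .set("spark.blacklist.enabled", "true")           // turn on executor blacklisting
      .set("spark.dynamicAllocation.enabled", "true")   // allow requesting new executors
      .set("spark.shuffle.service.enabled", "true")     // external shuffle service, normally required by dynamic allocation in Spark 2.x
      .set("spark.dynamicAllocation.minExecutors", "1")
      .set("spark.dynamicAllocation.maxExecutors", "4")
    ```

    In spark-shell the context already exists, so the same keys would be
    passed as `--conf key=value` options at launch instead.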
    
    Note: I am putting up this PR as an initial draft to review the approach.
    
    Todo List:
    - Add unit tests
    - Agree upon the conf name & value and update the docs 
    
    We can build on this approach further by:
    - Taking into account static allocation
    - Querying the RM to figure out whether it is a small cluster, and then
      either waiting some more or aborting immediately
    - Distinguishing between time spent waiting to acquire an executor and
      time spent unable to schedule a task
    
    Open to suggestions.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/dhruve/spark bug/SPARK-22148

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22288.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #22288
    
----
commit 5253b3134119b2a28cdaa1406d7bafb55f62cbc1
Author: Dhruve Ashar <dhruveashar@...>
Date:   2018-08-30T18:08:58Z

    [SPARK-22148] Acquire new executors to avoid hang because of blacklisting

----

