Kamal Gurala created SPARK-20054:
------------------------------------

             Summary: [Mesos] Detectability for resource starvation
                 Key: SPARK-20054
                 URL: https://issues.apache.org/jira/browse/SPARK-20054
             Project: Spark
          Issue Type: Improvement
          Components: Mesos, Scheduler
    Affects Versions: 2.1.0, 2.0.2, 2.0.1, 2.0.0
            Reporter: Kamal Gurala
            Priority: Minor


We currently run our Spark cluster in coarse-grained mode on Mesos 1.1.0. We 
recently had a production issue in which our Spark frameworks accepted 
resources from the Mesos master, so executors were started and the Spark driver 
was aware of them, but the driver didn’t schedule any tasks and nothing 
happened for a long time because the framework had not reached its minimum 
registered resources threshold. The cluster is usually under-provisioned 
because not all the jobs need to run at the same time. The held resources were 
never offered back to the master for re-allocation, which brought the entire 
cluster to a halt until we intervened manually. 

We use DRF for Mesos and FIFO for Spark. At any point in time there can be 
10-15 Spark frameworks running on Mesos on the under-provisioned cluster. 

The ask is for better recoverability or detectability in a scenario where 
individual Spark frameworks hold onto resources but never launch any tasks, or 
for these frameworks to release the held resources after a fixed amount of 
time.
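
To make the request concrete, here is a minimal sketch of the detection logic 
I have in mind. It is plain Python, not Spark or Mesos code: the class 
StarvationDetector, its method names, and the idle_timeout_s parameter are all 
hypothetical and only illustrate "holding resources, no task launched, for 
longer than a fixed time".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StarvationDetector:
    """Hypothetical helper; none of these names exist in Spark or Mesos."""

    idle_timeout_s: float                   # assumed "fixed amount of time"
    held_executors: int = 0                 # executors the framework holds
    tasks_launched: int = 0                 # tasks launched while holding them
    holding_since: Optional[float] = None   # when the first executor was held

    def on_executor_registered(self, now: float) -> None:
        # Remember when the framework first started holding resources.
        if self.held_executors == 0:
            self.holding_since = now
        self.held_executors += 1

    def on_task_launched(self) -> None:
        self.tasks_launched += 1

    def on_resources_released(self) -> None:
        # Resources went back to the master; nothing is held any more.
        self.held_executors = 0
        self.holding_since = None

    def is_starved(self, now: float) -> bool:
        # Starved: resources are held, no task was ever launched, and the
        # holding period has reached the timeout.
        if self.held_executors == 0 or self.tasks_launched > 0:
            return False
        return now - self.holding_since >= self.idle_timeout_s


# Usage with explicit timestamps (seconds) so the behaviour is deterministic.
detector = StarvationDetector(idle_timeout_s=300.0)
detector.on_executor_registered(now=10.0)
if detector.is_starved(now=400.0):
    # A real framework could log a warning here or release its executors.
    detector.on_resources_released()
```

A check like this could either just surface a warning (detectability) or 
trigger the release of the held resources (recoverability); which of the two 
is appropriate is left open here.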



