+1 non-binding


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Wed, 5 Jan 2022 at 19:16, Holden Karau <hol...@pigscanfly.ca> wrote:

> Do we want to move the SPIP forward to a vote? It seems like we're mostly
> agreeing in principle?
>
> On Wed, Jan 5, 2022 at 11:12 AM Mich Talebzadeh <mich.talebza...@gmail.com>
> wrote:
>
>> Hi Bo,
>>
>> Thanks for the info. Let me elaborate:
>>
>> In theory you can set the number of executors to a multiple of the
>> number of nodes. For example, if you have a three-node k8s cluster (in my
>> case Google GKE), you can set the number of executors to 6 and end up with
>> six executors queuing to start, but ultimately you finish with two running
>> executors plus the driver in the 3-node cluster, as shown below:
>>
>> hduser@ctpvm: /home/hduser> k get pods -n spark
>> NAME                                         READY   STATUS    RESTARTS   AGE
>> randomdatabigquery-d42d067e2b91c88a-exec-1   1/1     Running   0          33s
>> randomdatabigquery-d42d067e2b91c88a-exec-2   1/1     Running   0          33s
>> randomdatabigquery-d42d067e2b91c88a-exec-3   0/1     Pending   0          33s
>> randomdatabigquery-d42d067e2b91c88a-exec-4   0/1     Pending   0          33s
>> randomdatabigquery-d42d067e2b91c88a-exec-5   0/1     Pending   0          33s
>> randomdatabigquery-d42d067e2b91c88a-exec-6   0/1     Pending   0          33s
>> sparkbq-0beda77e2b919e01-driver              1/1     Running   0          45s
>>
>> hduser@ctpvm: /home/hduser> k get pods -n spark
>> NAME                                         READY   STATUS    RESTARTS   AGE
>> randomdatabigquery-d42d067e2b91c88a-exec-1   1/1     Running   0          38s
>> randomdatabigquery-d42d067e2b91c88a-exec-2   1/1     Running   0          38s
>> sparkbq-0beda77e2b919e01-driver              1/1     Running   0          50s
>>
>> hduser@ctpvm: /home/hduser> k get pods -n spark
>> NAME                                         READY   STATUS    RESTARTS   AGE
>> randomdatabigquery-d42d067e2b91c88a-exec-1   1/1     Running   0          40s
>> randomdatabigquery-d42d067e2b91c88a-exec-2   1/1     Running   0          40s
>> sparkbq-0beda77e2b919e01-driver              1/1     Running   0          52s
>>
>> So you end up with the four pending executors dropping out. Hence the
>> conclusion seems to be that, with the current model, you want to fit
>> exactly one Spark executor pod per Kubernetes node.
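>>
>> For reference, a minimal sketch of the kind of submission behind this
>> test (the container image, application path and resource values below are
>> assumed for illustration, not the exact ones used here):
>>
>> # Request more executors than the cluster can hold; the surplus stays
>> # Pending and is eventually dropped by the allocator.
>> spark-submit \
>>   --master k8s://https://<gke-master-endpoint>:443 \
>>   --deploy-mode cluster \
>>   --name randomdatabigquery \
>>   --conf spark.kubernetes.namespace=spark \
>>   --conf spark.kubernetes.container.image=<registry>/spark-py:3.1.1 \
>>   --conf spark.executor.instances=6 \
>>   --conf spark.executor.memory=8g \
>>   --conf spark.executor.cores=3 \
>>   local:///opt/spark/work-dir/RandomDataBigQuery.py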
>>
>> HTH
>>
>>
>> On Wed, 5 Jan 2022 at 17:01, bo yang <bobyan...@gmail.com> wrote:
>>
>>> Hi Mich,
>>>
>>> Curious what you mean by “The constraint seems to be that you can fit one
>>> Spark executor pod per Kubernetes node and from my tests you don't seem to
>>> be able to allocate more than 50% of RAM on the node to the container”.
>>> Could you explain a bit? Asking because there can be multiple executor
>>> pods running on a single Kubernetes node.
>>>
>>> Thanks,
>>> Bo
>>>
>>>
>>> On Wed, Jan 5, 2022 at 1:13 AM Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
>>>> Thanks William for the info.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> The current model of Spark on k8s has certain drawbacks with pod-based
>>>> scheduling, as I found when testing it on Google Kubernetes Engine (GKE).
>>>> The constraint seems to be that you can fit only one Spark executor pod
>>>> per Kubernetes node, and from my tests you don't seem to be able to
>>>> allocate more than 50% of the node's RAM to the container.
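>>>>
>>>> One way to see where that memory goes on GKE is to compare a node's
>>>> capacity with its allocatable resources (the node name below is an
>>>> assumption); kubelet/system reservations and eviction thresholds, plus
>>>> Spark's own memory overhead, are what keep a single container from
>>>> claiming all of the node's RAM:
>>>>
>>>> # Capacity = physical resources; Allocatable = what pods may actually use
>>>> kubectl describe node <gke-node-name> | grep -A 6 -E "Capacity|Allocatable"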
>>>>
>>>>
>>>> [image: gke_memoeyPlot.png]
>>>>
>>>>
>>>> Any more than that results in the container never being created (stuck
>>>> at Pending):
>>>>
>>>> kubectl describe pod sparkbq-b506ac7dc521b667-driver -n spark
>>>>
>>>> Events:
>>>>   Type     Reason             Age                   From                Message
>>>>   ----     ------             ----                  ----                -------
>>>>   Warning  FailedScheduling   17m                   default-scheduler   0/3 nodes are available: 3 Insufficient memory.
>>>>   Warning  FailedScheduling   17m                   default-scheduler   0/3 nodes are available: 3 Insufficient memory.
>>>>   Normal   NotTriggerScaleUp  2m28s (x92 over 17m)  cluster-autoscaler  pod didn't trigger scale-up:
>>>>
>>>> Obviously this is far from ideal; this model, although it works, is
>>>> not efficient.
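>>>>
>>>> A workaround (not a fix) is to shrink the requested memory until the
>>>> pod fits within the node's allocatable memory; a sketch with
>>>> illustrative values, not a recommendation:
>>>>
>>>> # Smaller requests plus the default 10% overhead factor keep the pod
>>>> # under the node's allocatable memory so it can be scheduled.
>>>> spark-submit \
>>>>   --master k8s://https://<gke-master-endpoint>:443 \
>>>>   --deploy-mode cluster \
>>>>   --conf spark.kubernetes.namespace=spark \
>>>>   --conf spark.driver.memory=4g \
>>>>   --conf spark.executor.memory=4g \
>>>>   --conf spark.kubernetes.memoryOverheadFactor=0.1 \
>>>>   local:///opt/spark/work-dir/app.py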
>>>>
>>>>
>>>> Cheers,
>>>>
>>>>
>>>> Mich
>>>>
>>>>
>>>> On Wed, 5 Jan 2022 at 03:55, William Wang <wang.platf...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Mich,
>>>>>
>>>>> Here are some performance indications for Volcano:
>>>>> 1. Scheduler throughput: 1.5k pods/s (default scheduler: 100 pods/s)
>>>>> 2. Spark application performance improved 30%+ with the minimal resource
>>>>> reservation feature in cases of insufficient resources (tested with TPC-DS)
>>>>>
>>>>> We are still working on more optimizations. Besides performance,
>>>>> Volcano is continuously being enhanced in the four directions below to
>>>>> provide the abilities that users care about:
>>>>> - Full lifecycle management for jobs
>>>>> - Scheduling policies for high-performance workloads (fair-share,
>>>>> topology, SLA, reservation, preemption, backfill, etc.)
>>>>> - Support for heterogeneous hardware
>>>>> - Performance optimization for high-performance workloads
>>>>>
>>>>> Thanks
>>>>> LeiBo
>>>>>
>>>>> Mich Talebzadeh <mich.talebza...@gmail.com> 于2022年1月4日周二 18:12写道:
>>>>>
>>>>>> Interesting, thanks.
>>>>>>
>>>>>> Do you have a ballpark figure (a rough numerical estimate) for how
>>>>>> much adding Volcano as an alternative scheduler will improve Spark on
>>>>>> k8s performance?
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>>
>>>>>> On Tue, 4 Jan 2022 at 09:43, Yikun Jiang <yikunk...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi, folks! Wishing you all the best in 2022.
>>>>>>>
>>>>>>> I'd like to share the current status on "Support Customized K8S
>>>>>>> Scheduler in Spark".
>>>>>>>
>>>>>>>
>>>>>>> https://docs.google.com/document/d/1xgQGRpaHQX6-QH_J9YV2C2Dh6RpXefUpLM7KGkzL6Fg/edit#heading=h.1quyr1r2kr5n
>>>>>>>
>>>>>>> Framework/Common support
>>>>>>>
>>>>>>> - The Volcano and Yunikorn teams have joined the discussion and
>>>>>>> completed the initial doc on the framework/common part.
>>>>>>>
>>>>>>> - SPARK-37145 <https://issues.apache.org/jira/browse/SPARK-37145>
>>>>>>> (under review): We proposed to extend the customized scheduler by just
>>>>>>> using a custom feature step; it will meet the requirements of a
>>>>>>> customized scheduler once it gets merged. After this, the user can
>>>>>>> enable the feature step and scheduler like:
>>>>>>>
>>>>>>> spark-submit \
>>>>>>>     --conf spark.kubernetes.scheduler.name=volcano \
>>>>>>>     --conf spark.kubernetes.driver.pod.featureSteps=org.apache.spark.deploy.k8s.features.scheduler.VolcanoFeatureStep \
>>>>>>>     --conf spark.kubernetes.job.queue=xxx
>>>>>>>
>>>>>>> (as above, the VolcanoFeatureStep will help to set the Spark scheduler
>>>>>>> queue according to the user-specified conf)
>>>>>>>
>>>>>>> - SPARK-37331 <https://issues.apache.org/jira/browse/SPARK-37331>:
>>>>>>> Added the ability to create Kubernetes resources before driver pod
>>>>>>> creation.
>>>>>>>
>>>>>>> - SPARK-36059 <https://issues.apache.org/jira/browse/SPARK-36059>:
>>>>>>> Add the ability to specify a scheduler in driver/executor
>>>>>>>
>>>>>>> After all of the above, the framework/common support will be ready
>>>>>>> for most customized schedulers.
>>>>>>>
>>>>>>> Volcano part:
>>>>>>>
>>>>>>> - SPARK-37258 <https://issues.apache.org/jira/browse/SPARK-37258>:
>>>>>>> Upgrade kubernetes-client to 5.11.1 to add volcano scheduler API 
>>>>>>> support.
>>>>>>>
>>>>>>> - SPARK-36061 <https://issues.apache.org/jira/browse/SPARK-36061>:
>>>>>>> Add a VolcanoFeatureStep to help users create a PodGroup with the
>>>>>>> user-specified minimum required resources; there is also a WIP commit
>>>>>>> showing a preview of this
>>>>>>> <https://github.com/Yikun/spark/pull/45/commits/81bf6f98edb5c00ebd0662dc172bc73f980b6a34>.
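>>>>>>>
>>>>>>> For context, the PodGroup such a feature step creates looks roughly
>>>>>>> like the following (names and values are illustrative; the exact spec
>>>>>>> would be generated from the Spark conf):
>>>>>>>
>>>>>>> kubectl apply -n spark -f - <<EOF
>>>>>>> apiVersion: scheduling.volcano.sh/v1beta1
>>>>>>> kind: PodGroup
>>>>>>> metadata:
>>>>>>>   name: spark-job-podgroup
>>>>>>> spec:
>>>>>>>   minMember: 1        # pods that must be schedulable before any start
>>>>>>>   minResources:       # minimum resources reserved for the gang
>>>>>>>     cpu: "4"
>>>>>>>     memory: "8Gi"
>>>>>>>   queue: default
>>>>>>> EOF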
>>>>>>>
>>>>>>> Yunikorn part:
>>>>>>>
>>>>>>> - @WeiweiYang is completing the doc for the Yunikorn part and
>>>>>>> implementing it.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Yikun
>>>>>>>
>>>>>>>
>>>>>>> Weiwei Yang <w...@apache.org> 于2021年12月2日周四 02:00写道:
>>>>>>>
>>>>>>>> Thank you Yikun for the info, and thanks for inviting me to a
>>>>>>>> meeting to discuss this.
>>>>>>>> I appreciate your effort to put these together, and I agree that
>>>>>>>> the purpose is to make Spark easy/flexible enough to support other K8s
>>>>>>>> schedulers (not just for Volcano).
>>>>>>>> As discussed, could you please help to abstract out the things in
>>>>>>>> common and allow Spark to plug different implementations? I'd be happy 
>>>>>>>> to
>>>>>>>> work with you guys on this issue.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Nov 30, 2021 at 6:49 PM Yikun Jiang <yikunk...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> @Weiwei @Chenya
>>>>>>>>>
>>>>>>>>> > Thanks for bringing this up. This is quite interesting, we
>>>>>>>>> definitely should participate more in the discussions.
>>>>>>>>>
>>>>>>>>> Thanks for your reply and welcome to join the discussion, I think
>>>>>>>>> the input from Yunikorn is very critical.
>>>>>>>>>
>>>>>>>>> > The main thing here is that the Spark community should make Spark
>>>>>>>>> pluggable in order to support other schedulers, not just Volcano. It
>>>>>>>>> looks like this proposal is pushing really hard for adopting PodGroup,
>>>>>>>>> which isn't part of K8s yet; that, to me, is problematic.
>>>>>>>>>
>>>>>>>>> Definitely yes, we are on the same page.
>>>>>>>>>
>>>>>>>>> I think we have the same goal: propose a general and reasonable
>>>>>>>>> mechanism to make Spark on k8s with a custom scheduler more usable.
>>>>>>>>>
>>>>>>>>> But for the PodGroup, just allow me to do a brief introduction:
>>>>>>>>> - The PodGroup definition has been officially approved by Kubernetes
>>>>>>>>> in KEP-583. [1]
>>>>>>>>> - It can be regarded as a general concept/standard in Kubernetes
>>>>>>>>> rather than a concept specific to Volcano; there are also other
>>>>>>>>> implementations of it, such as [2][3].
>>>>>>>>> - Kubernetes recommends using CRDs to implement further
>>>>>>>>> extensions. [4]
>>>>>>>>> - Volcano, as an extension, provides an interface to maintain the
>>>>>>>>> lifecycle of the PodGroup CRD and uses volcano-scheduler to complete
>>>>>>>>> the scheduling.
>>>>>>>>>
>>>>>>>>> [1]
>>>>>>>>> https://github.com/kubernetes/enhancements/tree/master/keps/sig-scheduling/583-coscheduling
>>>>>>>>> [2]
>>>>>>>>> https://github.com/kubernetes-sigs/scheduler-plugins/tree/master/pkg/coscheduling#podgroup
>>>>>>>>> [3] https://github.com/kubernetes-sigs/kube-batch
>>>>>>>>> [4]
>>>>>>>>> https://kubernetes.io/docs/tasks/extend-kubernetes/custom-resources/custom-resource-definitions/
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Yikun
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Weiwei Yang <w...@apache.org> 于2021年12月1日周三 上午5:57写道:
>>>>>>>>>
>>>>>>>>>> Hi Chenya
>>>>>>>>>>
>>>>>>>>>> Thanks for bringing this up. This is quite interesting, we
>>>>>>>>>> definitely should participate more in the discussions.
>>>>>>>>>> The main thing here is that the Spark community should make Spark
>>>>>>>>>> pluggable in order to support other schedulers, not just Volcano. It
>>>>>>>>>> looks like this proposal is pushing really hard for adopting PodGroup,
>>>>>>>>>> which isn't part of K8s yet; that, to me, is problematic.
>>>>>>>>>>
>>>>>>>>>> On Tue, Nov 30, 2021 at 9:21 AM Prasad Paravatha <
>>>>>>>>>> prasad.parava...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> This is a great feature/idea.
>>>>>>>>>>> I'd love to get involved in some form (testing and/or
>>>>>>>>>>> documentation). This could be my 1st contribution to Spark!
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Nov 30, 2021 at 10:46 PM John Zhuge <jzh...@apache.org>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> +1 Kudos to Yikun and the community for starting the discussion!
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Nov 30, 2021 at 8:47 AM Chenya Zhang <
>>>>>>>>>>>> chenyazhangche...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks folks for bringing up the topic of natively integrating
>>>>>>>>>>>>> Volcano and other alternative schedulers into Spark!
>>>>>>>>>>>>>
>>>>>>>>>>>>> +Weiwei, Wilfred, Chaoran. We would love to contribute to the
>>>>>>>>>>>>> discussion as well.
>>>>>>>>>>>>>
>>>>>>>>>>>>> From our side, we have been using and improving one
>>>>>>>>>>>>> alternative resource scheduler, Apache YuniKorn (
>>>>>>>>>>>>> https://yunikorn.apache.org/), for Spark on Kubernetes in
>>>>>>>>>>>>> production at Apple, with solid results over the past year. It is
>>>>>>>>>>>>> capable of supporting gang scheduling (similar to PodGroups),
>>>>>>>>>>>>> multi-tenant resource queues (similar to YARN), FIFO, and other
>>>>>>>>>>>>> handy features like bin packing to enable efficient autoscaling,
>>>>>>>>>>>>> etc.
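>>>>>>>>>>>>>
>>>>>>>>>>>>> As a rough sketch, once a scheduler name can be specified (per
>>>>>>>>>>>>> SPARK-36059 above), pods could be routed to YuniKorn with
>>>>>>>>>>>>> something like the following; the applicationId/queue labels
>>>>>>>>>>>>> follow YuniKorn's documented label conventions, and all values
>>>>>>>>>>>>> are illustrative:
>>>>>>>>>>>>>
>>>>>>>>>>>>> # Route driver and executor pods to the yunikorn scheduler and
>>>>>>>>>>>>> # tag them so YuniKorn groups them into one app in a queue.
>>>>>>>>>>>>> spark-submit \
>>>>>>>>>>>>>   --master k8s://https://<k8s-apiserver>:443 \
>>>>>>>>>>>>>   --deploy-mode cluster \
>>>>>>>>>>>>>   --conf spark.kubernetes.scheduler.name=yunikorn \
>>>>>>>>>>>>>   --conf spark.kubernetes.driver.label.applicationId=spark-app-001 \
>>>>>>>>>>>>>   --conf spark.kubernetes.executor.label.applicationId=spark-app-001 \
>>>>>>>>>>>>>   --conf spark.kubernetes.driver.label.queue=root.default \
>>>>>>>>>>>>>   local:///opt/spark/examples/src/main/python/pi.py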
>>>>>>>>>>>>>
>>>>>>>>>>>>> Natively integrating with Spark would provide more flexibility
>>>>>>>>>>>>> for users and reduce the extra cost and potential inconsistency of
>>>>>>>>>>>>> maintaining different layers of resource strategies. One
>>>>>>>>>>>>> interesting topic we hope to discuss further is dynamic
>>>>>>>>>>>>> allocation, which would benefit from native coordination between
>>>>>>>>>>>>> Spark and resource schedulers in K8s & cloud environments for
>>>>>>>>>>>>> optimal resource efficiency.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Nov 30, 2021 at 8:10 AM Holden Karau <
>>>>>>>>>>>>> hol...@pigscanfly.ca> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks for putting this together, I’m really excited for us
>>>>>>>>>>>>>> to add better batch scheduling integrations.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tue, Nov 30, 2021 at 12:46 AM Yikun Jiang <
>>>>>>>>>>>>>> yikunk...@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hey everyone,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I'd like to start a discussion on "Support
>>>>>>>>>>>>>>> Volcano/Alternative Schedulers Proposal".
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> This SPIP is proposed to make Spark k8s schedulers provide
>>>>>>>>>>>>>>> more YARN-like features (such as queues and minimum resources
>>>>>>>>>>>>>>> before scheduling jobs) that many folks want on Kubernetes.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The goal of this SPIP is to improve the current Spark k8s
>>>>>>>>>>>>>>> scheduler implementation, add the ability to do batch
>>>>>>>>>>>>>>> scheduling, and support Volcano as one of the implementations.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Design doc:
>>>>>>>>>>>>>>> https://docs.google.com/document/d/1xgQGRpaHQX6-QH_J9YV2C2Dh6RpXefUpLM7KGkzL6Fg
>>>>>>>>>>>>>>> JIRA: https://issues.apache.org/jira/browse/SPARK-36057
>>>>>>>>>>>>>>> Part of PRs:
>>>>>>>>>>>>>>> Ability to create resources
>>>>>>>>>>>>>>> https://github.com/apache/spark/pull/34599
>>>>>>>>>>>>>>> Add PodGroupFeatureStep:
>>>>>>>>>>>>>>> https://github.com/apache/spark/pull/34456
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>> Yikun
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>> Twitter: https://twitter.com/holdenkarau
>>>>>>>>>>>>>> Books (Learning Spark, High Performance Spark, etc.):
>>>>>>>>>>>>>> https://amzn.to/2MaRAG9
>>>>>>>>>>>>>> YouTube Live Streams:
>>>>>>>>>>>>>> https://www.youtube.com/user/holdenkarau
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> John Zhuge
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Regards,
>>>>>>>>>>> Prasad Paravatha
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>
> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>
