IMO, using any kind of machine learning or AI for DRA is overkill. The effort involved would be considerable and likely counterproductive compared to a more conventional approach: comparing the rate of incoming stream data against the rate at which previous data was processed.

________________________________
From: Mich Talebzadeh <mich.talebza...@gmail.com>
Sent: Tuesday, August 8, 2023 19:59
To: Pavan Kotikalapudi <pkotikalap...@twilio.com>
Cc: dev@spark.apache.org
Subject: Re: Dynamic resource allocation for structured streaming [SPARK-24815]
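[The rate-comparison approach suggested at the top of this thread could be sketched roughly as below. This is a minimal illustration, not anything from the PR; all names and the headroom factor are hypothetical.]

```python
import math

def desired_executors(input_rows_per_sec: float,
                      rows_per_sec_per_executor: float,
                      current_executors: int,
                      headroom: float = 1.2) -> int:
    """Executor count needed to keep up with the incoming rate,
    padded by a safety margin (headroom) to absorb bursts."""
    if rows_per_sec_per_executor <= 0:
        # no throughput signal observed yet; keep the current allocation
        return current_executors
    needed = input_rows_per_sec * headroom / rows_per_sec_per_executor
    # never scale below one executor; round up to cover the remainder
    return max(1, math.ceil(needed))

# e.g. 10k rows/s arriving, each executor observed handling ~1.5k rows/s:
print(desired_executors(10_000, 1_500, 4))  # -> 8
```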
I am currently contemplating and sharing my thoughts openly. Given our reliance on previously collected statistics (as mentioned earlier), it raises the question of why we couldn't integrate certain machine learning elements into Spark Structured Streaming. While this might slightly deviate from our current topic, I am not an expert in machine learning; however, there are individuals with the expertise to help us explore this avenue.

HTH

Mich Talebzadeh,
Solutions Architect/Engineering Lead
London, United Kingdom

view my Linkedin profile: https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/
https://en.everybodywiki.com/Mich_Talebzadeh

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.

On Tue, 8 Aug 2023 at 18:01, Pavan Kotikalapudi <pkotikalap...@twilio.com> wrote:

Listeners are the best resource available to the allocation manager, AFAIK. It already has a SparkListener (https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala#L640) that it utilizes; we can use it to extract more information (such as processing times). The listener with more information about the streaming query lives in the sql module, though (https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/streaming/StreamingQueryListener.scala).
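[A hedged sketch of the kind of signal such a listener could feed the allocation manager. The progress payload is mocked here as a plain dict mirroring the `durationMs` / `triggerExecution` fields of StreamingQueryProgress; the function name and the scale-down interpretation are hypothetical.]

```python
def batch_utilisation(progress: dict, trigger_interval_ms: int) -> float:
    """Fraction of the trigger interval the last micro-batch consumed."""
    # triggerExecution is the end-to-end processing time of the micro-batch
    return progress["durationMs"]["triggerExecution"] / trigger_interval_ms

# mocked onQueryProgress event: batch took 4.5s against a 10s trigger
progress = {"durationMs": {"triggerExecution": 4_500}}
util = batch_utilisation(progress, trigger_interval_ms=10_000)
print(util)  # -> 0.45: well under the interval, a scale-down candidate
```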
Thanks,
Pavan

On Tue, Aug 8, 2023 at 5:43 AM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

Hi Pavan (or anyone else),

Is there any way one can access the metrics displayed on the Spark GUI, for example the readings for processing time? Can these be accessed?

Thanks,

Mich Talebzadeh

On Tue, 8 Aug 2023 at 06:44, Pavan Kotikalapudi <pkotikalap...@twilio.com> wrote:

Thanks for the review, Mich. Yes, the configuration parameters we end up setting would be based on the trigger interval.

> If you are going to have additional indicators why not look at scheduling delay as well

Yes.
The implementation is based on scheduling delays: not the pending tasks of the current stage, but the pending tasks of all the stages in a micro-batch (hence the trigger interval): https://github.com/apache/spark/pull/42352/files#diff-fdddb0421641035be18233c212f0e3ccd2d6a49d345bd0cd4eac08fc4d911e21R1025

> we ought to utilise the historical statistics collected under the checkpointing directory to get more accurate statistics

You are right! This is just a simple implementation based on one factor; we should also look into other indicators as well, if that would help build a better scaling algorithm.

Thank you,
Pavan

On Mon, Aug 7, 2023 at 9:55 PM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

Hi,

I glanced over the design doc. You are providing certain configuration parameters plus some settings based on static values. For example:

spark.dynamicAllocation.schedulerBacklogTimeout: 54s

I cannot see any use of <processing time>, which ought to be at least half of the batch interval to have the correct margins (confidence level). If you are going to have additional indicators, why not look at scheduling delay as well? Moreover, most of the needed statistics are also available to set accurate values. My inclination is that this is a great effort, but we ought to utilise the historical statistics collected under the checkpointing directory to get more accurate statistics.
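[For context, the static settings under discussion would live in spark-defaults.conf, along the lines of the fragment below. Only the 54s backlog timeout appears in this thread; every other value is purely illustrative.]

```
# spark-defaults.conf fragment -- illustrative values; only the 54s figure
# comes from the design doc quoted in this thread
spark.dynamicAllocation.enabled                   true
spark.dynamicAllocation.shuffleTracking.enabled   true
spark.dynamicAllocation.minExecutors              2
spark.dynamicAllocation.maxExecutors              20
spark.dynamicAllocation.schedulerBacklogTimeout   54s
spark.dynamicAllocation.executorIdleTimeout       60s
```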
I will review the design document in due course.

HTH,

Mich Talebzadeh

On Tue, 8 Aug 2023 at 01:30, Pavan Kotikalapudi <pkotikalap...@twilio.com.invalid> wrote:

Hi Spark Dev,

I have extended traditional DRA to work for the structured streaming use case. Here is an initial implementation draft PR: https://github.com/apache/spark/pull/42352 and design doc: https://docs.google.com/document/d/1_YmfCsQQb9XhRdKh0ijbc-j8JKGtGBxYsk_30NVSTWo/edit?usp=sharing

Please review and let me know what you think.

Thank you,
Pavan