Re: Dynamic resource allocation for structured streaming [SPARK-24815]

Pavan Kotikalapudi Tue, 08 Aug 2023 10:03:09 -0700

As far as I know, listeners are the best source of information for the
allocation manager. It already uses a SparkListener
<https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala#L640>,
which we can use to extract more information (like processing times).
The listener with more information about the streaming query,
StreamingQueryListener, resides in the sql module
<https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/streaming/StreamingQueryListener.scala>
though.
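
As a rough sketch of what extracting processing times could look like (this
is not code from the PR): each progress update delivered to a
StreamingQueryListener can be rendered as JSON, and, as far as I know, its
durationMs map includes a triggerExecution entry with the time the micro-batch
took. The sample values and the helper names processing_time_ms and
utilization below are hypothetical and only for illustration; the snippet is
plain Python and does not touch Spark.

```python
import json

# Hypothetical sample shaped like the JSON Spark reports for one micro-batch
# (StreamingQueryProgress). The numbers are made up for illustration.
SAMPLE_PROGRESS = json.loads("""
{
  "batchId": 12,
  "numInputRows": 5000,
  "inputRowsPerSecond": 250.0,
  "processedRowsPerSecond": 400.0,
  "durationMs": {"addBatch": 9000, "triggerExecution": 12500}
}
""")


def processing_time_ms(progress):
    """Return the micro-batch processing time in ms, or None if it is absent."""
    return progress.get("durationMs", {}).get("triggerExecution")


def utilization(progress, trigger_interval_ms):
    """Return the fraction of the trigger interval spent processing the batch."""
    elapsed = processing_time_ms(progress)
    if elapsed is None or trigger_interval_ms <= 0:
        return None
    return elapsed / trigger_interval_ms


# Assumed trigger interval of 20 seconds, chosen only for this example.
ratio = utilization(SAMPLE_PROGRESS, trigger_interval_ms=20_000)
print(f"batch {SAMPLE_PROGRESS['batchId']}: utilization = {ratio}")
```

In a running PySpark job, a dict of the same shape should be available from
query.lastProgress, so a ratio like this could be one input to a scale-up or
scale-down decision.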


Thanks

Pavan

On Tue, Aug 8, 2023 at 5:43 AM Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

> Hi Pavan or anyone else
>
> Is there any way to access the metrics displayed in the Spark UI, for example
> the readings for processing time?
>
> Thanks
>
> Mich Talebzadeh,
> Solutions Architect/Engineering Lead
> London
> United Kingdom
>
>    view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Tue, 8 Aug 2023 at 06:44, Pavan Kotikalapudi <pkotikalap...@twilio.com>
> wrote:
>
>> Thanks for the review, Mich.
>>
>> Yes, the configuration parameters we end up setting would be based on the
>> trigger interval.
>>
>> > If you are going to have additional indicators why not look at
>> scheduling delay as well
>> Yes. The implementation is based on scheduling delays, not of the pending
>> tasks of the current stage but of the pending tasks of all the stages in
>> a micro-batch
>> <https://github.com/apache/spark/pull/42352/files#diff-fdddb0421641035be18233c212f0e3ccd2d6a49d345bd0cd4eac08fc4d911e21R1025>
>> (hence the trigger interval).
>>
>> > we ought to utilise the historical statistics collected under the
>> checkpointing directory to get more accurate statistics
>> You are right! This is just a simple implementation based on one factor.
>> We should also look into other indicators if that would help build
>> a better scaling algorithm.
>>
>> Thank you,
>>
>> Pavan
>>
>> On Mon, Aug 7, 2023 at 9:55 PM Mich Talebzadeh <mich.talebza...@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> I glanced over the design doc.
>>>
>>> You are providing certain configuration parameters plus some settings
>>> based on static values. For example:
>>>
>>> spark.dynamicAllocation.schedulerBacklogTimeout: 54s
>>>
>>> I cannot see any use of <processing time>, which ought to be at least
>>> half of the batch interval to have the correct margins (confidence level).
>>> If you are going to have additional indicators, why not look at scheduling
>>> delay as well? Moreover, most of the needed statistics are also available to
>>> set accurate values. My inclination is that this is a great effort, but
>>> we ought to utilise the historical statistics collected under the
>>> checkpointing directory to get more accurate statistics. I will review
>>> the design document in due course.
>>>
>>> HTH
>>>
>>>
>>> On Tue, 8 Aug 2023 at 01:30, Pavan Kotikalapudi
>>> <pkotikalap...@twilio.com.invalid> wrote:
>>>
>>>> Hi Spark Dev,
>>>>
>>>> I have extended traditional DRA (dynamic resource allocation) to work for
>>>> the structured streaming use case.
>>>>
>>>> Here is an initial draft implementation PR:
>>>> https://github.com/apache/spark/pull/42352
>>>> and the design doc:
>>>> https://docs.google.com/document/d/1_YmfCsQQb9XhRdKh0ijbc-j8JKGtGBxYsk_30NVSTWo/edit?usp=sharing
>>>>
>>>> Please review and let me know what you think.
>>>>
>>>> Thank you,
>>>>
>>>> Pavan
>>>>
>>>
