Re: [VOTE] SPIP: Pure Python Package in PyPI (Spark Connect)

2024-04-01 Thread Mridul Muralidharan
+1

Regards,
Mridul




Re: [VOTE] SPIP: Pure Python Package in PyPI (Spark Connect)

2024-04-01 Thread Holden Karau
+1

Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9
YouTube Live Streams: https://www.youtube.com/user/holdenkarau




Re: [VOTE] SPIP: Pure Python Package in PyPI (Spark Connect)

2024-04-01 Thread Xinrong Meng
+1

Thank you @Hyukjin Kwon



Re: [VOTE] SPIP: Pure Python Package in PyPI (Spark Connect)

2024-04-01 Thread bo yang
+1 (non-binding)



Re: [VOTE] SPIP: Pure Python Package in PyPI (Spark Connect)

2024-04-01 Thread Felix Cheung
+1



Re: [VOTE] SPIP: Pure Python Package in PyPI (Spark Connect)

2024-04-01 Thread Yuanjian Li
+1




Re: [VOTE] SPIP: Pure Python Package in PyPI (Spark Connect)

2024-04-01 Thread Denny Lee
+1 (non-binding)




Re: [VOTE] SPIP: Pure Python Package in PyPI (Spark Connect)

2024-04-01 Thread Xiao Li
+1



Re: [VOTE] SPIP: Pure Python Package in PyPI (Spark Connect)

2024-04-01 Thread Hussein Awala
+1 (non-binding). To add to the difference this will make: it will also
simplify package maintenance and make it easy to release a bug fix or a new
feature without waiting for a full PySpark release.




Re: [VOTE] SPIP: Pure Python Package in PyPI (Spark Connect)

2024-04-01 Thread Chao Sun
+1

On Sun, Mar 31, 2024 at 10:31 PM Hyukjin Kwon wrote:

> Oh, I didn't send a discussion thread out, as it's pretty simple and
> non-invasive, and the discussion was largely done as part of the initial
> Spark Connect discussion.
>
> On Mon, Apr 1, 2024 at 1:59 PM Mridul Muralidharan wrote:
>
>>
>> Can you point me to the SPIP’s discussion thread, please?
>> I was not able to find it, but I was on vacation, and so might have
>> missed this …
>>
>>
>> Regards,
>> Mridul
>>
>
>> On Sun, Mar 31, 2024 at 9:08 PM Haejoon Lee wrote:
>>
>>> +1
>>>
>>> On Mon, Apr 1, 2024 at 10:15 AM Hyukjin Kwon wrote:
>>>
>>>> Hi all,
>>>>
>>>> I'd like to start the vote for SPIP: Pure Python Package in PyPI (Spark
>>>> Connect)
>>>>
>>>> JIRA
>>>> Prototype
>>>> SPIP doc
>>>>
>>>> Please vote on the SPIP for the next 72 hours:
>>>>
>>>> [ ] +1: Accept the proposal as an official SPIP
>>>> [ ] +0
>>>> [ ] -1: I don’t think this is a good idea because …
>>>>
>>>> Thanks.

>>>


Re: Scheduling jobs using FAIR pool

2024-04-01 Thread Hussein Awala
IMO the questions are not limited to Databricks.

> The round-robin distribution of executors only works when there are free
> executors (achievable by enabling dynamic allocation). If the jobs in the
> same pool require all executors, a second job will still need to wait.

This feature in Spark allows for optimal resource utilization. Consider a
scenario with two stages, each with 500 tasks (500 partitions), generated
by two threads, and a total of 100 Spark executors available in the fair
pool. The first thread may start microseconds ahead of the second, so the
fair scheduler initially allocates all 100 executors to the first stage.
As tasks complete, the scheduler dynamically redistributes resources,
ultimately splitting the capacity equally between both stages. A single
stage works the same way, just without the capacity being split.
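
As a concrete illustration, here is a minimal sketch of that scenario: two
threads submit one job each against a shared SparkSession with the FAIR
scheduler enabled. The app name, pool names, and row counts are
illustrative placeholders, not taken from this thread.

    # Minimal sketch: two jobs submitted from separate threads that share
    # one SparkContext under the FAIR scheduler. Each thread opts into its
    # own pool, so the scheduler splits capacity between the running stages.
    import threading

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("fair-scheduler-sketch")  # placeholder app name
        .config("spark.scheduler.mode", "FAIR")
        .getOrCreate()
    )

    def run_job(pool: str) -> None:
        # setLocalProperty is per-thread, so each job lands in its own pool.
        spark.sparkContext.setLocalProperty("spark.scheduler.pool", pool)
        # 500 partitions stands in for the 500-task stages in the example.
        n = spark.range(0, 5_000_000, numPartitions=500).count()
        print(f"{pool}: {n} rows")

    threads = [threading.Thread(target=run_job, args=(f"pool-{i}",)) for i in (1, 2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

Pools that are not declared in an allocation file are created on the fly
with default settings, so both jobs here end up with equal weight.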

Regarding the other three questions, dynamically creating pools may not be
advisable due to several considerations: cleanup issues, mixing application
concerns with infrastructure management, and a number of other unexpected
issues.
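
For contrast with dynamic pool creation, a rough sketch of the
pre-defined-pool alternative follows. The file path, pool name, weight, and
minShare values are placeholder assumptions; the config keys and XML schema
are Spark's documented fair-scheduler ones.

    # Sketch: declaring pools up front in an allocation file instead of
    # creating them dynamically. Path and pool parameters are placeholders.
    from pyspark.sql import SparkSession

    ALLOCATION_FILE = "/tmp/fairscheduler.xml"  # placeholder location

    with open(ALLOCATION_FILE, "w") as f:
        f.write("""<?xml version="1.0"?>
    <allocations>
      <pool name="streams">
        <schedulingMode>FAIR</schedulingMode>
        <weight>2</weight>
        <minShare>10</minShare>
      </pool>
    </allocations>
    """)

    spark = (
        SparkSession.builder
        .config("spark.scheduler.mode", "FAIR")
        .config("spark.scheduler.allocation.file", ALLOCATION_FILE)
        .getOrCreate()
    )

    # A thread then opts into the pre-defined pool by name:
    spark.sparkContext.setLocalProperty("spark.scheduler.pool", "streams")

Any pool referenced but not declared in the file is created with default
settings (FIFO intra-pool mode, weight 1, minShare 0), which is why
dynamically named pools all end up with equal weight.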

For scenarios like yours, involving stages with a few long-running tasks,
it's recommended to enable dynamic allocation so that Spark adds executors
as needed.

In the context of streaming workloads, streaming dynamic allocation is
preferred, as it addresses the specific issues detailed in SPARK-12133.
Although the configurations for this feature are not documented, they can
be found in the source code. But for Structured Streaming (your case), you
should use the batch configurations (spark.dynamicAllocation.*), since
SPARK-24815 is not ready yet (it was accepted and should be ready soon).
Note that batch dynamic allocation has some issues in the scale-down step;
you can check the JIRA issue for more details.
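
As a sketch of that recommendation, the batch dynamic-allocation settings
for a Structured Streaming application might look like the following; the
executor bounds and idle timeout are placeholders to adjust per workload.

    # Sketch: batch dynamic allocation (spark.dynamicAllocation.*) applied
    # to a Structured Streaming app, as suggested above. All numbers are
    # placeholders.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("streaming-dynamic-allocation")  # placeholder app name
        .config("spark.dynamicAllocation.enabled", "true")
        # Without an external shuffle service, enable shuffle tracking so
        # executors holding shuffle data are not removed while it is needed.
        .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
        .config("spark.dynamicAllocation.minExecutors", "2")
        .config("spark.dynamicAllocation.maxExecutors", "100")
        .config("spark.dynamicAllocation.executorIdleTimeout", "60s")
        .getOrCreate()
    )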


Re: Scheduling jobs using FAIR pool

2024-04-01 Thread Varun Shah
Hi Mich,

I did not post in the Databricks community, as most of the questions are
related to Spark itself.

But let me also post the question in the Databricks community.

Thanks,
Varun Shah



Re: Scheduling jobs using FAIR pool

2024-04-01 Thread Mich Talebzadeh
Hi,

Have you put this question to the Databricks forum?

Data Engineering - Databricks



Mich Talebzadeh,
Technologist | Solutions Architect | Data Engineer | Generative AI
London
United Kingdom

view my Linkedin profile

https://en.everybodywiki.com/Mich_Talebzadeh

*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed. It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions" (Werner von Braun).


On Mon, 1 Apr 2024 at 07:22, Varun Shah wrote:

> Hi Community,
>
> I am currently exploring the best use of "Scheduler Pools" for executing
> jobs in parallel, and require clarification and suggestions on a few points.
>
> The implementation consists of executing "Structured Streaming" jobs on
> Databricks using AutoLoader. Each stream is executed with trigger =
> 'AvailableNow', ensuring that the streams don't keep running against the
> source (we have ~4000 such streams, with no continuous stream from the
> source, hence we don't keep the streams running indefinitely with other
> triggers).
>
> One way to achieve parallelism across the jobs is multithreading, all
> using the same SparkContext, as quoted from the official docs: "Inside a
> given Spark application (SparkContext instance), multiple parallel jobs
> can run simultaneously if they were submitted from separate threads."
>
> There's also the FAIR scheduler which, instead of the default FIFO
> scheduler, assigns resources in round-robin fashion, ensuring that
> smaller jobs submitted later do not starve while bigger jobs submitted
> earlier consume all the resources.
>
> Here are my questions:
> 1. The round-robin distribution of executors only works when there are
> free executors (achievable by enabling dynamic allocation). If the jobs
> in the same pool require all executors, a second job will still need to
> wait.
> 2. If we create dynamic pools for submitting each stream (by setting the
> Spark property "spark.scheduler.pool" to a dynamic value via
> spark.sparkContext.setLocalProperty("spark.scheduler.pool", "<some
> string>")), how does executor allocation happen? Since all pools are
> created dynamically, they share equal weight. Does this also work the
> same way as submitting streams to a single pool under the FAIR scheduler?
> (See the sketch after this message.)
> 3. The official docs state that "inside each pool, jobs run in FIFO
> order". Is this true for the FAIR scheduler also? By definition, it does
> not seem right, but it's confusing. The docs say "by default", so does
> that mean only for the FIFO scheduler, or by default for both scheduling
> modes?
> 4. Is there any overhead on the Spark driver when creating and using
> dynamically created pools vs. pre-defined pools?
>
> Apart from these, do you have any suggestions, or ways you have
> implemented auto-scaling for such loads? We are currently trying to
> auto-scale the resources based on requests, but scaling down is an issue
> (already known, and an SPIP is in discussion, but it does not cater to
> submitting multiple streams in a single cluster).
>
> Thanks for reading! Looking forward to your suggestions.
>
> Regards,
> Varun Shah
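
Below is a minimal sketch of the pattern in questions 1 and 2: AvailableNow
streams started from one SparkSession, each tagged with its own scheduler
pool before start(). The paths, schema, and pool-naming scheme are
illustrative assumptions, and a plain Parquet file source stands in for
Databricks AutoLoader.

    # Sketch: several trigger=AvailableNow streams from one SparkSession,
    # each assigned a dynamically named scheduler pool (question 2). Paths,
    # schema, and pool names are made up; Parquet stands in for AutoLoader.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("parallel-availablenow-streams")  # placeholder app name
        .config("spark.scheduler.mode", "FAIR")
        .getOrCreate()
    )

    sources = ["/data/in/a", "/data/in/b", "/data/in/c"]  # placeholder paths
    queries = []
    for i, path in enumerate(sources):
        # The pool is read from the thread that starts the query.
        spark.sparkContext.setLocalProperty("spark.scheduler.pool", f"pool_{i}")
        df = (
            spark.readStream
            .format("parquet")
            .schema("id LONG, value STRING")  # file sources require a schema
            .load(path)
        )
        q = (
            df.writeStream
            .trigger(availableNow=True)  # drain available input, then stop
            .option("checkpointLocation", f"/chk/{i}")  # placeholder paths
            .start(f"/data/out/{i}")
        )
        queries.append(q)

    for q in queries:
        q.awaitTermination()

Since start() returns immediately and each query runs its micro-batches on
its own thread, explicit multithreading is not required just to run the
queries concurrently.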


Re: [VOTE] SPIP: Pure Python Package in PyPI (Spark Connect)

2024-04-01 Thread Kent Yao
+1 (non-binding). Thank you, Hyukjin.

Kent Yao

On Mon, Apr 1, 2024 at 6:04 PM, Takuya UESHIN wrote:
>
> +1
>
> --
> Takuya UESHIN
>
