Re: Improved Structured Streaming Documentation Proof-of-Concept

2024-03-25 Thread Neil Ramaswamy
I'm glad you think it's generally a good idea!

I will mention, though, that with these better docs I've almost finished,
I'm hoping that Structured Streaming no longer stays a specialist topic
that requires "trench warfare." With good pedagogy, I think that it's very
approachable. The Knowledge Sharing Hub could be useful for e2e real-world
use-cases, but I think that operator semantics, stream configurations, etc.
have a better home in the official documentation.

Thanks for your engagement, Mich. Looking forward to hearing others'
opinions.

Neil

On Mon, Mar 25, 2024 at 2:50 PM Mich Talebzadeh 
wrote:

> Hi,
>
> Your intended work on improving the Structured Streaming documentation is
> great! Clear and well-organized instructions are important for everyone
> using Spark, beginners and experts alike.
> Having said that, Spark Structured Streaming, much like other specialist
> Spark topics (say, Kubernetes), cannot be mastered by documentation alone.
> These topics require a considerable amount of practice, and trench warfare
> so to speak, to master. Suffice it to say that I agree with the proposal to
> add examples. However, it is an area that many try to master but fail
> (judging by typical issues brought up in the user group and otherwise).
> Perhaps a section such as the proposed "Knowledge Sharing Hub" may become
> more relevant here. Moreover, the examples have to reflect real-life
> scenarios; otherwise they will be of limited use.
>
> HTH
>
> Mich Talebzadeh,
>
> Technologist | Data | Generative AI | Financial Fraud
>
> London
> United Kingdom
>
>
> view my LinkedIn profile
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> Disclaimer: The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed. It is essential to note
> that, as with any advice, "one test result is worth one-thousand expert
> opinions" (Werner von Braun).
>
>
> On Mon, 25 Mar 2024 at 21:19, Neil Ramaswamy  wrote:
>
>> Hi all,
>>
>> I recently started an effort to improve the Structured Streaming
>> documentation. I thought that the current documentation, while very
>> comprehensive, could be improved in terms of organization, clarity, and
>> presence of examples.
>>
>> You can view the repo here, and you can see a preview of the site here.
>> It's almost at full parity with the programming guide, and it also has
>> additional content, like a guide on unit testing and an in-depth
>> explanation of watermarks. I think it's at a point where we can bring this
>> to completion if it's something that the community wants.
>>
>> I'd love to hear feedback from everyone: is this something that we would
>> want to move forward with? As it borrows certain parts from the programming
>> guide, it has an Apache License, so I'd be more than happy if it is adopted
>> by an official Spark repo.
>>
>> Best,
>> Neil
>>
>


Re: Improved Structured Streaming Documentation Proof-of-Concept

2024-03-25 Thread Mich Talebzadeh
Hi,

Your intended work on improving the Structured Streaming documentation is
great! Clear and well-organized instructions are important for everyone
using Spark, beginners and experts alike.
Having said that, Spark Structured Streaming, much like other specialist
Spark topics (say, Kubernetes), cannot be mastered by documentation alone.
These topics require a considerable amount of practice, and trench warfare
so to speak, to master. Suffice it to say that I agree with the proposal to
add examples. However, it is an area that many try to master but fail
(judging by typical issues brought up in the user group and otherwise).
Perhaps a section such as the proposed "Knowledge Sharing Hub" may become
more relevant here. Moreover, the examples have to reflect real-life
scenarios; otherwise they will be of limited use.

HTH

Mich Talebzadeh,

Technologist | Data | Generative AI | Financial Fraud

London
United Kingdom


view my LinkedIn profile


 https://en.everybodywiki.com/Mich_Talebzadeh



Disclaimer: The information provided is correct to the best of my knowledge
but of course cannot be guaranteed. It is essential to note that, as with
any advice, "one test result is worth one-thousand expert opinions" (Werner
von Braun).


On Mon, 25 Mar 2024 at 21:19, Neil Ramaswamy  wrote:

> Hi all,
>
> I recently started an effort to improve the Structured Streaming
> documentation. I thought that the current documentation, while very
> comprehensive, could be improved in terms of organization, clarity, and
> presence of examples.
>
> You can view the repo here, and you can see a preview of the site here.
> It's almost at full parity with the programming guide, and it also has
> additional content, like a guide on unit testing and an in-depth
> explanation of watermarks. I think it's at a point where we can bring this
> to completion if it's something that the community wants.
>
> I'd love to hear feedback from everyone: is this something that we would
> want to move forward with? As it borrows certain parts from the programming
> guide, it has an Apache License, so I'd be more than happy if it is adopted
> by an official Spark repo.
>
> Best,
> Neil
>


Re: Allowing Unicode Whitespace in Lexer

2024-03-25 Thread Alex Cruise
While we're at it, maybe consider allowing "smart quotes" too :)

-0xe1a

On Sat, Mar 23, 2024 at 5:29 PM serge rielau.com  wrote:

> Hello,
>
> I have a PR https://github.com/apache/spark/pull/45620 ready to go that
> will extend the definition of whitespace (what separates tokens) from the
> small set of ASCII characters (space, tab, linefeed) to those defined in
> Unicode.
> While this is a small and safe change, it is one where we would have a
> hard time changing our minds about later.
> It is also a change that, AFAIK, cannot be controlled under a config.
>
> What does the community think?
>
> Cheers
> Serge
> SQL Architect at Databricks
>
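For readers following along, here is a toy sketch (plain Python, not Spark's actual lexer; the tokenizer below is invented for exposition) of what changes when whitespace is defined per Unicode rather than ASCII only:

```python
# Illustrative sketch only -- not Spark's lexer. It contrasts an
# ASCII-only whitespace definition with a Unicode-aware one, which is
# what the PR proposes for separating tokens.
ASCII_WS = {" ", "\t", "\n"}

def split_tokens(text, is_ws):
    """Split `text` into tokens, treating chars where is_ws(ch) is True
    as separators."""
    tokens, current = [], ""
    for ch in text:
        if is_ws(ch):
            if current:
                tokens.append(current)
            current = ""
        else:
            current += ch
    if current:
        tokens.append(current)
    return tokens

query = "SELECT\u00a01"  # U+00A0 NO-BREAK SPACE between keyword and literal

# ASCII-only: the no-break space is not a separator, so one token survives.
print(split_tokens(query, lambda c: c in ASCII_WS))  # ['SELECT\xa01']
# Unicode-aware: str.isspace() treats U+00A0 as whitespace.
print(split_tokens(query, str.isspace))              # ['SELECT', '1']
```

Characters like U+00A0 are easy to paste into a query by accident, which is part of why the change is hard to reverse later.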
>


Improved Structured Streaming Documentation Proof-of-Concept

2024-03-25 Thread Neil Ramaswamy
Hi all,

I recently started an effort to improve the Structured Streaming
documentation. I thought that the current documentation, while very
comprehensive, could be improved in terms of organization, clarity, and
presence of examples.

You can view the repo here, and you can see a preview of the site here. It's
almost at full parity with the programming guide, and it also has
additional content, like a guide on unit testing and an in-depth
explanation of watermarks. I think it's at a point where we can bring this
to completion if it's something that the community wants.
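To give a flavor of the watermark material mentioned above, here is a toy model (plain Python, not Spark code, and not taken from the proposed docs) of the rule an event-time watermark enforces:

```python
# Toy model (not Spark code) of an event-time watermark: once the
# watermark has advanced to (max event time seen) - (delay), records
# with an older event time are dropped as late data.
def watermark_filter(event_times, delay):
    max_seen = float("-inf")
    kept = []
    for t in event_times:
        max_seen = max(max_seen, t)
        if t >= max_seen - delay:  # on time relative to the watermark
            kept.append(t)
        # else: dropped as late data
    return kept

# The record at t=5 arrives after t=12 has been seen; with a delay of 3
# the watermark sits at 9, so 5 is dropped while 11 is still accepted.
print(watermark_filter([10, 12, 5, 11], delay=3))  # [10, 12, 11]
```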

I'd love to hear feedback from everyone: is this something that we would
want to move forward with? As it borrows certain parts from the programming
guide, it has an Apache License, so I'd be more than happy if it is adopted
by an official Spark repo.

Best,
Neil


Re: Vote on Dynamic resource allocation for structured streaming [SPARK-24815]

2024-03-25 Thread Bhuwan Sahni
Hi Pavan,

I looked at the PR, and the changes look simple and contained. It would be
useful to add dynamic resource allocation to Spark Structured Streaming.

Jungtaek, would you be able to shepherd this change?


On Tue, Mar 19, 2024 at 10:38 AM Bhuwan Sahni 
wrote:

> Thanks a lot for creating the risk table, Pavan. My apologies; I was tied
> up with high-priority items for the last couple of weeks and could not
> respond. I will review the PR by tomorrow's end and get back to you.
>
> Appreciate your patience.
>
> Thanks
> Bhuwan Sahni
>
> On Sun, Mar 17, 2024 at 4:42 PM Pavan Kotikalapudi <
> pkotikalap...@twilio.com> wrote:
>
>> Hi Bhuwan,
>>
>> I hope the team got a chance to review the draft PR; I am looking for
>> comments on whether the plan looks alright. I have updated the document
>> about the risks (also mentioned below). Please confirm that it looks
>> alright.
>>
>> Spark application type                              | auto-scaling capability | with new auto-scaling capability
>> ----------------------------------------------------|-------------------------|---------------------------------
>> Spark batch job                                     | works with current DRA  | no change
>> Streaming query without trigger interval            | no implementation       | can work with this implementation (have to set certain scale-back configs based on previous usage patterns); maybe automate with future work?
>> Spark Streaming query with trigger interval         | no implementation       | works with this implementation
>> Spark Streaming query with one-time micro-batch     | works with current DRA  | no change
>> Spark Streaming query with AvailableNow micro-batch | works with current DRA  | no change
>> Batch + streaming query (default / trigger-interval | no implementation       | no implementation
>> / once / AvailableNow modes), other notebook cases  |                         |
>>
>>
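As a purely hypothetical illustration of the trigger-interval-based scaling discussed in this thread (this is not the SPIP's actual algorithm; the function, thresholds, and parameter names are invented for exposition):

```python
# Hypothetical heuristic, invented for illustration -- NOT the SPIP's
# algorithm: scale the executor count based on how close the micro-batch
# processing time is to the trigger interval.
def desired_executors(current, batch_secs, trigger_secs,
                      scale_up_ratio=0.9, scale_down_ratio=0.5):
    ratio = batch_secs / trigger_secs
    if ratio > scale_up_ratio:    # batches nearly overrunning: add capacity
        return current + 1
    if ratio < scale_down_ratio:  # plenty of headroom: release capacity
        return max(1, current - 1)
    return current                # within the comfort band: hold steady

print(desired_executors(4, batch_secs=9.5, trigger_secs=10))  # 5
print(desired_executors(4, batch_secs=3.0, trigger_secs=10))  # 3
print(desired_executors(4, batch_secs=7.0, trigger_secs=10))  # 4
```

This also shows why queries without a trigger interval need separately configured scale-back thresholds, as the table notes: there is no interval to compare the batch duration against.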
>>
>> We are more than happy to collaborate on a call to make better progress
>> on this enhancement. Please let us know.
>>
>> Thank you,
>>
>> Pavan
>>
>> On Fri, Mar 1, 2024 at 12:26 PM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>>
>>> Hi Bhuwan et al,
>>>
>>> Thank you for passing on the Databricks Structured Streaming team's
>>> review of the SPIP document. FYI, I work closely with Pavan and other
>>> members to help deliver this piece of work. We appreciate your insights,
>>> especially regarding the cost-savings potential shown in the PoC.
>>>
>>> Pavan has already furnished you with some additional info. Your team's
>>> point about the SPIP currently addressing a specific use case (a single
>>> streaming query with a processing-time trigger) is well taken. We agree
>>> that maintaining simplicity is key, particularly as we explore more
>>> general resource allocation mechanisms in the future. To address the
>>> concerns and foster open discussion, the Databricks team is invited to
>>> add their comments and suggestions directly to the Jira itself:
>>>
>>> [SPARK-24815] Structured Streaming should support dynamic allocation -
>>> ASF JIRA (apache.org)
>>> 
>>> This will ensure everyone involved can benefit from your team's
>>> expertise and facilitate further collaboration.
>>>
>>> Thanks
>>>
>>> Mich Talebzadeh,
>>> Dad | Technologist | Solutions Architect | Engineer
>>> London
>>> United Kingdom
>>>
>>>
>>>view my Linkedin profile
>>> 
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>> 
>>>
>>>
>>>
>>> *Disclaimer:* The information provided is correct to the best of my
>>> knowledge but of course cannot be guaranteed . It is essential to note
>>> that, as with any advice, quote "one test result is worth one-thousand
>>> expert opinions (Werner von Braun)".
>>>
>>>
>>> On Fri, 1 Mar 2024 at 19:59, Pavan Kotikalapudi
>>>  wrote:
>>>
 Thanks Bhuwan and the rest of the Databricks team for the reviews.

 I appreciate them; they were very helpful in evaluating a few options that
 were overlooked earlier (especially about mixed Spark apps running on
 notebooks). 

Re: [DISCUSS] MySQL version support policy

2024-03-25 Thread Cheng Pan
Thanks, Dongjoon, for your reply and questions.

> A. Adding a new Apache Spark community policy (contract) to guarantee MySQL
> LTS Versions Support.

Yes, at least the latest MySQL LTS version. To reduce the maintenance effort 
on the Spark side, I think we can run CI against only the latest LTS version 
but accept reasonable patches for compatibility with older LTS versions. For 
example, Spark on K8s is only verified with the latest minikube in CI, and 
also accepts reasonable patches for older K8s versions.

> B. Dropping the support of non-LTS version support (MySQL 8.3/8.2/8.1)

Those versions likely work well too. For example, Spark currently officially 
supports JDK 17 and 21; it likely works on JDK 20 too, but that has not been 
verified by the community.

> 1. For (A), do you mean MySQL LTS versions are not supported by Apache Spark 
> releases properly due to the improper test suite?

Not yet. MySQL has retained good backward compatibility so far; I see a lot 
of users use MySQL 8.0 drivers to access both MySQL 5.7 and 8.0 servers 
through the Spark JDBC data source, and everything has gone well so far.

> 2. For (B), why does Apache Spark need to drop non-LTS MySQL support?

I think we can accept reasonable patches with careful review, but neither an 
official support declaration nor CI verification is required, just as we do 
for JDK version support.

> 3. What about MariaDB? Do we need to stick to some versions?

I’m not familiar with MariaDB, but I would treat it as a MySQL-compatible 
product, in the same position as Amazon RDS for MySQL: neither an official 
support declaration nor CI verification is required, but considering the 
adoption rate of those products, reasonable patches should be considered too.

Thanks,
Cheng Pan

On 2024/03/25 06:47:10 Dongjoon Hyun wrote:
> Hi, Cheng.
> 
> Thank you for the suggestion. Your suggestion seems to have at least two
> themes.
> 
> A. Adding a new Apache Spark community policy (contract) to guarantee MySQL
> LTS Versions Support.
> B. Dropping the support of non-LTS version support (MySQL 8.3/8.2/8.1)
> 
> And, it brings me three questions.
> 
> 1. For (A), do you mean MySQL LTS versions are not supported by Apache
> Spark releases properly due to the improper test suite?
> 2. For (B), why does Apache Spark need to drop non-LTS MySQL support?
> 3. What about MariaDB? Do we need to stick to some versions?
> 
> To be clear, if needed, we can have daily GitHub Action CIs easily like
> Python CI (Python 3.8/3.10/3.11/3.12).
> 
> -
> https://github.com/apache/spark/blob/master/.github/workflows/build_python.yml
> 
> Thanks,
> Dongjoon.
> 
> 
> On Sun, Mar 24, 2024 at 10:29 PM Cheng Pan  wrote:
> 
> > Hi, Spark community,
> >
> > I noticed that the Spark JDBC connector MySQL dialect is testing against
> > the 8.3.0[1] now, a non-LTS version.
> >
> > MySQL changed its version policy recently[2]; it is now very similar to
> > the Java version policy. In short, 5.5, 5.6, 5.7, and 8.0 are LTS
> > versions; 8.1, 8.2, and 8.3 are non-LTS; and the next LTS version is 8.4.
> >
> > I would say that MySQL is one of the most important pieces of
> > infrastructure today. I checked the AWS RDS MySQL[4] and Azure Database
> > for MySQL[5] version support policies, and both only support 5.7 and 8.0.
> >
> > Also, Spark officially only supports LTS Java versions, like JDK 17 and
> > 21, but not 22. I would recommend using MySQL 8.0 for testing until the
> > next MySQL LTS version (8.4) is available.
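As a small illustrative helper (not an official Spark utility; the version sets below simply encode the policy described above), CI tooling could classify versions like so:

```python
# Illustrative sketch only -- not part of Spark. Classify a MySQL server
# version as LTS or non-LTS under the policy described in this thread:
# 5.5/5.6/5.7/8.0 are LTS, 8.1-8.3 are non-LTS, 8.4 is the next LTS.
MYSQL_LTS = {"5.5", "5.6", "5.7", "8.0", "8.4"}

def is_mysql_lts(version: str) -> bool:
    """True if `version` (e.g. '8.0.36') belongs to an LTS series."""
    major_minor = ".".join(version.split(".")[:2])
    return major_minor in MYSQL_LTS

print(is_mysql_lts("8.0.36"))  # True  -- the series proposed for CI
print(is_mysql_lts("8.3.0"))   # False -- the non-LTS version tested today
```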
> >
> > Additional discussion can be found at [3]
> >
> > [1] https://issues.apache.org/jira/browse/SPARK-47453
> > [2]
> > https://dev.mysql.com/blog-archive/introducing-mysql-innovation-and-long-term-support-lts-versions/
> > [3] https://github.com/apache/spark/pull/45581
> > [4] https://aws.amazon.com/rds/mysql/
> > [5] https://learn.microsoft.com/en-us/azure/mysql/concepts-version-policy
> >
> > Thanks,
> > Cheng Pan
> >
> >
> >
> > -
> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >
> >
>  






Re: [DISCUSS] MySQL version support policy

2024-03-25 Thread Dongjoon Hyun
Hi, Cheng.

Thank you for the suggestion. Your suggestion seems to have at least two
themes.

A. Adding a new Apache Spark community policy (contract) to guarantee MySQL
LTS Versions Support.
B. Dropping the support of non-LTS version support (MySQL 8.3/8.2/8.1)

And, it brings me three questions.

1. For (A), do you mean MySQL LTS versions are not supported by Apache
Spark releases properly due to the improper test suite?
2. For (B), why does Apache Spark need to drop non-LTS MySQL support?
3. What about MariaDB? Do we need to stick to some versions?

To be clear, if needed, we can have daily GitHub Action CIs easily like
Python CI (Python 3.8/3.10/3.11/3.12).

-
https://github.com/apache/spark/blob/master/.github/workflows/build_python.yml

Thanks,
Dongjoon.


On Sun, Mar 24, 2024 at 10:29 PM Cheng Pan  wrote:

> Hi, Spark community,
>
> I noticed that the Spark JDBC connector MySQL dialect is testing against
> the 8.3.0[1] now, a non-LTS version.
>
> MySQL changed its version policy recently[2]; it is now very similar to
> the Java version policy. In short, 5.5, 5.6, 5.7, and 8.0 are LTS
> versions; 8.1, 8.2, and 8.3 are non-LTS; and the next LTS version is 8.4.
>
> I would say that MySQL is one of the most important pieces of
> infrastructure today. I checked the AWS RDS MySQL[4] and Azure Database
> for MySQL[5] version support policies, and both only support 5.7 and 8.0.
>
> Also, Spark officially only supports LTS Java versions, like JDK 17 and
> 21, but not 22. I would recommend using MySQL 8.0 for testing until the
> next MySQL LTS version (8.4) is available.
>
> Additional discussion can be found at [3]
>
> [1] https://issues.apache.org/jira/browse/SPARK-47453
> [2]
> https://dev.mysql.com/blog-archive/introducing-mysql-innovation-and-long-term-support-lts-versions/
> [3] https://github.com/apache/spark/pull/45581
> [4] https://aws.amazon.com/rds/mysql/
> [5] https://learn.microsoft.com/en-us/azure/mysql/concepts-version-policy
>
> Thanks,
> Cheng Pan
>
>
>
>
>