Re: Contributions and help needed in SPARK-40005

2022-08-30 Thread Khalid Mammadov
Will do, thanks!

On Wed, 31 Aug 2022, 01:14 Hyukjin Kwon,  wrote:

> Oh, that's a mistake. Please just go ahead and reuse that JIRA :-).
> You can create a PR reusing the same JIRA ID for functions.py.
>
> On Wed, 31 Aug 2022 at 01:18, Khalid Mammadov 
> wrote:
>
>> Hi @Hyukjin Kwon 
>>
>> I see you have resolved the JIRA, but I still have more to do in
>> functions.py (only about 50% done). So shall I create a new JIRA for each
>> new PR, or is it OK to reuse this one?
>>
>> On Fri, 19 Aug 2022, 09:29 Khalid Mammadov, 
>> wrote:
>>
>>> Will do, thanks!
>>>
>>> On Fri, 19 Aug 2022, 09:11 Hyukjin Kwon,  wrote:
>>>
 Sure, that would be great.

 I did the first 25 functions in functions.py. Please go ahead with the
 rest of them.
 You can create a PR with a title such
 as [SPARK-40142][PYTHON][SQL][FOLLOW-UP] Make pyspark.sql.functions
 examples self-contained (part 2, 25 functions)

 Thanks!

 On Fri, 19 Aug 2022 at 16:50, Khalid Mammadov <
 khalidmammad...@gmail.com> wrote:

> I am picking up "functions.py" if no one has already
>
> On Fri, 19 Aug 2022, 07:56 Khalid Mammadov, 
> wrote:
>
>> I thought it was all finished (I checked a few). Do you have a list of the
>> remaining 50%?
>> Happy to contribute 
>>
>> On Fri, 19 Aug 2022, 05:54 Hyukjin Kwon,  wrote:
>>
>>> We're halfway there, roughly 50%. More contributions would be very
>>> helpful.
>>> If the size of the file is too large, feel free to split it to
>>> multiple parts (e.g., https://github.com/apache/spark/pull/37575)
>>>
>>> On Tue, 9 Aug 2022 at 12:26, Qian SUN 
>>> wrote:
>>>
 Sure, I will do it. SPARK-40010 was created to
 track progress.

 Hyukjin Kwon gurwls...@gmail.com wrote on Tue, 9 Aug 2022 at 10:58:

 Please go ahead. It would be much appreciated.
>
> On Tue, 9 Aug 2022 at 11:58, Qian SUN 
> wrote:
>
>> Hi Hyukjin
>>
>> I would like to do some work and pick up *Window.py* if possible.
>>
>> Thanks,
>> Qian
>>
>> Hyukjin Kwon wrote on Tue, 9 Aug 2022 at 10:41:
>>
>>> Thanks Khalid for taking a look.
>>>
>>> On Tue, 9 Aug 2022 at 00:37, Khalid Mammadov <
>>> khalidmammad...@gmail.com> wrote:
>>>
 Hi Hyukjin
 That's a great initiative. Here is a PR that addresses one of those
 issues and is waiting for review:
 https://github.com/apache/spark/pull/37408

 Perhaps it would also be good to track these pending issues
 somewhere to avoid duplicated effort.

 For example, I would like to pick up *union* and *union all*
 if no one has already.

 Thanks,
 Khalid


 On Mon, Aug 8, 2022 at 1:44 PM Hyukjin Kwon <
 gurwls...@gmail.com> wrote:

> Hi all,
>
> I am trying to improve the PySpark documentation, especially:
>
>- Make the examples self-contained, e.g.,
>
> https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.pivot.html
>- Document Parameters
>
> https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.pivot.html#pandas.DataFrame.pivot.
>    There are many APIs that are missing parameter documentation in PySpark,
> e.g., DataFrame.union
>
> Here is one example PR I am working on:
> https://github.com/apache/spark/pull/37437
> I can't do it all by myself. Any help, review, and
> contributions would be welcome and appreciated.
>
> Thank you all in advance.
>

>>
>> --
>> Best!
>> Qian SUN
>>
> --
 Best!
 Qian SUN

>>>


[Structured Streaming + Kafka] Reduced support for alternative offset management

2022-08-30 Thread Martin Andersson
I was looking around for documentation on how checkpointing (or rather, 
delivery semantics) is handled when consuming from Kafka with Structured 
Streaming, and I stumbled across this old documentation (which still somehow 
exists in the latest versions) at 
https://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html#checkpoints.

This page (which I assume dates from around Spark 2.4?) describes storing 
offsets via checkpointing as the least reliable method and goes on to explain 
how to commit offsets to Kafka or to an external store instead.

It also says:
If you enable Spark checkpointing, offsets will be stored in the checkpoint. 
(...) Furthermore, you cannot recover from a checkpoint if your application 
code has changed.

This all leaves me with several questions:

  1.  Is the above quote still true for Spark 3, i.e. that the checkpoint will 
break if you change the application code? What about changing the subscribe 
pattern?

  2.  Why was the option to manually commit offsets asynchronously to Kafka 
removed, when it was deemed more reliable than checkpointing? Not to mention 
that storing offsets in Kafka lets you use all the tools offered in the 
Kafka distribution to easily reset/rewind offsets on specific topics, which 
doesn't seem to be possible when using checkpoints.

  3.  From a user perspective, storing offsets in Kafka offers more features. 
From a developer perspective, having to re-implement offset storage with 
checkpointing across several output systems (such as HDFS, AWS S3 and other 
object storages) seems like a lot of unnecessary work and reinventing the 
wheel.
Is the discussion leading up to the decision to only support storing offsets 
with checkpointing documented anywhere, perhaps in a JIRA?
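
For context on question 1, this is roughly what the checkpoint-only model looks
like in Structured Streaming. This is a configuration sketch, not runnable
here: the broker address, topic name, and paths are hypothetical placeholders,
and it needs a live Kafka broker.

```python
# Hedged sketch (not from this thread): a minimal Structured Streaming read
# from Kafka where offset progress lives only in the checkpoint directory.
# Broker address, topic name, and paths are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-checkpoint-sketch").getOrCreate()

df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "events")          # or "subscribePattern"
      .option("startingOffsets", "latest")    # only consulted on the FIRST
                                              # start; afterwards the
                                              # checkpoint takes precedence
      .load())

query = (df.writeStream
         .format("parquet")
         .option("path", "/tmp/out")
         # All offset tracking is tied to this directory; deleting or
         # changing it effectively resets the consumer position.
         .option("checkpointLocation", "/tmp/checkpoint")
         .start())
```

The lack of an external, Kafka-visible committed offset in this model is
exactly what the questions above are about.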

Thanks for your time