Re: Update Spark 3.3 release window?

2021-10-28 Thread Jungtaek Lim
+1 for mid-March 2022.

+1 for EOL 2.x as well. I guess we did it already according to
Dongjoon's quote from the Spark website.

On Fri, Oct 29, 2021 at 3:49 AM Dongjoon Hyun 
wrote:

> +1 for mid March for Spark 3.3.
>
> For 2.4, our document already mentions its EOL:
>
> " For example, 2.4.0 was released in November 2nd 2018 and had been
> maintained for 31 months until 2.4.8 was released on May 2021. 2.4.8 is the
> last release and no more 2.4.x releases should be expected even for bug
> fixes."
>
> Do we need something more explicit?
>
> Anyway, I'm +1 for that too if needed.
>
> Dongjoon
>
> On Thu, Oct 28, 2021 at 8:07 AM Gengliang Wang  wrote:
>
>> +1, Mid-March 2022 sounds good.
>>
>>
>> Gengliang
>>
>> On Thu, Oct 28, 2021 at 10:54 PM Tom Graves 
>> wrote:
>>
>>> +1 for updating, mid march sounds good.  I'm also fine with EOL 2.x.
>>>
>>> Tom
>>>
>>> On Thursday, October 28, 2021, 09:37:00 AM CDT, Mridul Muralidharan <
>>> mri...@gmail.com> wrote:
>>>
>>>
>>>
>>> +1 to EOL 2.x
>>> Mid march sounds like a good placeholder for 3.3.
>>>
>>> Regards,
>>> Mridul
>>>
>>> On Wed, Oct 27, 2021 at 10:38 PM Sean Owen  wrote:
>>>
>>> Seems fine to me - as good a placeholder as anything.
>>> Would that be about time to call 2.x end-of-life?
>>>
>>> On Wed, Oct 27, 2021 at 9:36 PM Hyukjin Kwon 
>>> wrote:
>>>
>>> Hi all,
>>>
Spark 3.2 is out. Shall we update the release window
>>> https://spark.apache.org/versioning-policy.html?
>>> I am thinking of Mid March 2022 (5 months after the 3.2 release) for
>>> code freeze and onward.
>>>
>>>


Re: Update Spark 3.3 release window?

2021-10-28 Thread Dongjoon Hyun
+1 for mid March for Spark 3.3.

For 2.4, our document already mentions its EOL:

" For example, 2.4.0 was released in November 2nd 2018 and had been
maintained for 31 months until 2.4.8 was released on May 2021. 2.4.8 is the
last release and no more 2.4.x releases should be expected even for bug
fixes."

Do we need something more explicit?

Anyway, I'm +1 for that too if needed.

Dongjoon



Re: Update Spark 3.3 release window?

2021-10-28 Thread Gengliang Wang
+1, Mid-March 2022 sounds good.

Gengliang



Re: Update Spark 3.3 release window?

2021-10-28 Thread Tom Graves
+1 for updating, mid march sounds good.  I'm also fine with EOL 2.x.

Tom

Re: Update Spark 3.3 release window?

2021-10-28 Thread Mridul Muralidharan
+1 to EOL 2.x
Mid march sounds like a good placeholder for 3.3.

Regards,
Mridul

On Wed, Oct 27, 2021 at 10:38 PM Sean Owen  wrote:

> Seems fine to me - as good a placeholder as anything.
> Would that be about time to call 2.x end-of-life?
>
> On Wed, Oct 27, 2021 at 9:36 PM Hyukjin Kwon  wrote:
>
>> Hi all,
>>
>> Spark 3.2. is out. Shall we update the release window
>> https://spark.apache.org/versioning-policy.html?
>> I am thinking of Mid March 2022 (5 months after the 3.2 release) for code
>> freeze and onward.
>>
>>


Re: [DISCUSS] SPIP: Storage Partitioned Join for Data Source V2

2021-10-28 Thread Wenchen Fan
Thanks for the explanation! It makes sense to always resolve the logical
transforms to concrete implementations, and check the concrete
implementations to decide compatible partitions. We can discuss more
details in the PR later.

On Thu, Oct 28, 2021 at 4:14 AM Ryan Blue  wrote:

> The transform expressions in v2 are logical, not concrete implementations.
> Even days may have different implementations -- the only expectation is
> that the partitions are day-sized. For example, you could use a transform
> that splits days at UTC 00:00, or uses some other day boundary.
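> A quick illustration of the point — two "day-sized" transforms that
> disagree on the boundary. (Hypothetical Python, not Spark code; the
> 8-hour offset is just an example of a non-UTC day cutoff.)

```python
from datetime import datetime, timedelta, timezone

# Transform 1: day partitions split at UTC 00:00.
def days_utc(ts: datetime) -> str:
    return ts.astimezone(timezone.utc).date().isoformat()

# Transform 2: day partitions split at a shifted boundary
# (e.g. a business-day cutoff 8 hours after UTC midnight).
def days_offset(ts: datetime, offset_hours: int = 8) -> str:
    shifted = ts.astimezone(timezone.utc) - timedelta(hours=offset_hours)
    return shifted.date().isoformat()

# Both are "day-sized", but the same timestamp lands in different partitions,
# so the logical transform alone is not enough to pair splits for a join.
ts = datetime(2021, 10, 28, 3, 0, tzinfo=timezone.utc)
print(days_utc(ts), days_offset(ts))  # 2021-10-28 vs 2021-10-27
```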
>
> Because the expressions are logical, we need to resolve them to
> implementations at some point, like Chao outlines. We can do that using a
> FunctionCatalog, although I think it's worth considering adding an
> interface so that a transform from a Table can be converted into a
> `BoundFunction` directly. That is easier than defining a way for Spark to
> query the function catalog.
>
> In any case, I'm sure it's easy to understand how this works once you get
> a concrete implementation.
>
> On Wed, Oct 27, 2021 at 9:35 AM Wenchen Fan  wrote:
>
>> `BucketTransform` is a builtin partition transform in Spark, instead of a
>> UDF from `FunctionCatalog`. Will Iceberg use UDF from `FunctionCatalog` to
>> represent its bucket transform, or use the Spark builtin `BucketTransform`?
>> I'm asking this because other v2 sources may also use the builtin
>> `BucketTransform` but use a different bucket hash function. Or we can
>> clearly define the bucket hash function of the builtin `BucketTransform` in
>> the doc.
>>
>> On Thu, Oct 28, 2021 at 12:25 AM Ryan Blue  wrote:
>>
>>> Two v2 sources may return different bucket IDs for the same value, and
>>> this breaks the phase 1 split-wise join.
>>>
>>> This is why the FunctionCatalog included a canonicalName method (docs).
>>> That method returns an identifier that can be used to compare whether two
>>> bucket function instances are the same.
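>>> A toy model of that comparison (the classes and canonical-name strings
>>> below are made up for illustration; in DSv2 the method lives on
>>> BoundFunction):

```python
# Two sources' bucket functions, each reporting a canonical identifier
# for the underlying hash semantics it implements.
class SourceABucket:
    def canonical_name(self) -> str:
        return "sourceA.bucket(murmur3, int)"

class SourceBBucket:
    def canonical_name(self) -> str:
        return "sourceB.bucket(crc32, int)"

def compatible(f, g) -> bool:
    # Same canonical name => same hash semantics, so splits can be
    # paired bucket-by-bucket on the driver.
    return f.canonical_name() == g.canonical_name()

print(compatible(SourceABucket(), SourceABucket()))  # True
print(compatible(SourceABucket(), SourceBBucket()))  # False
```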
>>>
>>>
>>>1. Can we apply this idea to partitioned file source tables
>>>(non-bucketed) as well?
>>>
>>> What do you mean here? The design doc discusses transforms like days(ts)
>>> that can be supported in the future. Is that what you’re asking about? Or
>>> are you referring to v1 file sources? I think the goal is to support v2,
>>> since v1 doesn’t have reliable behavior.
>>>
>>> Note that the initial implementation goal is to support bucketing since
>>> that’s an easier case because both sides have the same number of
>>> partitions. More complex storage-partitioned joins can be implemented later.
>>>
>>>
>>>1. What if the table has many partitions? Shall we apply certain
>>>join algorithms in the phase 1 split-wise join as well? Or even launch a
>>>Spark job to do so?
>>>
>>> I think that this proposal opens up a lot of possibilities, like what
>>> you’re suggesting here. It is a bit like AQE. We’ll need to come up with
>>> heuristics for choosing how and when to use storage partitioning in joins.
>>> As I said above, bucketing is a great way to get started because it fills
>>> an existing gap. More complex use cases can be supported over time.
>>>
>>> Ryan
>>>
>>> On Wed, Oct 27, 2021 at 9:08 AM Wenchen Fan  wrote:
>>>
 IIUC, the general idea is to let each input split report its partition
 value, and Spark can perform the join in two phases:
 1. join the input splits from left and right tables according to their
 partition values and join keys, on the driver side.
 2. for each joined input splits pair (or a group of splits), launch a
 Spark task to join the rows.
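 A toy sketch of these two phases (illustrative Python only — real splits
 and partition values are Spark internals, not plain dicts):

```python
# Toy model: each input split reports the partition value it covers.
left_splits = [{"part": "2021-10-01", "rows": [(1, "a"), (2, "b")]},
               {"part": "2021-10-02", "rows": [(3, "c")]}]
right_splits = [{"part": "2021-10-01", "rows": [(1, "x")]},
                {"part": "2021-10-02", "rows": [(3, "y")]}]

# Phase 1 (driver side): pair up splits whose partition values match.
def pair_splits(left, right):
    by_part = {s["part"]: s for s in right}
    return [(l, by_part[l["part"]]) for l in left if l["part"] in by_part]

# Phase 2 (one task per pair): join the rows on the join key.
def join_pair(l, r):
    rkeys = dict(r["rows"])
    return [(k, lv, rkeys[k]) for k, lv in l["rows"] if k in rkeys]

result = [row for l, r in pair_splits(left_splits, right_splits)
          for row in join_pair(l, r)]
print(result)  # [(1, 'a', 'x'), (3, 'c', 'y')]
```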

 My major concern is about how to define "compatible partitions". Things
 like `days(ts)` are straightforward: the same timestamp value always
 results in the same partition value in any v2 source. `bucket(col,
 num)` is tricky, as Spark doesn't define the bucket hash function. Two v2
 sources may return different bucket IDs for the same value, and this breaks
 the phase 1 split-wise join.
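 To make the concern concrete, here is a hypothetical example of two
 sources whose bucket hash functions are both reasonable yet disagree
 (neither is the real hash of any particular source):

```python
# Source A: Java-style 31-based string hash.
def bucket_source_a(value: str, num_buckets: int) -> int:
    h = 0
    for ch in value:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h % num_buckets

# Source B: simple additive hash.
def bucket_source_b(value: str, num_buckets: int) -> int:
    return sum(ord(ch) for ch in value) % num_buckets

# Same value, same bucket count, different bucket IDs: pairing splits
# by bucket ID in phase 1 would match the wrong partitions together.
print(bucket_source_a("spark", 16), bucket_source_b("spark", 16))
```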

 And two questions for further improvements:
 1. Can we apply this idea to partitioned file source tables
 (non-bucketed) as well?
 2. What if the table has many partitions? Shall we apply certain join
 algorithms in the phase 1 split-wise join as well? Or even launch a Spark
 job to do so?

 Thanks,
 Wenchen

 On Wed, Oct 27, 2021 at 3:08 AM Chao Sun  wrote:

> Thanks Cheng for the comments.
>
> > Is migrating Hive table read path to data source v2, being a
> prerequisite of this SPIP
>
> Yes, this SPIP only aims at DataSourceV2, so obviously it will help if
> Hive eventually moves to use V2 API. With that said, I think some of the
> ideas could be useful for V1 Hive support as well. For instance, with