+1 Nicely done!

On Tue, Oct 26, 2021 at 8:08 AM Chao Sun <[email protected]> wrote:

> Oops, sorry. I just fixed the permission setting.
>
> Thanks everyone for the positive support!
>
> On Tue, Oct 26, 2021 at 7:30 AM Wenchen Fan <[email protected]> wrote:
>
>> +1 to this SPIP and nice writeup of the design doc!
>>
>> Can we open comment permission in the doc so that we can discuss details
>> there?
>>
>> On Tue, Oct 26, 2021 at 8:29 PM Hyukjin Kwon <[email protected]> wrote:
>>
>>> Makes sense to me.
>>>
>>> Would be great to have some feedback from people such as @Wenchen Fan
>>> <[email protected]>, @Cheng Su <[email protected]>, and @angers zhu
>>> <[email protected]>.
>>>
>>> On Tue, 26 Oct 2021 at 17:25, Dongjoon Hyun <[email protected]> wrote:
>>>
>>>> +1 for this SPIP.
>>>>
>>>> On Sun, Oct 24, 2021 at 9:59 AM huaxin gao <[email protected]> wrote:
>>>>
>>>>> +1. Thanks for lifting the current restrictions on bucket join and
>>>>> making it more general.
>>>>>
>>>>> On Sun, Oct 24, 2021 at 9:33 AM Ryan Blue <[email protected]> wrote:
>>>>>
>>>>>> +1 from me as well. Thanks Chao for doing so much to get it to this
>>>>>> point!
>>>>>>
>>>>>> On Sat, Oct 23, 2021 at 11:29 PM DB Tsai <[email protected]> wrote:
>>>>>>
>>>>>>> +1 on this SPIP.
>>>>>>>
>>>>>>> This is a more generalized version of bucketed tables and bucketed
>>>>>>> joins, which can eliminate very expensive data shuffles during
>>>>>>> joins; many users in the Apache Spark community have wanted this
>>>>>>> feature for a long time!
>>>>>>>
>>>>>>> Thank you, Ryan and Chao, for working on this. I look forward to
>>>>>>> it as a new feature in Spark 3.3.
>>>>>>>
>>>>>>> DB Tsai | https://www.dbtsai.com/ | PGP 42E5B25A8F7A82C1
>>>>>>>
>>>>>>> On Fri, Oct 22, 2021 at 12:18 PM Chao Sun <[email protected]> wrote:
>>>>>>> >
>>>>>>> > Hi,
>>>>>>> >
>>>>>>> > Ryan and I drafted a design doc to support a new type of join:
>>>>>>> > storage partitioned join, which covers bucket join support for
>>>>>>> > DataSourceV2 but is more general. The goal is to let Spark
>>>>>>> > leverage distribution properties reported by data sources and
>>>>>>> > eliminate shuffles whenever possible.
>>>>>>> >
>>>>>>> > Design doc:
>>>>>>> > https://docs.google.com/document/d/1foTkDSM91VxKgkEcBMsuAvEjNybjja-uHk-r3vtXWFE
>>>>>>> > (includes a POC link at the end)
>>>>>>> >
>>>>>>> > We'd like to start a discussion on the doc, and any feedback is
>>>>>>> > welcome!
>>>>>>> >
>>>>>>> > Thanks,
>>>>>>> > Chao
>>>>>>
>>>>>> --
>>>>>> Ryan Blue

--
John Zhuge
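[Editor's note: the shuffle elimination the thread discusses can be seen with today's built-in bucketing, which the SPIP generalizes to partition transforms reported by DataSourceV2 sources. A minimal Spark SQL sketch follows; table and column names are illustrative, not from the design doc.]

```sql
-- When both sides of a join are bucketed the same way on the join key,
-- Spark can skip the expensive shuffle (Exchange) before the join.
CREATE TABLE orders (order_id BIGINT, customer_id BIGINT, total DOUBLE)
  USING parquet CLUSTERED BY (customer_id) INTO 16 BUCKETS;

CREATE TABLE customers (customer_id BIGINT, name STRING)
  USING parquet CLUSTERED BY (customer_id) INTO 16 BUCKETS;

-- Both sides report the same hash distribution on customer_id, so the
-- physical plan needs no Exchange node before the join.
SELECT o.order_id, c.name
FROM orders o JOIN customers c ON o.customer_id = c.customer_id;
```

The SPIP extends this idea beyond identical bucket counts on built-in tables, so that any V2 source reporting compatible partitioning can join shuffle-free.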
