Re: [DISCUSS] SPIP: Storage Partitioned Join for Data Source V2

2021-10-24 Thread huaxin gao
+1. Thanks for lifting the current restrictions on bucket join and making
this more generalized.

On Sun, Oct 24, 2021 at 9:33 AM Ryan Blue  wrote:

> +1 from me as well. Thanks Chao for doing so much to get it to this point!
>
> On Sat, Oct 23, 2021 at 11:29 PM DB Tsai  wrote:
>
>> +1 on this SPIP.
>>
>> This is a more generalized version of bucketed tables and bucketed
>> joins, which can eliminate very expensive data shuffles when joining.
>> Many users in the Apache Spark community have wanted this feature for
>> a long time!
>>
>> Thank you, Ryan and Chao, for working on this, and I look forward to
>> it as a new feature in Spark 3.3.
>>
>> DB Tsai  |  https://www.dbtsai.com/  |  PGP 42E5B25A8F7A82C1
>>
>> On Fri, Oct 22, 2021 at 12:18 PM Chao Sun  wrote:
>> >
>> > Hi,
>> >
>> > Ryan and I drafted a design doc to support a new type of join: storage
>> > partitioned join, which covers bucket join support for DataSourceV2 but is
>> > more general. The goal is to let Spark leverage distribution properties
>> > reported by data sources and eliminate shuffles whenever possible.
>> >
>> > Design doc:
>> > https://docs.google.com/document/d/1foTkDSM91VxKgkEcBMsuAvEjNybjja-uHk-r3vtXWFE
>> > (includes a POC link at the end)
>> >
>> > We'd like to start a discussion on the doc and any feedback is welcome!
>> >
>> > Thanks,
>> > Chao
>>
>
>
> --
> Ryan Blue
>


Re: [DISCUSS] SPIP: Storage Partitioned Join for Data Source V2

2021-10-24 Thread Ryan Blue
+1 from me as well. Thanks Chao for doing so much to get it to this point!

On Sat, Oct 23, 2021 at 11:29 PM DB Tsai  wrote:

> +1 on this SPIP.
>
> This is a more generalized version of bucketed tables and bucketed
> joins, which can eliminate very expensive data shuffles when joining.
> Many users in the Apache Spark community have wanted this feature for
> a long time!
>
> Thank you, Ryan and Chao, for working on this, and I look forward to
> it as a new feature in Spark 3.3.
>
> DB Tsai  |  https://www.dbtsai.com/  |  PGP 42E5B25A8F7A82C1
>
> On Fri, Oct 22, 2021 at 12:18 PM Chao Sun  wrote:
> >
> > Hi,
> >
> > Ryan and I drafted a design doc to support a new type of join: storage
> > partitioned join, which covers bucket join support for DataSourceV2 but is
> > more general. The goal is to let Spark leverage distribution properties
> > reported by data sources and eliminate shuffles whenever possible.
> >
> > Design doc:
> > https://docs.google.com/document/d/1foTkDSM91VxKgkEcBMsuAvEjNybjja-uHk-r3vtXWFE
> > (includes a POC link at the end)
> >
> > We'd like to start a discussion on the doc and any feedback is welcome!
> >
> > Thanks,
> > Chao
>


-- 
Ryan Blue


Re: [DISCUSS] SPIP: Storage Partitioned Join for Data Source V2

2021-10-24 Thread DB Tsai
+1 on this SPIP.

This is a more generalized version of bucketed tables and bucketed
joins, which can eliminate very expensive data shuffles when joining.
Many users in the Apache Spark community have wanted this feature for
a long time!

Thank you, Ryan and Chao, for working on this, and I look forward to
it as a new feature in Spark 3.3.

DB Tsai  |  https://www.dbtsai.com/  |  PGP 42E5B25A8F7A82C1

On Fri, Oct 22, 2021 at 12:18 PM Chao Sun  wrote:
>
> Hi,
>
> Ryan and I drafted a design doc to support a new type of join: storage
> partitioned join, which covers bucket join support for DataSourceV2 but is
> more general. The goal is to let Spark leverage distribution properties
> reported by data sources and eliminate shuffles whenever possible.
>
> Design doc:
> https://docs.google.com/document/d/1foTkDSM91VxKgkEcBMsuAvEjNybjja-uHk-r3vtXWFE
> (includes a POC link at the end)
>
> We'd like to start a discussion on the doc and any feedback is welcome!
>
> Thanks,
> Chao
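
For readers new to the idea, the shuffle elimination described in the
proposal can be illustrated with a toy sketch in plain Python. This is not
Spark code and not the design doc's implementation. The bucket function, the
bucket count, and the sample tables are all hypothetical and chosen only for
illustration. It assumes both sources store their rows clustered by the same
function of the join key, so matching partitions can be joined pairwise
without moving any row to another partition.

```python
from collections import defaultdict

NUM_BUCKETS = 4  # hypothetical bucket count shared by both tables


def bucket_of(key: int) -> int:
    """Hypothetical partition function; both sources must use the same one."""
    return key % NUM_BUCKETS


def write_partitioned(rows):
    """Simulate a source that stores (key, value) rows clustered by bucket."""
    partitions = defaultdict(list)
    for key, value in rows:
        partitions[bucket_of(key)].append((key, value))
    return partitions


def storage_partitioned_join(left_parts, right_parts):
    """Inner-join matching buckets pairwise; no row changes partition."""
    joined = []
    # Only buckets present on both sides can produce inner-join output.
    for bucket in sorted(left_parts.keys() & right_parts.keys()):
        # Build a per-bucket hash table on the right side.
        lookup = defaultdict(list)
        for key, value in right_parts[bucket]:
            lookup[key].append(value)
        # Probe it with the left side of the same bucket.
        for key, left_value in left_parts[bucket]:
            for right_value in lookup.get(key, []):
                joined.append((key, left_value, right_value))
    return joined


# Hypothetical sample data: two tables keyed by an integer id.
orders = write_partitioned([(1, "order-a"), (2, "order-b"), (5, "order-c")])
users = write_partitioned([(1, "alice"), (5, "eve"), (7, "zed")])
result = storage_partitioned_join(orders, users)
```

Because equal keys always land in the same bucket on both sides, the
per-bucket joins together produce the same rows as a join over the full
tables. That is the property a query engine would rely on to skip the
shuffle when both sources report compatible partitioning.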
