This makes sense to me. It would be great to have some feedback from people such as @Wenchen Fan <wenc...@databricks.com>, @Cheng Su <chen...@fb.com>, and @angers zhu <angers....@gmail.com>.

On Tue, 26 Oct 2021 at 17:25, Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:

> +1 for this SPIP.
>
> On Sun, Oct 24, 2021 at 9:59 AM huaxin gao <huaxin.ga...@gmail.com> wrote:
>
>> +1. Thanks for lifting the current restrictions on bucket join and making
>> this more generalized.
>>
>> On Sun, Oct 24, 2021 at 9:33 AM Ryan Blue <b...@apache.org> wrote:
>>
>>> +1 from me as well. Thanks Chao for doing so much to get it to this
>>> point!
>>>
>>> On Sat, Oct 23, 2021 at 11:29 PM DB Tsai <dbt...@dbtsai.com> wrote:
>>>
>>>> +1 on this SPIP.
>>>>
>>>> This is a more generalized version of bucketed tables and bucketed
>>>> joins which can eliminate very expensive data shuffles when joins, and
>>>> many users in the Apache Spark community have wanted this feature for
>>>> a long time!
>>>>
>>>> Thank you, Ryan and Chao, for working on this, and I look forward to
>>>> it as a new feature in Spark 3.3
>>>>
>>>> DB Tsai | https://www.dbtsai.com/ | PGP 42E5B25A8F7A82C1
>>>>
>>>> On Fri, Oct 22, 2021 at 12:18 PM Chao Sun <sunc...@apache.org> wrote:
>>>> >
>>>> > Hi,
>>>> >
>>>> > Ryan and I drafted a design doc to support a new type of join:
>>>> > storage partitioned join which covers bucket join support for DataSourceV2
>>>> > but is more general. The goal is to let Spark leverage distribution
>>>> > properties reported by data sources and eliminate shuffle whenever
>>>> > possible.
>>>> >
>>>> > Design doc:
>>>> > https://docs.google.com/document/d/1foTkDSM91VxKgkEcBMsuAvEjNybjja-uHk-r3vtXWFE
>>>> > (includes a POC link at the end)
>>>> >
>>>> > We'd like to start a discussion on the doc and any feedback is
>>>> > welcome!
>>>> >
>>>> > Thanks,
>>>> > Chao
>>>>
>>>
>>>
>>> --
>>> Ryan Blue
>>>
>>