+1 from me as well. Thanks Chao for doing so much to get it to this point!

On Sat, Oct 23, 2021 at 11:29 PM DB Tsai <dbt...@dbtsai.com> wrote:
> +1 on this SPIP.
>
> This is a more generalized version of bucketed tables and bucketed
> joins which can eliminate very expensive data shuffles when joining, and
> many users in the Apache Spark community have wanted this feature for
> a long time!
>
> Thank you, Ryan and Chao, for working on this, and I look forward to
> it as a new feature in Spark 3.3.
>
> DB Tsai | https://www.dbtsai.com/ | PGP 42E5B25A8F7A82C1
>
> On Fri, Oct 22, 2021 at 12:18 PM Chao Sun <sunc...@apache.org> wrote:
> >
> > Hi,
> >
> > Ryan and I drafted a design doc to support a new type of join: storage
> > partitioned join, which covers bucket join support for DataSourceV2 but is
> > more general. The goal is to let Spark leverage distribution properties
> > reported by data sources and eliminate shuffle whenever possible.
> >
> > Design doc:
> > https://docs.google.com/document/d/1foTkDSM91VxKgkEcBMsuAvEjNybjja-uHk-r3vtXWFE
> > (includes a POC link at the end)
> >
> > We'd like to start a discussion on the doc and any feedback is welcome!
> >
> > Thanks,
> > Chao

--
Ryan Blue