+1 for the SPIP. This is a great improvement and optimization!

On 2021/10/26 19:01:03, Erik Krogen <[email protected]> wrote:
> It's great to see this SPIP going live. Once this is complete, it will
> really help Spark to play nicely with a broader data ecosystem (Hive,
> Iceberg, Trino, etc.), and it's great to see that besides just bringing the
> existing bucketed-join support to V2, we are also making the types of
> partitioning that can be accommodated more broad and leaving open pathways
> for future optimizations like partially clustered distributions.
>
> Big thanks to Ryan and Chao!
>
> On Tue, Oct 26, 2021 at 10:35 AM Cheng Su <[email protected]> wrote:
>
> > +1 for this. This is exciting movement to efficiently read bucketed table
> > from other systems (Hive, Trino & Presto)!
> >
> > Still looking at the details but having some early questions:
> >
> > 1. Is migrating Hive table read path to data source v2, being a
> > prerequisite of this SPIP?
> >
> > Hive table read path is currently a mix of data source v1 (for Parquet &
> > ORC file format only), and legacy Hive code path (HiveTableScanExec). In
> > the SPIP, I am seeing we only make change for data source v2, so wondering
> > how this would work with existing Hive table read path. In addition, just
> > FYI, supporting writing Hive bucketed table is merged in master recently (
> > SPARK-19256 <https://issues.apache.org/jira/browse/SPARK-19256> has
> > details).
> >
> > 2. Would aggregate work automatically after the SPIP?
> >
> > Another major benefit for having bucketed table, is to avoid shuffle
> > before aggregate. Just want to bring to our attention that it would be
> > great to consider aggregate as well when doing this proposal.
> >
> > 3. Any major use cases in mind except Hive bucketed table?
> >
> > Just curious if there’s any other use cases we are targeting as part of
> > SPIP.
> >
> > Thanks,
> > Cheng Su
> >
> > *From: *Ryan Blue <[email protected]>
> > *Date: *Tuesday, October 26, 2021 at 9:39 AM
> > *To: *John Zhuge <[email protected]>
> > *Cc: *Chao Sun <[email protected]>, Wenchen Fan <[email protected]>,
> > Cheng Su <[email protected]>, DB Tsai <[email protected]>, Dongjoon Hyun <
> > [email protected]>, Hyukjin Kwon <[email protected]>, Wenchen Fan
> > <[email protected]>, angers zhu <[email protected]>, dev <
> > [email protected]>, huaxin gao <[email protected]>
> > *Subject: *Re: [DISCUSS] SPIP: Storage Partitioned Join for Data Source V2
> >
> > Instead of commenting on the doc, could we keep discussion here on the dev
> > list please? That way more people can follow it and there is more room for
> > discussion. Comment threads have a very small area and easily become hard
> > to follow.
> >
> > Ryan
> >
> > On Tue, Oct 26, 2021 at 9:32 AM John Zhuge <[email protected]> wrote:
> >
> > +1 Nicely done!
> >
> > On Tue, Oct 26, 2021 at 8:08 AM Chao Sun <[email protected]> wrote:
> >
> > Oops, sorry. I just fixed the permission setting.
> >
> > Thanks everyone for the positive support!
> >
> > On Tue, Oct 26, 2021 at 7:30 AM Wenchen Fan <[email protected]> wrote:
> >
> > +1 to this SPIP and nice writeup of the design doc!
> >
> > Can we open comment permission in the doc so that we can discuss details
> > there?
> >
> > On Tue, Oct 26, 2021 at 8:29 PM Hyukjin Kwon <[email protected]> wrote:
> >
> > Seems making sense to me.
> >
> > Would be great to have some feedback from people such as @Wenchen Fan
> > <[email protected]> @Cheng Su <[email protected]> @angers zhu
> > <[email protected]>.
> >
> > On Tue, 26 Oct 2021 at 17:25, Dongjoon Hyun <[email protected]>
> > wrote:
> >
> > +1 for this SPIP.
> >
> > On Sun, Oct 24, 2021 at 9:59 AM huaxin gao <[email protected]> wrote:
> >
> > +1. Thanks for lifting the current restrictions on bucket join and making
> > this more generalized.
> >
> > On Sun, Oct 24, 2021 at 9:33 AM Ryan Blue <[email protected]> wrote:
> >
> > +1 from me as well. Thanks Chao for doing so much to get it to this point!
> >
> > On Sat, Oct 23, 2021 at 11:29 PM DB Tsai <[email protected]> wrote:
> >
> > +1 on this SPIP.
> >
> > This is a more generalized version of bucketed tables and bucketed
> > joins which can eliminate very expensive data shuffles when joins, and
> > many users in the Apache Spark community have wanted this feature for
> > a long time!
> >
> > Thank you, Ryan and Chao, for working on this, and I look forward to
> > it as a new feature in Spark 3.3
> >
> > DB Tsai | https://www.dbtsai.com/ | PGP 42E5B25A8F7A82C1
> >
> > On Fri, Oct 22, 2021 at 12:18 PM Chao Sun <[email protected]> wrote:
> > >
> > > Hi,
> > >
> > > Ryan and I drafted a design doc to support a new type of join: storage
> > > partitioned join which covers bucket join support for DataSourceV2 but is
> > > more general. The goal is to let Spark leverage distribution properties
> > > reported by data sources and eliminate shuffle whenever possible.
> > >
> > > Design doc:
> > > https://docs.google.com/document/d/1foTkDSM91VxKgkEcBMsuAvEjNybjja-uHk-r3vtXWFE
> > > (includes a POC link at the end)
> > >
> > > We'd like to start a discussion on the doc and any feedback is welcome!
> > >
> > > Thanks,
> > > Chao
> >
> > --
> > Ryan Blue
> >
> > --
> > John Zhuge
> >
> > --
> > Ryan Blue
