Reynold, did you get a chance to look at my response about using `Expression`? I think that it's okay since it is already exposed in the v2 data source API. Plus, I wouldn't want to block this on building a public expression API that is more stable.
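To be concrete about what "simple expressions" means here, this is roughly what the day($"ts") partition from the proposal's example could look like once it is resolved against the table schema. Whether day maps to catalyst's DayOfMonth or to a date-truncation expression is an assumption on my part; the sketch is only meant to show the shape of what crosses the API:

    import org.apache.spark.sql.catalyst.expressions.{AttributeReference, DayOfMonth, Expression}
    import org.apache.spark.sql.types.TimestampType

    // the ts column, resolved against the table schema
    val ts = AttributeReference("ts", TimestampType)()
    // roughly what would arrive in the List<Expression> partitions argument
    val partitionExpr: Expression = DayOfMonth(ts)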
I think that's the only objection to this SPIP. Does anyone else want to raise an issue with the proposal, or is it about time to start a vote thread?

rb

On Thu, Jul 26, 2018 at 5:00 PM Ryan Blue <rb...@netflix.com> wrote:

> I don’t think that we want to block this work until we have a public and
> stable Expression. Like our decision to expose InternalRow, this option
> isn’t great, but it at least allows us to move forward. We can hopefully
> replace it later.
>
> Also note that the use of Expression is in the plug-in API, not in the
> public API. I think it is reasonable to expect data source implementations
> to handle some instability there. We already use Expression as an option
> for push-down in DSv2, so there’s precedent for it. Plus, we need to be
> able to pass more complex expressions between the sources and Spark for
> sorting and clustering data when it’s written to DSv2 (SPARK-23889
> <https://issues.apache.org/jira/browse/SPARK-23889>).
>
> Simple expressions for bucketing and column-based partitions would almost
> certainly be stable. We can probably find a trade-off solution that avoids
> Expression in the TableCatalog API, but we definitely need expressions for
> SPARK-23889.
>
> SortOrder would be easier to replace with a stricter class based only on
> column data rather than expressions. For #21306
> <https://github.com/apache/spark/pull/21306>, I just left it out entirely.
> What if I removed it from the proposal and we add it later?
>
> On Thu, Jul 26, 2018 at 4:32 PM Reynold Xin <r...@databricks.com> wrote:
>
>> Seems reasonable at a high level. I don't think we can use Expressions
>> and SortOrders in public APIs, though. Those are not meant to be public
>> and can break easily across versions.
>>
>> On Tue, Jul 24, 2018 at 9:26 AM Ryan Blue <rb...@netflix.com.invalid>
>> wrote:
>>
>>> The recently adopted SPIP to standardize logical plans requires a way
>>> to plug in providers for table metadata operations, so that the new
>>> plans can create and drop tables. I proposed an API to do this in a
>>> follow-up SPIP on APIs for Table Metadata Operations
>>> <https://docs.google.com/document/d/1zLFiA1VuaWeVxeTDXNg8bL6GP3BVoOZBkewFtEnjEoo/edit#>.
>>> This thread is to discuss that proposal.
>>>
>>> There are two main parts:
>>>
>>> - A public-facing API for creating, altering, and dropping tables
>>> - An API for catalog implementations to provide the underlying table
>>>   operations
>>>
>>> The main need is for the plug-in API, but I included the public one
>>> because there isn’t currently a friendly public API to create tables,
>>> and I think it helps to see how the two would work together.
>>>
>>> Here’s a sample of the proposed public API:
>>>
>>>     catalog.createTable("db.table")
>>>         .addColumn("id", LongType)
>>>         .addColumn("data", StringType, nullable=true)
>>>         .addColumn("ts", TimestampType)
>>>         .partitionBy(day($"ts"))
>>>         .config("prop", "val")
>>>         .commit()
>>>
>>> And here’s a sample of the catalog plug-in API:
>>>
>>>     Table createTable(
>>>         TableIdentifier ident,
>>>         StructType schema,
>>>         List<Expression> partitions,
>>>         Optional<List<SortOrder>> sortOrder,
>>>         Map<String, String> properties)
>>>
>>> Note that this API passes both bucketing and column-based partitioning
>>> as Expressions. This is a generalization that makes it possible for the
>>> table to use the relationship between columns and partitions. In the
>>> example above, data is partitioned by the day of the timestamp field.
>>> Because the expression is passed to the table, the table can use
>>> predicates on the timestamp to filter out partitions without an explicit
>>> partition predicate. There’s more detail on this in the proposal.
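>>>
>>> To make that concrete, here is a rough sketch of how a catalog
>>> implementation could use the stored partition expression to prune
>>> partitions from an equality predicate. The helper name and the matching
>>> strategy are invented for illustration and are not part of the proposal:
>>>
>>>     import org.apache.spark.sql.catalyst.expressions._
>>>
>>>     // Sketch: with partition expression day(ts) and predicate
>>>     // ts = <literal>, substituting the literal into the partition
>>>     // expression yields day(<literal>), which identifies the partition.
>>>     def matchingPartition(
>>>         partitionExpr: Expression,
>>>         predicate: Expression): Option[Expression] = predicate match {
>>>       case EqualTo(col: AttributeReference, lit: Literal) =>
>>>         Some(partitionExpr.transform {
>>>           case a: AttributeReference if a.semanticEquals(col) => lit
>>>         })
>>>       case _ => None // other predicate shapes would need more analysis
>>>     }
>>>
>>> A real implementation would also handle range predicates, but this shows
>>> why passing the expression, rather than just a partition column name, is
>>> useful.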
>>>
>>> The SPIP covers only the APIs; it does not address how multiple catalogs
>>> would be exposed. I started a separate discussion thread on how to access
>>> multiple catalogs and maintain compatibility with Spark’s current
>>> behavior (i.e., how to get the catalog instance in the example above).
>>>
>>> Please use this thread to discuss the proposed APIs. Thanks, everyone!
>>>
>>> rb
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>>
>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix

--
Ryan Blue
Software Engineer
Netflix