Reynold, did you get a chance to look at my response about using
`Expression`? I think that it's okay since it is already exposed in the v2
data source API. Plus, I wouldn't want to block this on building a public
expression API that is more stable.

I think that's the only objection to this SPIP. Anyone else want to raise
an issue with the proposal, or is it about time to bring up a vote thread?

rb

On Thu, Jul 26, 2018 at 5:00 PM Ryan Blue <rb...@netflix.com> wrote:

> I don’t think that we want to block this work until we have a public and
> stable Expression. Like our decision to expose InternalRow, I think that
> while this option isn’t great, it at least allows us to move forward. We
> can hopefully replace it later.
>
> Also note that the use of Expression is in the plug-in API, not in the
> public API. I think that it is easier to expect data source implementations
> to handle some instability here. We already use Expression as an option
> for push-down in DSv2 so there’s precedent for it. Plus, we need to be able
> to pass more complex expressions between the sources and Spark for sorting
> and clustering data when it’s written to DSv2 (SPARK-23889
> <https://issues.apache.org/jira/browse/SPARK-23889>).
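>
> To make the SPARK-23889 point concrete, here is a rough sketch of the kind
> of write-side hook that needs expressions (the trait and method names are
> hypothetical, only to show the shape, not the actual proposal):
>
> import org.apache.spark.sql.catalyst.expressions.{Expression, SortOrder}
>
> trait SupportsRequiredDistribution {
>   // expressions Spark should cluster incoming data by before writing,
>   // e.g. the table's partition expressions such as day(ts) or bucket(id, 16)
>   def requiredClustering(): Seq[Expression]
>
>   // sort order to apply within each write task before rows reach the writer
>   def requiredOrdering(): Seq[SortOrder]
> }
>
> A column-name-only API could not describe transforms like day(ts) or
> bucket(id, 16), which is why Expression keeps coming up here.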
>
> Simple expressions for bucketing and column-based partitions would almost
> certainly be stable. We can probably find a trade-off solution to not use
> Expression in the TableCatalog API, but we definitely need expressions for
> SPARK-23889.
>
> SortOrder would be easier to replace with a stricter class based only on
> column data rather than expressions. For #21306
> <https://github.com/apache/spark/pull/21306>, I just left it out
> entirely. What if I remove it from the proposal and we add it later?
>
> On Thu, Jul 26, 2018 at 4:32 PM Reynold Xin <r...@databricks.com> wrote:
>
>> Seems reasonable at a high level. I don't think we can use Expressions and
>> SortOrders in public APIs though. Those are not meant to be public and can
>> break easily across versions.
>>
>>
>> On Tue, Jul 24, 2018 at 9:26 AM Ryan Blue <rb...@netflix.com.invalid>
>> wrote:
>>
>>> The recently adopted SPIP to standardize logical plans requires a way to
>>> plug in providers for table metadata operations, so that the new
>>> plans can create and drop tables. I proposed an API to do this in a
>>> follow-up SPIP on APIs for Table Metadata Operations
>>> <https://docs.google.com/document/d/1zLFiA1VuaWeVxeTDXNg8bL6GP3BVoOZBkewFtEnjEoo/edit#>.
>>> This thread is to discuss that proposal.
>>>
>>> There are two main parts:
>>>
>>>    - A public facing API for creating, altering, and dropping tables
>>>    - An API for catalog implementations to provide the underlying table
>>>    operations
>>>
>>> The main need is for the plug-in API, but I included the public one
>>> because there isn’t currently a friendly public API to create tables and I
>>> think it helps to see how both would work together.
>>>
>>> Here’s a sample of the proposed public API:
>>>
>>> catalog.createTable("db.table")
>>>     .addColumn("id", LongType)
>>>     .addColumn("data", StringType, nullable=true)
>>>     .addColumn("ts", TimestampType)
>>>     .partitionBy(day($"ts"))
>>>     .config("prop", "val")
>>>     .commit()
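>>>
>>> Altering and dropping tables would follow the same builder style; for
>>> example (the method names here are illustrative, the SPIP document has
>>> the full set):
>>>
>>> catalog.alterTable("db.table")
>>>     .addColumn("category", StringType)
>>>     .commit()
>>>
>>> catalog.dropTable("db.table")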
>>>
>>> And here’s a sample of the catalog plug-in API:
>>>
>>> Table createTable(
>>>     TableIdentifier ident,
>>>     StructType schema,
>>>     List<Expression> partitions,
>>>     Optional<List<SortOrder>> sortOrder,
>>>     Map<String, String> properties)
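>>>
>>> To illustrate how a plug-in could satisfy this, here is a minimal
>>> in-memory sketch (TableCatalog and Table are names from the proposal;
>>> InMemoryTable and the exception choice are hypothetical):
>>>
>>> import java.util.{List => JList, Map => JMap, Optional}
>>> import java.util.concurrent.ConcurrentHashMap
>>> import org.apache.spark.sql.catalyst.TableIdentifier
>>> import org.apache.spark.sql.catalyst.expressions.{Expression, SortOrder}
>>> import org.apache.spark.sql.types.StructType
>>>
>>> class InMemoryTableCatalog extends TableCatalog {
>>>   private val tables = new ConcurrentHashMap[TableIdentifier, Table]()
>>>
>>>   override def createTable(
>>>       ident: TableIdentifier,
>>>       schema: StructType,
>>>       partitions: JList[Expression],
>>>       sortOrder: Optional[JList[SortOrder]],
>>>       properties: JMap[String, String]): Table = {
>>>     // hypothetical Table implementation that records the metadata
>>>     val table = new InMemoryTable(ident, schema, partitions, sortOrder, properties)
>>>     if (tables.putIfAbsent(ident, table) != null) {
>>>       throw new IllegalArgumentException(s"Table already exists: $ident")
>>>     }
>>>     table
>>>   }
>>> }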
>>>
>>> Note that this API passes both bucketing and column-based partitioning
>>> as Expressions. This is a generalization that makes it possible for the
>>> table to use the relationship between columns and partitions. In the
>>> example above, data is partitioned by the day of the timestamp field.
>>> Because the expression is passed to the table, the table can use predicates
>>> on the timestamp to filter out partitions without an explicit partition
>>> predicate. There’s more detail in the proposal on this.
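>>>
>>> As a rough sketch of what that pruning could look like inside a table
>>> implementation (the types here are hypothetical, not part of the API):
>>>
>>> import java.time.{Instant, LocalDate, ZoneOffset}
>>>
>>> // hypothetical partition metadata kept by the table: one entry per day
>>> case class DayPartition(day: LocalDate, location: String)
>>>
>>> // Because the table knows its partitioning is day(ts), a pushed-down
>>> // predicate like ts >= minTs can be turned into a bound on the day value,
>>> // so whole partitions are skipped without any user-written predicate on a
>>> // partition column.
>>> def prune(partitions: Seq[DayPartition], minTs: Instant): Seq[DayPartition] = {
>>>   val minDay = minTs.atZone(ZoneOffset.UTC).toLocalDate
>>>   partitions.filter(p => !p.day.isBefore(minDay))
>>> }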
>>>
>>> The SPIP is for the APIs and does not cover how multiple catalogs would
>>> be exposed. I started a separate discussion thread on how to access
>>> multiple catalogs and maintain compatibility with Spark’s current behavior
>>> (how to get the catalog instance in the above example).
>>>
>>> Please use this thread to discuss the proposed APIs. Thanks, everyone!
>>>
>>> rb
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>>
>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


-- 
Ryan Blue
Software Engineer
Netflix
