Russell, your special columns wouldn't actually work with option 1, because
Spark would have to fail them in analysis when there is no actual physical
column behind them.

On Tue, Sep 4, 2018 at 9:12 PM Russell Spitzer <russell.spit...@gmail.com>
wrote:

> I'm a big fan of 1 as well. I had to implement something similar using
> custom expressions and it was a bit more work than it should have been. In
> particular, our use case is that columns have certain metadata (ttl,
> writetime) that exists not as separate columns but as special values that
> can be surfaced.
>
> I still don't have a good solution for the same thing at write time, though,
> since the problem is a bit asymmetric for us. While you can read metadata
> from any particular cell, on write you specify it for the whole row.
>
> On Tue, Sep 4, 2018 at 11:04 PM Ryan Blue <rb...@netflix.com.invalid>
> wrote:
>
>> Thanks for posting the summary. I'm strongly in favor of option 1.
>>
>> I think that API footprint is fairly small, but worth it. Not only does
>> it make sources easier to implement by handling parsing, it also makes
>> sources more reliable because Spark handles validation the same way across
>> sources.
>>
>> A good example is making sure that the referenced columns exist in the
>> table, which should be done using the case sensitivity of the analyzer.
>> Spark would pass normalized column names that match the case of the
>> declared columns, to ensure that there isn't a mismatch if Spark is case
>> insensitive but the source only does case-sensitive matching. And the source wouldn't
>> have to know about Spark's case sensitivity settings at all.
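>>
>> To be concrete, the normalization Spark could do on behalf of sources would
>> look roughly like this (just a sketch to illustrate, not a proposed API):
>>
>> def normalizeColumn(
>>     ref: String,
>>     declaredColumns: Seq[String],
>>     caseSensitive: Boolean): String = {
>>   val matches =
>>     if (caseSensitive) declaredColumns.filter(_ == ref)
>>     else declaredColumns.filter(_.equalsIgnoreCase(ref))
>>   matches match {
>>     // pass the declared casing through to the source
>>     case Seq(name) => name
>>     case Seq() => throw new IllegalArgumentException(s"Column not found: $ref")
>>     case _ => throw new IllegalArgumentException(s"Ambiguous column reference: $ref")
>>   }
>> }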
>>
>> On Tue, Sep 4, 2018 at 7:46 PM Reynold Xin <r...@databricks.com> wrote:
>>
>>> Ryan, Michael and I discussed this offline today. Some notes here:
>>>
>>> His use case is to support partitioning data by derived columns, rather
>>> than physical columns, because he didn't want his users to keep adding a
>>> "date" column when in reality it is purely derived from some timestamp
>>> column. We reached consensus that this is a great use case and something we
>>> should support.
>>>
>>> We are still debating how to do this at the API level. There are two options:
>>>
>>> *Option 1.* Create a smaller-surface, parallel Expression library and
>>> use it for specifying partition columns. The bare minimum class hierarchy
>>> would look like:
>>>
>>> trait Expression
>>>
>>> class NamedFunction(name: String, args: Seq[Expression]) extends Expression
>>>
>>> class Literal(value: Any) extends Expression
>>>
>>> class ColumnReference(name: String) extends Expression
>>>
>>> These classes don't define how the expressions are evaluated, and it'd
>>> be up to the data sources to interpret them. As an example, for a table
>>> partitioned by date(ts), Spark would pass the following to the underlying
>>> ds:
>>>
>>> NamedFunction("date", ColumnReference("ts") :: Nil)
>>>
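>>> To make the contract concrete, here is roughly how a source could interpret
>>> that tree (sketch only, using case classes for easy matching; not the exact
>>> proposed API):
>>>
>>> sealed trait Expression
>>> case class NamedFunction(name: String, args: Seq[Expression]) extends Expression
>>> case class Literal(value: Any) extends Expression
>>> case class ColumnReference(name: String) extends Expression
>>>
>>> // The source decides what functions like "date" mean; Spark never evaluates them.
>>> def toPartitionSpec(expr: Expression): String = expr match {
>>>   case ColumnReference(name) => name
>>>   case Literal(value) => value.toString
>>>   case NamedFunction(name, args) =>
>>>     s"$name(${args.map(toPartitionSpec).mkString(", ")})"
>>> }
>>>
>>> // For the date(ts) example above:
>>> toPartitionSpec(NamedFunction("date", ColumnReference("ts") :: Nil))  // "date(ts)"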
>>>
>>> *Option 2.* Spark passes strings over to the data sources. For the
>>> above example, Spark would simply pass the string "date(ts)".
>>>
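>>> With option 2, each source would have to parse and validate that string
>>> itself, with no help from Spark's analyzer. Roughly (illustrative only):
>>>
>>> // Every source re-implements something like this under option 2.
>>> val FunctionPattern = """(\w+)\((\w+)\)""".r
>>>
>>> def parsePartition(spec: String): (String, String) = spec.trim match {
>>>   case FunctionPattern(func, col) => (func, col)       // e.g. ("date", "ts")
>>>   case col if col.matches("""\w+""") => ("identity", col)  // bare column name
>>>   case other =>
>>>     throw new IllegalArgumentException(s"Cannot parse partition spec: $other")
>>> }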
>>>
>>> The pros/cons of 1 vs 2 are basically the inverse of each other. Option
>>> 1 creates a more rigid structure, with extra complexity in API design. Option
>>> 2 is less structured but more flexible. Option 1 gives Spark the
>>> opportunity to enforce that column references are valid (but not the actual
>>> function names), whereas with option 2 all validation would be up to the
>>> data sources.
>>>
>>>
>>>
>>> On Wed, Aug 15, 2018 at 2:27 PM Ryan Blue <rb...@netflix.com> wrote:
>>>
>>>> I think I found a good solution to the problem of using Expression in
>>>> the TableCatalog API and in the DeleteSupport API.
>>>>
>>>> For DeleteSupport, there is already a stable and public subset of
>>>> Expression named Filter that can be used to pass filters. The reason why
>>>> DeleteSupport would use Expression is to support more complex expressions
>>>> like to_date(ts) = '2018-08-15' that are translated to ts >=
>>>> 1534316400000000 AND ts < 1534402800000000. But this translation can be done
>>>> in Spark instead of in the data sources, so I think DeleteSupport should use
>>>> Filter instead. I updated the DeleteSupport PR #21308
>>>> <https://github.com/apache/spark/pull/21308> with these changes.
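>>>>
>>>> Concretely, the to_date(ts) example would reach the source as something like
>>>> this, using the existing public Filter classes:
>>>>
>>>> import org.apache.spark.sql.sources.{And, Filter, GreaterThanOrEqual, LessThan}
>>>>
>>>> // Spark rewrites to_date(ts) = '2018-08-15' into a range on ts (microseconds)
>>>> // and hands the source plain filters; the source never sees the expression.
>>>> val deleteFilter: Filter = And(
>>>>   GreaterThanOrEqual("ts", 1534316400000000L),
>>>>   LessThan("ts", 1534402800000000L))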
>>>>
>>>> Also, I agree that the DataSourceV2 API should not expose
>>>> Expression, so I opened SPARK-25127 to track removing
>>>> SupportsPushDownCatalystFilter
>>>> <https://issues.apache.org/jira/browse/SPARK-25127>.
>>>>
>>>> For TableCatalog, I took a similar approach and avoided introducing a
>>>> parallel Expression API. Instead, I created a PartitionTransform API (like
>>>> Filter) that communicates the transformation function, function parameters
>>>> like num buckets, and column references. I updated the TableCatalog PR
>>>> #21306 <https://github.com/apache/spark/pull/21306> to use
>>>> PartitionTransform instead of Expression and I updated the text of the SPIP
>>>> doc
>>>> <https://docs.google.com/document/d/1zLFiA1VuaWeVxeTDXNg8bL6GP3BVoOZBkewFtEnjEoo/edit#heading=h.m45webtwxf2d>
>>>> .
>>>>
>>>> I also raised a concern about needing to wait for Spark to add support
>>>> for new expressions (now partition transforms). To get around this, I added
>>>> an apply transform that passes the name of a function and an input
>>>> column. That way, users can still pass transforms that Spark doesn’t know
>>>> about to data sources by name: apply("source_function", "colName").
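>>>>
>>>> For anyone who hasn't read the PR yet, the rough shape is something like the
>>>> sketch below. This is simplified and the names may differ from the actual code
>>>> in #21306; like Filter, a transform only carries names and parameters and does
>>>> not define evaluation, so sources interpret it themselves.
>>>>
>>>> trait PartitionTransform {
>>>>   def name: String
>>>>   def references: Seq[String]  // referenced column names
>>>> }
>>>>
>>>> case class DateTransform(column: String) extends PartitionTransform {
>>>>   def name: String = "date"
>>>>   def references: Seq[String] = Seq(column)
>>>> }
>>>>
>>>> case class BucketTransform(numBuckets: Int, column: String) extends PartitionTransform {
>>>>   def name: String = "bucket"
>>>>   def references: Seq[String] = Seq(column)
>>>> }
>>>>
>>>> // The escape hatch for functions Spark doesn't know about:
>>>> case class ApplyTransform(function: String, column: String) extends PartitionTransform {
>>>>   def name: String = function
>>>>   def references: Seq[String] = Seq(column)
>>>> }
>>>>
>>>> // apply("source_function", "colName") then simply becomes:
>>>> ApplyTransform("source_function", "colName")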
>>>>
>>>> Please have a look at the updated pull requests and SPIP doc and
>>>> comment!
>>>>
>>>> rb
>>>>
>>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>
