Thanks for posting the summary. I'm strongly in favor of option 1.

I think the added API footprint is fairly small, and worth it. Not only does it
make sources easier to implement because Spark handles the parsing, it also
makes sources more reliable because Spark applies validation the same way
across all sources.

A good example is making sure that the referenced columns exist in the
table, which should be done using the case sensitivity setting of the analyzer.
Spark would pass normalized column names that match the case of the declared
columns, so there isn't a mismatch when Spark is case insensitive but the
source isn't. And the source wouldn't have to know about Spark's case
sensitivity setting at all.
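
To make that concrete, here's a rough sketch of the kind of normalization I
mean (the normalizeColumn method and its signature are just for illustration,
not a proposed API):

import org.apache.spark.sql.types.StructType

// Resolve a referenced column against the declared schema using Spark's case
// sensitivity setting, and hand the *declared* name on to the source.
def normalizeColumn(schema: StructType, ref: String, caseSensitive: Boolean): String = {
  val resolved = schema.fields.find { field =>
    if (caseSensitive) field.name == ref else field.name.equalsIgnoreCase(ref)
  }
  resolved.map(_.name).getOrElse {
    throw new IllegalArgumentException(s"Column '$ref' does not exist in the table")
  }
}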

On Tue, Sep 4, 2018 at 7:46 PM Reynold Xin <r...@databricks.com> wrote:

> Ryan, Michael and I discussed this offline today. Some notes here:
>
> His use case is to support partitioning data by derived columns, rather
> than physical columns, because he didn't want his users to keep adding a
> "date" column when in reality it is purely derived from some timestamp
> column. We reached consensus that this is a great use case and something we
> should support.
>
> We are still debating how to do this at the API level. Two options:
>
> *Option 1.* Create a smaller-surface, parallel Expression library and use
> that for specifying partition columns. The bare-minimum class hierarchy would
> look like:
>
> trait Expression
>
> case class NamedFunction(name: String, args: Seq[Expression]) extends Expression
>
> case class Literal(value: Any) extends Expression
>
> case class ColumnReference(name: String) extends Expression
>
> These classes don't define how the expressions are evaluated; it'd be up to
> the data sources to interpret them. As an example, for a table partitioned by
> date(ts), Spark would pass the following to the underlying data source:
>
> NamedFunction("date", ColumnReference("ts") :: Nil)
>
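> On the source side, interpreting that structure is just a pattern match. For
> example, a source that understands a "date" transform might do something like
> this (purely illustrative, assuming the case classes above; describe is not a
> real API):
>
> def describe(expr: Expression): String = expr match {
>   case NamedFunction("date", Seq(ColumnReference(col))) =>
>     s"daily partitions derived from column '$col'"
>   case ColumnReference(col) => s"identity partition on column '$col'"
>   case Literal(value) => s"constant partition value $value"
>   case other => throw new UnsupportedOperationException(s"unknown transform: $other")
> }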
>
> *Option 2.* Spark passes strings over to the data sources. For the above
> example, Spark would simply pass the string "date(ts)".
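>
> With this option each source would have to parse and validate the string
> itself, along the lines of this sketch (illustrative only; parsePartitionExpr
> is not a real API):
>
> // Each source would parse strings like "date(ts)" or "ts" on its own.
> val FunctionCall = """(\w+)\((\w+)\)""".r
>
> def parsePartitionExpr(expr: String): (Option[String], String) = expr.trim match {
>   case FunctionCall(func, column) => (Some(func), column)  // e.g. "date(ts)"
>   case column => (None, column)                            // bare column, e.g. "ts"
> }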
>
>
> The pros and cons of 1 vs 2 are basically the inverse of each other. Option 1
> creates a more rigid structure, with extra complexity in API design. Option 2
> is less structured but more flexible. Option 1 gives Spark the opportunity to
> enforce that column references are valid (though not the function names),
> whereas with option 2 validation would be entirely up to the data sources.
>
>
>
> On Wed, Aug 15, 2018 at 2:27 PM Ryan Blue <rb...@netflix.com> wrote:
>
>> I think I found a good solution to the problem of using Expression in the
>> TableCatalog API and in the DeleteSupport API.
>>
>> For DeleteSupport, there is already a stable and public subset of
>> Expression named Filter that can be used to pass filters. The reason
>> DeleteSupport would use Expression is to support more complex expressions
>> like to_date(ts) = '2018-08-15' that are translated to ts >=
>> 1534316400000000 AND ts < 1534402800000000. But that translation can be done
>> in Spark rather than in each data source, so I think DeleteSupport should use
>> Filter instead. I updated the DeleteSupport PR #21308
>> <https://github.com/apache/spark/pull/21308> with these changes.
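>>
>> To make that concrete, using the existing public Filter classes the
>> translated predicate above would reach the source roughly as follows
>> (deleteFilter is just a local example value):
>>
>> import org.apache.spark.sql.sources.{And, Filter, GreaterThanOrEqual, LessThan}
>>
>> // Roughly what a source would receive once Spark has rewritten
>> // to_date(ts) = '2018-08-15' into a range on ts (microsecond values):
>> val deleteFilter: Filter = And(
>>   GreaterThanOrEqual("ts", 1534316400000000L),
>>   LessThan("ts", 1534402800000000L))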
>>
>> Also, I agree that the DataSourceV2 API should not expose Expression, so I
>> opened SPARK-25127 <https://issues.apache.org/jira/browse/SPARK-25127> to
>> track removing SupportsPushDownCatalystFilter.
>>
>> For TableCatalog, I took a similar approach rather than introducing a
>> parallel Expression API: I created a PartitionTransform API (like Filter)
>> that communicates the transformation function, its parameters (like the
>> number of buckets), and the columns it references. I updated the TableCatalog
>> PR #21306 <https://github.com/apache/spark/pull/21306> to use
>> PartitionTransform instead of Expression, and I updated the text of the SPIP
>> doc
>> <https://docs.google.com/document/d/1zLFiA1VuaWeVxeTDXNg8bL6GP3BVoOZBkewFtEnjEoo/edit#heading=h.m45webtwxf2d>.
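>>
>> Roughly, the shape I have in mind is something like this (illustrative only,
>> not the exact classes or names in the PR):
>>
>> sealed trait PartitionTransform {
>>   def name: String
>>   def references: Seq[String]  // columns the transform reads
>> }
>>
>> case class IdentityTransform(column: String) extends PartitionTransform {
>>   def name: String = "identity"
>>   def references: Seq[String] = Seq(column)
>> }
>>
>> case class BucketTransform(numBuckets: Int, columns: Seq[String]) extends PartitionTransform {
>>   def name: String = "bucket"
>>   def references: Seq[String] = columns
>> }
>>
>> case class ApplyTransform(name: String, columns: Seq[String]) extends PartitionTransform {
>>   def references: Seq[String] = columns
>> }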
>>
>> I also raised a concern about needing to wait for Spark to add support for
>> new expressions (now partition transforms). To get around this, I added an
>> apply transform that passes the name of a function and an input column. That
>> way, users can still pass transforms that Spark doesn't know about to data
>> sources by name: apply("source_function", "colName").
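>>
>> So with the illustrative transform classes sketched above, a table's
>> partitioning could mix transforms Spark understands with source-specific ones
>> passed purely by name:
>>
>> val partitioning: Seq[PartitionTransform] = Seq(
>>   ApplyTransform("date", Seq("ts")),
>>   ApplyTransform("source_function", Seq("colName")))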
>>
>> Please have a look at the updated pull requests and SPIP doc and comment!
>>
>> rb
>>
>

-- 
Ryan Blue
Software Engineer
Netflix
