Ryan, Michael and I discussed this offline today. Some notes here:

His use case is to support partitioning data by derived columns rather
than physical columns: he doesn't want his users to keep adding a "date"
column when it is really just derived from some timestamp column. We
reached consensus that this is a great use case and something we should
support.

We are still debating how to do this at the API level. Two options:

*Option 1.* Create a smaller-surface, parallel Expression library, and use
that for specifying partition columns. The bare minimum class hierarchy
would look like:

trait Expression

case class NamedFunction(name: String, args: Seq[Expression]) extends Expression

case class Literal(value: Any) extends Expression

case class ColumnReference(name: String) extends Expression

These classes don't define how the expressions are evaluated; it would be
up to each data source to interpret them. As an example, for a table
partitioned by date(ts), Spark would pass the following to the underlying
data source:

NamedFunction("date", ColumnReference("ts") :: Nil)


*Option 2.* Spark passes strings over to the data sources. For the above
example, Spark simply passes the string "date(ts)" over.
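
Each data source would then have to parse that string itself. A minimal
sketch of what that could look like (the regex and the tuple it produces
are illustrative only, not a proposed format):

// matches a single-argument transform such as "date(ts)"
val TransformPattern = """(\w+)\((\w+)\)""".r

def parse(spec: String): (String, String) = spec match {
  case TransformPattern(func, col) => (func, col)  // "date(ts)" -> ("date", "ts")
  case _ => throw new IllegalArgumentException(s"Cannot parse partition spec: $spec")
}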


The pros and cons of 1 vs. 2 are basically the inverse of each other.
Option 1 creates a more rigid structure, with extra complexity in the API
design. Option 2 is less structured but more flexible. Option 1 gives Spark
the opportunity to enforce that column references are valid (though not the
actual function names), whereas with Option 2 all validation would be up to
the data sources.



On Wed, Aug 15, 2018 at 2:27 PM Ryan Blue <rb...@netflix.com> wrote:

> I think I found a good solution to the problem of using Expression in the
> TableCatalog API and in the DeleteSupport API.
>
> For DeleteSupport, there is already a stable and public subset of
> Expression named Filter that can be used to pass filters. The reason why
> DeleteSupport would use Expression is to support more complex expressions
> like to_date(ts) = '2018-08-15' that are translated to ts >=
> 1534316400000000 AND ts < 1534402800000000. But this can be done in
> Spark instead of in the data sources, so I think DeleteSupport should use
> Filter instead. I updated the DeleteSupport PR #21308
> <https://github.com/apache/spark/pull/21308> with these changes.
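>
> As a sketch, the translated predicate above maps onto the existing public
> Filter classes like this (column name and values taken from the example):
>
> import org.apache.spark.sql.sources.{And, Filter, GreaterThanOrEqual, LessThan}
>
> // to_date(ts) = '2018-08-15', rewritten by Spark into a range on ts
> val deleteFilter: Filter = And(
>   GreaterThanOrEqual("ts", 1534316400000000L),
>   LessThan("ts", 1534402800000000L))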
>
> I also agree that the DataSourceV2 API should not expose Expression, so
> I opened SPARK-25127 to track removing SupportsPushDownCatalystFilter
> <https://issues.apache.org/jira/browse/SPARK-25127>.
>
> For TableCatalog, I took a similar approach rather than introducing a
> parallel Expression API: I created a PartitionTransform API (like
> Filter) that communicates the transformation function, function parameters
> like the number of buckets, and column references. I updated the
> TableCatalog PR #21306 <https://github.com/apache/spark/pull/21306> to use
> PartitionTransform instead of Expression, and I updated the text of the
> SPIP doc
> <https://docs.google.com/document/d/1zLFiA1VuaWeVxeTDXNg8bL6GP3BVoOZBkewFtEnjEoo/edit#heading=h.m45webtwxf2d>.
>
> I also raised a concern about needing to wait for Spark to add support for
> new expressions (now partition transforms). To get around this, I added an
> apply transform that passes the name of a function and an input column.
> That way, users can still pass transforms that Spark doesn’t know about to
> data sources by name: apply("source_function", "colName").
>
> Please have a look at the updated pull requests and SPIP doc and comment!
>
> rb
>
