Russell, your special columns wouldn't actually work with option 1, because Spark would have to fail them in analysis when there is no actual physical column.
On Tue, Sep 4, 2018 at 9:12 PM Russell Spitzer <russell.spit...@gmail.com> wrote:

> I'm a big fan of 1 as well. I had to implement something similar using custom expressions, and it was a bit more work than it should be. In particular, our use case is that columns have certain metadata (ttl, writetime) which exist not as separate columns but as special values which can be surfaced.
>
> I still don't have a good solution for the same thing at write time, though, since the problem is a bit asymmetric for us. While you can read metadata from any particular cell, on write you specify it for the whole row.
>
> On Tue, Sep 4, 2018 at 11:04 PM Ryan Blue <rb...@netflix.com.invalid> wrote:
>
>> Thanks for posting the summary. I'm strongly in favor of option 1.
>>
>> I think that API footprint is fairly small, but worth it. Not only does it make sources easier to implement by handling parsing, it also makes sources more reliable because Spark handles validation the same way across sources.
>>
>> A good example is making sure that the referenced columns exist in the table, which should be done using the case sensitivity of the analyzer. Spark would pass normalized column names that match the case of the declared columns, to ensure that there isn't a problem if Spark is case insensitive but the source doesn't implement it. And the source wouldn't have to know about Spark's case sensitivity settings at all.
>>
>> On Tue, Sep 4, 2018 at 7:46 PM Reynold Xin <r...@databricks.com> wrote:
>>
>>> Ryan, Michael and I discussed this offline today. Some notes here:
>>>
>>> His use case is to support partitioning data by derived columns, rather than physical columns, because he didn't want his users to keep adding the "date" column when in reality it is purely derived from some timestamp column. We reached consensus that this is a great use case and something we should support.
>>>
>>> We are still debating how to do this at the API level. Two options:
>>>
>>> *Option 1.* Create a smaller-surface, parallel Expression library and use that for specifying partition columns. The bare minimum class hierarchy would look like:
>>>
>>> trait Expression
>>>
>>> class NamedFunction(name: String, args: Seq[Expression]) extends Expression
>>>
>>> class Literal(value: Any) extends Expression
>>>
>>> class ColumnReference(name: String) extends Expression
>>>
>>> These classes don't define how the expressions are evaluated; it'd be up to the data sources to interpret them. As an example, for a table partitioned by date(ts), Spark would pass the following to the underlying data source:
>>>
>>> NamedFunction("date", ColumnReference("timestamp") :: Nil)
>>>
>>> *Option 2.* Spark passes strings over to the data sources. For the above example, Spark simply passes "date(ts)" as a string over.
>>>
>>> The pros/cons of 1 vs 2 are basically the inverse of each other. Option 1 creates more rigid structure, with extra complexity in API design. Option 2 is less structured but more flexible. Option 1 gives Spark the opportunity to enforce that column references are valid (but not the actual function names), whereas with option 2 it would be up to the data sources to validate.
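To make option 1 above concrete, the following is a minimal, hypothetical sketch of how a data source could interpret the expression tree Spark would pass for a table partitioned by date(ts). The case classes mirror the hierarchy quoted above; PartitionInterpreter and its method are illustrative only and not part of any proposed Spark API.

    import java.time.{Instant, ZoneOffset}

    // Mini expression ADT from option 1, exactly as sketched in the thread.
    sealed trait Expression
    case class NamedFunction(name: String, args: Seq[Expression]) extends Expression
    case class Literal(value: Any) extends Expression
    case class ColumnReference(name: String) extends Expression

    object PartitionInterpreter {
      // Compute the partition value for one row (column name -> value).
      // Only the date(ts) example from the thread and identity partitioning are handled.
      def partitionValue(transform: Expression, row: Map[String, Any]): Any = transform match {
        case NamedFunction("date", Seq(ColumnReference(col))) =>
          val micros = row(col).asInstanceOf[Long] // timestamp in microseconds
          Instant.ofEpochMilli(micros / 1000).atZone(ZoneOffset.UTC).toLocalDate
        case ColumnReference(col) => row(col)
        case Literal(value)       => value
        case other =>
          throw new UnsupportedOperationException(s"Unsupported transform: $other")
      }
    }

    object Demo extends App {
      // For a table partitioned by date(ts), Spark would hand the source this tree:
      val transform = NamedFunction("date", ColumnReference("timestamp") :: Nil)
      val row = Map("timestamp" -> 1534316400000000L)
      println(PartitionInterpreter.partitionValue(transform, row)) // 2018-08-15
    }

The point of the sketch is the one made in the thread: Spark can validate the column reference, while the meaning of "date" stays entirely with the source.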
>>> On Wed, Aug 15, 2018 at 2:27 PM Ryan Blue <rb...@netflix.com> wrote:
>>>
>>>> I think I found a good solution to the problem of using Expression in the TableCatalog API and in the DeleteSupport API.
>>>>
>>>> For DeleteSupport, there is already a stable and public subset of Expression, named Filter, that can be used to pass filters. The reason why DeleteSupport would use Expression is to support more complex expressions, like to_date(ts) = '2018-08-15' translated to ts >= 1534316400000000 AND ts < 1534402800000000. But this translation can be done in Spark instead of in the data sources, so I think DeleteSupport should use Filter instead. I updated the DeleteSupport PR #21308 <https://github.com/apache/spark/pull/21308> with these changes.
>>>>
>>>> I also agree that the DataSourceV2 API should not expose Expression, so I opened SPARK-25127 <https://issues.apache.org/jira/browse/SPARK-25127> to track removing SupportsPushDownCatalystFilter.
>>>>
>>>> For TableCatalog, I took a similar approach rather than introducing a parallel Expression API: I created a PartitionTransform API (like Filter) that communicates the transformation function, function parameters like the number of buckets, and column references. I updated the TableCatalog PR #21306 <https://github.com/apache/spark/pull/21306> to use PartitionTransform instead of Expression, and I updated the text of the SPIP doc <https://docs.google.com/document/d/1zLFiA1VuaWeVxeTDXNg8bL6GP3BVoOZBkewFtEnjEoo/edit#heading=h.m45webtwxf2d>.
>>>>
>>>> I also raised a concern about needing to wait for Spark to add support for new expressions (now partition transforms). To get around this, I added an apply transform that passes the name of a function and an input column. That way, users can still pass transforms that Spark doesn't know about by name to data sources: apply("source_function", "colName").
>>>>
>>>> Please have a look at the updated pull requests and SPIP doc and comment!
>>>>
>>>> rb
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
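As an aside, the DeleteSupport translation described above, where Spark itself rewrites to_date(ts) = '2018-08-15' into a range on ts before calling the source, can be expressed with the existing public Filter classes along these lines. This is only a sketch; the deleteWhere call at the end is a hypothetical method name, not the interface from PR #21308.

    import org.apache.spark.sql.sources.{And, Filter, GreaterThanOrEqual, LessThan}

    object DeleteFilterExample {
      // Spark, not the data source, performs the rewrite; the bounds below are the
      // microsecond values quoted in the thread for 2018-08-15.
      val deleteFilter: Filter = And(
        GreaterThanOrEqual("ts", 1534316400000000L),
        LessThan("ts", 1534402800000000L))

      // A source implementing DeleteSupport then only has to understand plain Filters:
      // table.deleteWhere(Array(deleteFilter))   // hypothetical method name
    }

The benefit is the one noted in the thread: the source never sees Catalyst's Expression, and the more complex rewrite logic stays in Spark, consistent across sources.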