There's another PR open to expose this more publicly on the Python side (https://github.com/apache/spark/pull/27331).
To sum up, I would like to make sure we know the answers to these:
- Is this expression only for partitions, or do we plan to expose it to
replace other existing expressions as some kind of public DSv2 expression
API?
- Do we want to support other expressions here?
- If so, why do we need partition-specific expressions?
- If not, why don't we use a different syntax and class for this API?
- What about exposing a native function to allow transforms, like a UDF?

Ryan and Wenchen, do you mind answering these questions?

On Fri, Jan 17, 2020 at 10:25 AM, Hyukjin Kwon <gurwls...@gmail.com> wrote:

> Thanks for giving me some context and clarification, Ryan.
>
> I think I was rather proposing to revert because I don't see an explicit
> plan here and it has been left half-done for a long while.
> From reading the PR description and code, I could not guess in which way
> we should fix this API (e.g., is this expression only for partitions, or
> a replacement for all expressions?). Also, if you take a look at the
> commit log, it has not changed for 10 months apart from code moves and
> minor fixes.
>
> Do you mind if I ask how we plan to extend this feature? For example:
> - whether we want to replace existing expressions in the end
> - whether we want to add a copy of the expressions for some reason
> - how we will handle ambiguity between the expressions supported by
> Spark and by the various data source implementations
> - if we can't tell which other expressions are supported here, why we
> don't just use a different syntax to make that clear
>
> If we have such a plan or doc, so that people can fix it accordingly
> with incremental improvements, I am good to keep it.
>
> Here are some follow-up questions and answers:
>
> > I don't think there is reason to revert this simply because of some of
> > the early choices, like deciding to start a public expression API. If
> > you'd like to extend this to "fix" areas where you find it confusing,
> > then please do.
>
> If it's clear that we should redesign the API, or there is no further
> plan for that API at this moment, I think reverting can be an option,
> particularly considering that the code freeze is close. For example, why
> didn't we try a UDF-like approach that exposes only a function interface?
>
> > The idea was that Spark needs a public expression API anyway for other
> > uses
>
> I was wondering why we need a public expression API in DSv2. Are there
> places that UDFs can't cover?
> As said above, it requires duplicating existing expressions, and I
> wonder if that is worthwhile.
> Judging from the stub of the Transform API, it looks like we want this,
> but I don't know why.
>
> > None of this has been confusing or misleading for our users, who
> > caught on quickly.
>
> Maybe the simple cases wouldn't cause much confusion if users were
> already told about them.
> However, once we consider the differences and subtleties, I doubt the
> users already know the answers:
>
> CREATE TABLE table(col INT) USING parquet PARTITIONED BY *transform(col)*
>
> - It looks like an expression, which suggests that other expressions and
> combinations are allowed.
> - Since the expressions are handled by DSv2, the behaviour depends on
> the implementation; e.g., using *transform* against data source
> implementation A and against B gives different behaviour (see the sketch
> below).
> - Likewise, if Spark itself supports *transform* here, the behaviour
> will be different again.
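>
> For illustration, a rough sketch of that ambiguity (only the Transform
> interface and its `name` come from the DSv2 API, package path as of
> current master; the two source objects and their interpretations are
> made up):
>
> import java.time.LocalDateTime
> import org.apache.spark.sql.connector.expressions.Transform
>
> object SourceA {
>   // source A decides that "transform" means yearly partitions
>   def partitionKey(t: Transform, ts: LocalDateTime): Any = t.name match {
>     case "transform" => ts.getYear
>     case other => throw new UnsupportedOperationException(other)
>   }
> }
>
> object SourceB {
>   // source B decides that the same name means daily partitions
>   def partitionKey(t: Transform, ts: LocalDateTime): Any = t.name match {
>     case "transform" => ts.toLocalDate
>     case other => throw new UnsupportedOperationException(other)
>   }
> }
>
> Nothing in the API ties the name "transform" to either behaviour.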
>
> On Fri, Jan 17, 2020 at 2:36 AM, Ryan Blue <rb...@netflix.com> wrote:
>
>> Hi everyone,
>>
>> Let me recap some of the discussions that got us to where we are with
>> this today. Hopefully that will provide some clarity.
>>
>> The purpose of partition transforms is to allow source implementations
>> to handle partitioning internally. Right now, users are responsible for
>> this. For example, users will transform timestamps into date strings
>> when writing, and other people will provide a filter on those date
>> strings when scanning. This is error-prone: users commonly forget to add
>> partition filters in addition to data filters; if anyone uses the wrong
>> format or transformation, queries will silently return incorrect
>> results; etc. But sources can (and should) automatically handle storing
>> and retrieving data internally, because it is much easier for users.
>>
>> When we first proposed transforms, I wanted to use Expression. But
>> Reynold rightly pointed out that Expression is an internal API that
>> should not be exposed. So we decided to compromise by building a public
>> expressions API, like the public Filter API, for the initial purpose of
>> passing transform expressions to sources. The idea was that Spark needs
>> a public expression API anyway for other uses, like requesting a
>> distribution and ordering for a writer. To keep things simple, we chose
>> to build a minimal public expression API and expand it incrementally as
>> we need more features.
>>
>> We also considered whether to parse all expressions and convert only
>> transformations to the public API, or to parse just transformations. We
>> went with just parsing transformations because it was easier, and we can
>> expand it to improve error messages later.
>>
>> I don't think there is reason to revert this simply because of some of
>> the early choices, like deciding to start a public expression API. If
>> you'd like to extend this to "fix" areas where you find it confusing,
>> then please do. We know that by parsing more expressions we could
>> improve error messages. But that's not to say that we need to revert it.
>>
>> None of this has been confusing or misleading for our users, who caught
>> on quickly.
>>
>> On Thu, Jan 16, 2020 at 5:14 AM Hyukjin Kwon <gurwls...@gmail.com> wrote:
>>
>>> I think the problem here is whether there is an explicit plan or not.
>>> The PR was merged one year ago, and not many changes have been made to
>>> this API to address the main concerns mentioned.
>>> Also, the follow-up JIRA requested seems still open:
>>> https://issues.apache.org/jira/browse/SPARK-27386
>>> I heard this was already discussed, but I cannot find the summary of
>>> the meeting or any documentation.
>>>
>>> I would like to understand how we plan to extend it. I had a couple of
>>> questions, such as:
>>> - Why can't we use something like a UDF interface, for example?
>>> - Is this expression only for partitions, or do we plan to expose it
>>> to replace other existing expressions?
>>>
>>> > About extensibility, it's similar to DS V1 Filter again. We don't
>>> > cover all the expressions at the beginning, but we can add more in
>>> > future versions when needed. The data source implementations should
>>> > be defensive and fail when seeing unrecognized Filter/Transform.
>>>
>>> I think there are differences in that:
>>> - A DSv1 filter works whether it is pushed down or not (see the sketch
>>> below). However, that is not the case here.
>>> - There are too many expressions, whereas the number of predicates is
>>> relatively limited. If we plan to push all expressions eventually, I
>>> doubt this is a good idea.
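>>>
>>> For reference, a minimal sketch of the DS V1 Filter pattern referred
>>> to above (EqualTo, GreaterThan and Filter are existing classes in
>>> org.apache.spark.sql.sources; the `isSupported` helper is made up):
>>>
>>> import org.apache.spark.sql.sources.{EqualTo, Filter, GreaterThan}
>>>
>>> def isSupported(f: Filter): Boolean = f match {
>>>   case EqualTo(_, _)     => true  // semantics fixed by the class itself
>>>   case GreaterThan(_, _) => true
>>>   case _                 => false // defensive: reject unrecognized filters
>>> }
>>>
>>> A rejected filter still works because Spark re-evaluates filters on
>>> top of the scan; an unrecognized Transform has no such fallback.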
>>>
>>> On Thu, Jan 16, 2020 at 9:22 PM, Wenchen Fan <cloud0...@gmail.com> wrote:
>>>
>>>> The DS v2 project is still evolving, so half-baked APIs are sometimes
>>>> inevitable. This feature is definitely in the right direction to allow
>>>> more flexible partition implementations, but there are a few problems
>>>> we can discuss.
>>>>
>>>> About expression duplication: this is an existing design choice. We
>>>> don't want to expose the Expression class directly, but we do need to
>>>> expose some Expression-like stuff in the developer APIs. So we pick
>>>> some basic expressions, make a copy, and create a public version of
>>>> them. This is what we did for the DS V1 Filter, and I think we can
>>>> continue to do this for the DS v2 Transform.
>>>>
>>>> About extensibility: it's similar to the DS V1 Filter again. We don't
>>>> cover all the expressions at the beginning, but we can add more in
>>>> future versions when needed. The data source implementations should be
>>>> defensive and fail when seeing an unrecognized Filter/Transform.
>>>>
>>>> About compatibility: this is where I have a concern as well. For the
>>>> DS V1 Filter, we just expose all the Filter classes, like `EqualTo`,
>>>> `GreaterThan`, etc. These classes have well-defined semantics. For the
>>>> DS V2 Transform, we only expose the Transform interface, and data
>>>> sources need to look at `Transform#name` and search the documentation
>>>> to find the semantics. What's worse, the parser/analyzer allows an
>>>> arbitrary string as a Transform name, so it's impossible to have
>>>> well-defined semantics, and different sources may also give different
>>>> semantics to the same Transform name.
>>>>
>>>> I'd suggest we forbid arbitrary strings as Transforms (the
>>>> ApplyTransform class). We can even follow the DS V1 Filter and expose
>>>> the classes directly.
>>>>
>>>> On Thu, Jan 16, 2020 at 6:56 PM Hyukjin Kwon <gurwls...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I would like to suggest taking one step back from
>>>>> https://github.com/apache/spark/pull/24117 and rethinking it.
>>>>> I am writing this email because I have raised the issue a few times
>>>>> without getting enough responses promptly, and the code freeze is
>>>>> close.
>>>>>
>>>>> In particular, please refer to the comments below for the full
>>>>> context:
>>>>> - https://github.com/apache/spark/pull/24117#issuecomment-568891483
>>>>> - https://github.com/apache/spark/pull/24117#issuecomment-568614961
>>>>>
>>>>> In short, this PR added an API in DSv2:
>>>>>
>>>>> CREATE TABLE table(col INT) USING parquet PARTITIONED BY *transform(col)*
>>>>>
>>>>> so that people can write classes implementing *transform(col)*
>>>>> specifically for partition columns.
>>>>>
>>>>> However, there are some design concerns that do not look properly
>>>>> addressed.
>>>>>
>>>>> Note that one of the main goals is to avoid half-baked or
>>>>> just-work-for-now APIs, yet this definitely looks half-completed.
>>>>> Therefore, I would like to propose taking one step back and reverting
>>>>> it for now. Please see the concerns listed below.
>>>>>
>>>>> *Duplication of existing expressions*
>>>>> It seems like existing expressions are going to be duplicated. See
>>>>> the new APIs added below:
>>>>>
>>>>> def years(column: String): YearsTransform =
>>>>>   YearsTransform(reference(column))
>>>>> def months(column: String): MonthsTransform =
>>>>>   MonthsTransform(reference(column))
>>>>> def days(column: String): DaysTransform =
>>>>>   DaysTransform(reference(column))
>>>>> def hours(column: String): HoursTransform =
>>>>>   HoursTransform(reference(column))
>>>>> ...
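>>>>>
>>>>> For instance, `years` above parallels the expression Catalyst already
>>>>> has internally (a sketch; Year and UnresolvedAttribute are existing
>>>>> internal classes, which is exactly why they cannot be exposed):
>>>>>
>>>>> import org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute
>>>>> import org.apache.spark.sql.catalyst.expressions.Year
>>>>>
>>>>> // internal expression, already available to Catalyst:
>>>>> val internal = Year(UnresolvedAttribute("ts"))
>>>>> // the new public copy added by this PR, per the listing above:
>>>>> val public = years("ts") // == YearsTransform(reference("ts"))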
>>>>>
>>>>> It looks like this will require adding a copy of our existing
>>>>> expressions over time.
>>>>>
>>>>> *Limited Extensibility*
>>>>> It has a clear limitation. It looks as if other expressions could be
>>>>> combined here (e.g., `concat(years(col) + days(col))`); however, the
>>>>> current design is impossible to extend that way. It just maps
>>>>> transformName directly to an implementation class and passes the
>>>>> arguments along:
>>>>>
>>>>> transform
>>>>>     ...
>>>>>     | transformName=identifier
>>>>>       '(' argument+=transformArgument (',' argument+=transformArgument)* ')'    #applyTransform
>>>>>     ;
>>>>>
>>>>> It looks as though general expressions are supported; however, they
>>>>> are not.
>>>>> - If we should support them, the design should have taken that into
>>>>> account.
>>>>> - If we should not support them, a different syntax might have to be
>>>>> used instead.
>>>>>
>>>>> *Limited Compatibility Management*
>>>>> The name can be an arbitrary string. For instance, if a "transform"
>>>>> name later becomes supported on the Spark side, that name is
>>>>> preempted by Spark. Any data source that already supported such a
>>>>> name becomes incompatible.
>>>>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix