There's another PR open to expose this more publicly on the Python side (https://github.com/apache/spark/pull/27331).
To sum up, I would like to make sure we know the answers to these:
- Is this expression only for partitions, or do we plan to expose it to
replace other existing expressions as some kind of public DSv2 expression
API?
- Do we want to support other expressions here?
- If so, why do we need partition-specific expressions?
- If not, why don't we use a different syntax and class for this API?
- What about exposing a native function to allow transforms, like a UDF?

Ryan and Wenchen, do you mind answering these questions?

On Fri, Jan 17, 2020 at 10:25 AM, Hyukjin Kwon <gurwls...@gmail.com> wrote:

> Thanks for giving me some context and clarification, Ryan.
>
> I think I was rather proposing to revert because I don't see an explicit
> plan here and it has been left half-done for a long while.
> From reading the PR description and code, I could not guess in which way
> we should fix this API (e.g., is this expression only for partitions, or
> a replacement for all expressions?). Also, if you take a look at the
> commit log, it has not changed for 10 months apart from code moves and
> minor fixes.
>
> Do you mind if I ask how we plan to extend this feature? For example:
> - whether we want to replace existing expressions in the end
> - whether we want to add a copy of the expressions for some reason
> - how we will handle ambiguity between the expressions supported by
> Spark and by the various data source implementations
> - if we can't tell which other expressions are supported here, why we
> don't just use a different syntax to make that clear
>
> If we have such a plan or doc, so that people can fix it accordingly
> with incremental improvements, I am good to keep it.
>
> Here are some follow-up questions and answers:
>
> > I don't think there is reason to revert this simply because of some of
> > the early choices, like deciding to start a public expression API. If
> > you'd like to extend this to "fix" areas where you find it confusing,
> > then please do.
>
> If it's clear that we should redesign the API, or there is no further
> plan for that API at this moment, I think reverting can be an option,
> particularly considering that the code freeze is close. For example, why
> didn't we try a UDF-like approach that exposes only a function interface?
>
> > The idea was that Spark needs a public expression API anyway for other
> > uses
>
> I was wondering why we need a public expression API in DSv2. Are there
> places that UDFs can't cover?
> As said above, it requires duplicating existing expressions, and I
> wonder if that is worthwhile.
> Judging from the stub of the Transform API, it looks like we want this,
> but I don't know why.
>
> > None of this has been confusing or misleading for our users, who
> > caught on quickly.
>
> Maybe the simple cases wouldn't cause much confusion if users were
> already told about them.
> However, once we consider the differences and subtleties, I doubt the
> users already know the answers:
>
> CREATE TABLE table(col INT) USING parquet PARTITIONED BY *transform(col)*
>
> - It looks like an expression, which suggests that other expressions and
> combinations are allowed.
> - Since the expressions are handled by DSv2, the behaviour depends on
> the implementation; e.g., using *transform* against data source
> implementation A and against B gives different behaviour (see the sketch
> below).
> - Likewise, if Spark itself supports *transform* here, the behaviour
> will be different again.
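>
> For illustration, a rough sketch of that ambiguity (only the Transform
> interface and its `name` come from the DSv2 API, package path as of
> current master; the two source objects and their interpretations are
> made up):
>
> import java.time.LocalDateTime
> import org.apache.spark.sql.connector.expressions.Transform
>
> object SourceA {
>   // source A decides that "transform" means yearly partitions
>   def partitionKey(t: Transform, ts: LocalDateTime): Any = t.name match {
>     case "transform" => ts.getYear
>     case other => throw new UnsupportedOperationException(other)
>   }
> }
>
> object SourceB {
>   // source B decides that the same name means daily partitions
>   def partitionKey(t: Transform, ts: LocalDateTime): Any = t.name match {
>     case "transform" => ts.toLocalDate
>     case other => throw new UnsupportedOperationException(other)
>   }
> }
>
> Nothing in the API ties the name "transform" to either behaviour.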
>
> On Fri, Jan 17, 2020 at 2:36 AM, Ryan Blue <rb...@netflix.com> wrote:
>
>> Hi everyone,
>>
>> Let me recap some of the discussions that got us to where we are with
>> this today. Hopefully that will provide some clarity.
>>
>> The purpose of partition transforms is to allow source implementations
>> to handle partitioning internally. Right now, users are responsible for
>> this. For example, users will transform timestamps into date strings
>> when writing, and other people will provide a filter on those date
>> strings when scanning. This is error-prone: users commonly forget to add
>> partition filters in addition to data filters; if anyone uses the wrong
>> format or transformation, queries will silently return incorrect
>> results; etc. But sources can (and should) automatically handle storing
>> and retrieving data internally, because it is much easier for users.
>>
>> When we first proposed transforms, I wanted to use Expression. But
>> Reynold rightly pointed out that Expression is an internal API that
>> should not be exposed. So we decided to compromise by building a public
>> expressions API, like the public Filter API, for the initial purpose of
>> passing transform expressions to sources. The idea was that Spark needs
>> a public expression API anyway for other uses, like requesting a
>> distribution and ordering for a writer. To keep things simple, we chose
>> to build a minimal public expression API and expand it incrementally as
>> we need more features.
>>
>> We also considered whether to parse all expressions and convert only
>> transformations to the public API, or to parse just transformations. We
>> went with just parsing transformations because it was easier, and we can
>> expand it to improve error messages later.
>>
>> I don't think there is reason to revert this simply because of some of
>> the early choices, like deciding to start a public expression API. If
>> you'd like to extend this to "fix" areas where you find it confusing,
>> then please do. We know that by parsing more expressions we could
>> improve error messages. But that's not to say that we need to revert it.
>>
>> None of this has been confusing or misleading for our users, who caught
>> on quickly.
>>
>> On Thu, Jan 16, 2020 at 5:14 AM Hyukjin Kwon <gurwls...@gmail.com> wrote:
>>
>>> I think the problem here is whether there is an explicit plan or not.
>>> The PR was merged one year ago, and not many changes have been made to
>>> this API to address the main concerns mentioned.
>>> Also, the follow-up JIRA requested seems still open:
>>> https://issues.apache.org/jira/browse/SPARK-27386
>>> I heard this was already discussed, but I cannot find the summary of
>>> the meeting or any documentation.
>>>
>>> I would like to understand how we plan to extend it. I had a couple of
>>> questions, such as:
>>> - Why can't we use something like a UDF interface, for example?
>>> - Is this expression only for partitions, or do we plan to expose it
>>> to replace other existing expressions?
>>>
>>> > About extensibility, it's similar to DS V1 Filter again. We don't
>>> > cover all the expressions at the beginning, but we can add more in
>>> > future versions when needed. The data source implementations should
>>> > be defensive and fail when seeing unrecognized Filter/Transform.
>>>
>>> I think there are differences in that:
>>> - A DSv1 filter works whether it is pushed down or not (see the sketch
>>> below). However, that is not the case here.
>>> - There are too many expressions, whereas the number of predicates is
>>> relatively limited. If we plan to push all expressions eventually, I
>>> doubt this is a good idea.
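>>>
>>> For reference, a minimal sketch of the DS V1 Filter pattern referred
>>> to above (EqualTo, GreaterThan and Filter are existing classes in
>>> org.apache.spark.sql.sources; the `isSupported` helper is made up):
>>>
>>> import org.apache.spark.sql.sources.{EqualTo, Filter, GreaterThan}
>>>
>>> def isSupported(f: Filter): Boolean = f match {
>>>   case EqualTo(_, _)     => true  // semantics fixed by the class itself
>>>   case GreaterThan(_, _) => true
>>>   case _                 => false // defensive: reject unrecognized filters
>>> }
>>>
>>> A rejected filter still works because Spark re-evaluates filters on
>>> top of the scan; an unrecognized Transform has no such fallback.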
>>>
>>> On Thu, Jan 16, 2020 at 9:22 PM, Wenchen Fan <cloud0...@gmail.com> wrote:
>>>
>>>> The DS v2 project is still evolving, so half-baked APIs are sometimes
>>>> inevitable. This feature is definitely in the right direction to allow
>>>> more flexible partition implementations, but there are a few problems
>>>> we can discuss.
>>>>
>>>> About expression duplication: this is an existing design choice. We
>>>> don't want to expose the Expression class directly, but we do need to
>>>> expose some Expression-like stuff in the developer APIs. So we pick
>>>> some basic expressions, make a copy, and create a public version of
>>>> them. This is what we did for the DS V1 Filter, and I think we can
>>>> continue to do this for the DS v2 Transform.
>>>>
>>>> About extensibility: it's similar to the DS V1 Filter again. We don't
>>>> cover all the expressions at the beginning, but we can add more in
>>>> future versions when needed. The data source implementations should be
>>>> defensive and fail when seeing an unrecognized Filter/Transform.
>>>>
>>>> About compatibility: this is where I have a concern as well. For the
>>>> DS V1 Filter, we just expose all the Filter classes, like `EqualTo`,
>>>> `GreaterThan`, etc. These classes have well-defined semantics. For the
>>>> DS V2 Transform, we only expose the Transform interface, and data
>>>> sources need to look at `Transform#name` and search the documentation
>>>> to find the semantics. What's worse, the parser/analyzer allows an
>>>> arbitrary string as a Transform name, so it's impossible to have
>>>> well-defined semantics, and different sources may also give different
>>>> semantics to the same Transform name.
>>>>
>>>> I'd suggest we forbid arbitrary strings as Transforms (the
>>>> ApplyTransform class). We can even follow the DS V1 Filter and expose
>>>> the classes directly.
>>>>
>>>> On Thu, Jan 16, 2020 at 6:56 PM Hyukjin Kwon <gurwls...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I would like to suggest taking one step back from
>>>>> https://github.com/apache/spark/pull/24117 and rethinking it.
>>>>> I am writing this email because I have raised the issue a few times
>>>>> without getting enough responses promptly, and the code freeze is
>>>>> close.
>>>>>
>>>>> In particular, please refer to the comments below for the full
>>>>> context:
>>>>> - https://github.com/apache/spark/pull/24117#issuecomment-568891483
>>>>> - https://github.com/apache/spark/pull/24117#issuecomment-568614961
>>>>>
>>>>> In short, this PR added an API in DSv2:
>>>>>
>>>>> CREATE TABLE table(col INT) USING parquet PARTITIONED BY *transform(col)*
>>>>>
>>>>> so that people can write classes implementing *transform(col)*
>>>>> specifically for partition columns.
>>>>>
>>>>> However, there are some design concerns that do not look properly
>>>>> addressed.
>>>>>
>>>>> Note that one of the main goals is to avoid half-baked or
>>>>> just-work-for-now APIs, yet this definitely looks half-completed.
>>>>> Therefore, I would like to propose taking one step back and reverting
>>>>> it for now. Please see the concerns listed below.
>>>>>
>>>>> *Duplication of existing expressions*
>>>>> It seems like existing expressions are going to be duplicated. See
>>>>> the new APIs added below:
>>>>>
>>>>> def years(column: String): YearsTransform =
>>>>>   YearsTransform(reference(column))
>>>>> def months(column: String): MonthsTransform =
>>>>>   MonthsTransform(reference(column))
>>>>> def days(column: String): DaysTransform =
>>>>>   DaysTransform(reference(column))
>>>>> def hours(column: String): HoursTransform =
>>>>>   HoursTransform(reference(column))
>>>>> ...
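>>>>>
>>>>> For instance, `years` above parallels the expression Catalyst already
>>>>> has internally (a sketch; Year and UnresolvedAttribute are existing
>>>>> internal classes, which is exactly why they cannot be exposed):
>>>>>
>>>>> import org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute
>>>>> import org.apache.spark.sql.catalyst.expressions.Year
>>>>>
>>>>> // internal expression, already available to Catalyst:
>>>>> val internal = Year(UnresolvedAttribute("ts"))
>>>>> // the new public copy added by this PR, per the listing above:
>>>>> val public = years("ts") // == YearsTransform(reference("ts"))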
>>>>>
>>>>> It looks like this will require adding a copy of our existing
>>>>> expressions over time.
>>>>>
>>>>> *Limited Extensibility*
>>>>> It has a clear limitation. It looks as if other expressions could be
>>>>> combined here (e.g., `concat(years(col) + days(col))`); however, the
>>>>> current design is impossible to extend that way. It just maps
>>>>> transformName directly to an implementation class and passes the
>>>>> arguments along:
>>>>>
>>>>> transform
>>>>>     ...
>>>>>     | transformName=identifier
>>>>>       '(' argument+=transformArgument (',' argument+=transformArgument)* ')'    #applyTransform
>>>>>     ;
>>>>>
>>>>> It looks as though general expressions are supported; however, they
>>>>> are not.
>>>>> - If we should support them, the design should have taken that into
>>>>> account.
>>>>> - If we should not support them, a different syntax might have to be
>>>>> used instead.
>>>>>
>>>>> *Limited Compatibility Management*
>>>>> The name can be an arbitrary string. For instance, if a "transform"
>>>>> name later becomes supported on the Spark side, that name is
>>>>> preempted by Spark. Any data source that already supported such a
>>>>> name becomes incompatible.
>>>>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix