[jira] [Commented] (HUDI-3594) Support standard Spark functions in Filter Exprs in Data Skipping

Alexey Kudinkin (Jira) Wed, 09 Mar 2022 13:41:06 -0800


    [ 
https://issues.apache.org/jira/browse/HUDI-3594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17503863#comment-17503863
 ]


Alexey Kudinkin commented on HUDI-3594:
---------------------------------------

Among Spark's standard functions, the following preserve ordering under 
transformation:

Date/Timestamp
    - date
    - date_add
    - date_format
    - date_sub
    - from_unixtime
    - from_utc_timestamp
    - to_date
    - to_timestamp
    - to_unix_timestamp
    - unix_timestamp* (when converting)

Math
    - exp
    - expm1
    - hex* (unless it returns string)
    - ln/log/log10/log1p/log2
    - rank
    - shiftleft/shiftright
    - tanh/sinh
    - sqrt
    - | (bit OR)


Strings
    - truncate/left
    - lcase/lower/ucase/upper
    - repeat
    - rpad
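
To illustrate why order preservation matters for the functions listed above, here is a minimal sketch in plain Python (not Hudi or Spark code). The `FileStats` class and the `files_to_scan` helper are hypothetical names introduced only for this example; the sketch assumes the Column Stats Index exposes a per-file min and max for the source column, and that the predicate has the form `fn(col) > literal` with `fn` non-decreasing.

```python
import math
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class FileStats:
    """Hypothetical per-file statistics for a single source column."""
    path: str
    min_value: float
    max_value: float


def files_to_scan(stats: List[FileStats],
                  fn: Callable[[float], float],
                  literal: float) -> List[str]:
    """Return files that may hold rows satisfying fn(col) > literal.

    Assumption: fn is non-decreasing, so for every value v in a file
    fn(v) <= fn(max_value). If fn(max_value) <= literal, no row in that
    file can match and the file can be skipped.
    """
    return [s.path for s in stats if fn(s.max_value) > literal]


stats = [
    FileStats("f1.parquet", 0.0, 9.0),
    FileStats("f2.parquet", 16.0, 100.0),
]

# Predicate: sqrt(col) > 3.5 (sqrt is order-preserving on non-negative input)
candidates = files_to_scan(stats, math.sqrt, 3.5)
```

Here `f1.parquet` is pruned because `sqrt(9.0)` is `3.0`, which cannot exceed `3.5`, while `f2.parquet` has to be read.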

> Support standard Spark functions in Filter Exprs in Data Skipping
> -----------------------------------------------------------------
>
>                 Key: HUDI-3594
>                 URL: https://issues.apache.org/jira/browse/HUDI-3594
>             Project: Apache Hudi
>          Issue Type: Improvement
>            Reporter: Alexey Kudinkin
>            Assignee: Alexey Kudinkin
>            Priority: Blocker
>             Fix For: 0.11.0
>
>
> As part of this effort we plan to support (at the very least) a suite 
> of standard Spark functions when evaluating Data Filtering expressions within 
> the Data Skipping flow. For example, when a user issues the following query:
>  
> {code:java}
> SELECT ... WHERE date_format(ts, 'dd-MM-yyyy') > '01-01-2022'
> {code}
> We're able to relate such a query to our Column Stats Index appropriately, 
> and can therefore do Data Skipping not only on the "raw" columns, but 
> also on simple derived expressions on top of them (such as standard 
> function calls).
>  
> *It is important to note that only transformations that _preserve the 
> ordering of the source column_ can be applied. Transformations that do not 
> preserve the ordering render the Column Stats Index practically irrelevant 
> (since no assumption can be made that the values in the column derived by 
> such transformations are ordered).*
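
The caveat about non-order-preserving transformations can be seen with a small counterexample. This is a sketch in plain Python, not Hudi code; `preserves_order` is a hypothetical helper that only checks ordering on a handful of sample values (it is not a proof of monotonicity), and `abs` stands in for any transformation that does not preserve ordering.

```python
from typing import Callable, Sequence


def preserves_order(fn: Callable[[float], float],
                    samples: Sequence[float]) -> bool:
    """Check on sample values that a <= b implies fn(a) <= fn(b)."""
    ordered = sorted(samples)
    mapped = [fn(v) for v in ordered]
    return all(a <= b for a, b in zip(mapped, mapped[1:]))


# Values stored in one file; the index would only record min and max.
values = [-5, -1, 0, 2]
col_min, col_max = min(values), max(values)

# Treating abs(col_max) as an upper bound on abs(col) is wrong here:
naive_upper_bound = abs(col_max)
true_max = max(abs(v) for v in values)

# A filter such as abs(col) > 3 would wrongly skip this file if it
# trusted naive_upper_bound, even though the row with -5 matches.
wrongly_skipped = (naive_upper_bound <= 3) and (true_max > 3)
```

Because `abs` reorders negative values, the stored min and max no longer bound the derived column, which is the situation the description warns about.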



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
