[
https://issues.apache.org/jira/browse/HUDI-3594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17503863#comment-17503863
]
Alexey Kudinkin commented on HUDI-3594:
---------------------------------------
Among Spark standard functions, the following preserve ordering under
transformation:
Date/Timestamp
- date
- date_add
- date_format
- date_sub
- from_unixtime
- from_utc_timestamp
- to_date
- to_timestamp
- to_unix_timestamp
- unix_timestamp* (when converting)
Math
- exp
- expm1
- hex* (unless it returns string)
- ln/log/log10/log1p/log2
- rank
- shiftleft/shiftright
- tanh/sinh
- sqrt
- | (bit OR)
Strings
- truncate/left
- lcase/lower/ucase/upper
- repeat
- rpad
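
To illustrate why order preservation matters here, below is a minimal sketch
in plain Python (not Hudi code). It assumes a hypothetical ColumnStats record
holding per-file min/max values and a hypothetical helper
can_skip_greater_than; neither name comes from Hudi. For a non-decreasing
function fn, the largest value of fn(col) in a file is fn(max), so the file
can be skipped for "fn(col) > threshold" whenever fn(max) <= threshold.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnStats:
    """Hypothetical per-file min/max record for a single column."""
    file_name: str
    min_value: float
    max_value: float


def can_skip_greater_than(stats, fn, threshold):
    """Return True when no row in the file can satisfy fn(col) > threshold.

    Only valid if fn is non-decreasing over [min_value, max_value];
    in that case fn(max_value) bounds fn(col) from above.
    """
    return fn(stats.max_value) <= threshold


def files_to_read(all_stats, fn, threshold):
    """Keep only the files that might contain matching rows."""
    return [s.file_name for s in all_stats
            if not can_skip_greater_than(s, fn, threshold)]


stats = [
    ColumnStats("f1.parquet", 1.0, 4.0),
    ColumnStats("f2.parquet", 9.0, 25.0),
    ColumnStats("f3.parquet", 16.0, 100.0),
]

# Equivalent of: SELECT ... WHERE sqrt(col) > 4.5
selected = files_to_read(stats, math.sqrt, 4.5)

# Counter-example: abs() is not order-preserving on [-10, 1], so the same
# check wrongly reports the file as skippable even though abs(-10) > 5.
mixed_sign = ColumnStats("f4.parquet", -10.0, 1.0)
unsafe_skip = can_skip_greater_than(mixed_sign, abs, 5.0)
```

Here f1.parquet is pruned because sqrt(4.0) = 2.0 does not exceed 4.5, while
the abs() case shows how a transformation that does not preserve ordering
would lead to incorrectly skipped files.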
> Support standard Spark functions in Filter Exprs in Data Skipping
> -----------------------------------------------------------------
>
> Key: HUDI-3594
> URL: https://issues.apache.org/jira/browse/HUDI-3594
> Project: Apache Hudi
> Issue Type: Improvement
> Reporter: Alexey Kudinkin
> Assignee: Alexey Kudinkin
> Priority: Blocker
> Fix For: 0.11.0
>
>
> As part of this effort we're planning to (at the very least) support a suite
> of standard Spark functions when evaluating Data Filtering expressions within
> the Data Skipping flow. For example, when a user issues the following query
>
> {code:java}
> SELECT ... WHERE date_format(ts, 'dd-MM-yyyy') > '01-01-2022'
> {code}
> we're able to relate such a query to our Column Stats Index appropriately,
> and can therefore do Data Skipping not only on the "raw" columns, but
> also on simple derivative expressions on top of them (like standard
> function calls).
>
> *Important to note here is that only transformations that _preserve the
> ordering of the source column_ can be applied. Transformations that do not
> preserve the ordering render the Column Stats Index practically irrelevant
> (since no assumption can be made that values in the column derived by such
> transformations are ordered).*
--
This message was sent by Atlassian Jira
(v8.20.1#820001)