[
https://issues.apache.org/jira/browse/HUDI-4250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated HUDI-4250:
---------------------------------
Labels: pull-request-available (was: )
> Optimize Data Skipping to enable in-memory Column Stats Index
> --------------------------------------------------------------
>
> Key: HUDI-4250
> URL: https://issues.apache.org/jira/browse/HUDI-4250
> Project: Apache Hudi
> Issue Type: Bug
> Reporter: Alexey Kudinkin
> Assignee: Alexey Kudinkin
> Priority: Blocker
> Labels: pull-request-available
> Fix For: 0.12.0
>
>
> Executing on Spark carries a non-trivial amount of overhead, so it is only
> worthwhile when parallelizing the execution offers a considerable
> speed-up.
> For the Data Skipping sequence that reads the Column Stats Index, going
> through Spark can only be justified for *very large* tables (100s of 1000s
> of files, 100s of columns).
> As such, we have to provide an alternative way of fetching the Column Stats
> Index within the reading process itself, to avoid the penalty of scheduling
> a more heavy-weight execution through the Spark engine.
> This, along with HUDI-4202, will considerably speed up Data Skipping, which
> currently has an overhead of *at least* 1-2s even for tables with a handful
> of files.
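
The trade-off described in the issue can be illustrated with a minimal sketch. It is written in Python for brevity and is not Hudi code: the names `ReadMode`, `choose_read_mode` and `IN_MEMORY_ROW_THRESHOLD` are hypothetical, the threshold value is an arbitrary placeholder, and the size estimate assumes roughly one index row per (file, column) pair. It only shows the idea of reading the index inside the reading process for small tables and falling back to the engine for very large ones.

```python
from enum import Enum


class ReadMode(Enum):
    IN_MEMORY = "in-memory"  # read the index inside the reading process
    ENGINE = "engine"        # schedule a distributed job on the engine


# Hypothetical cut-off, not a Hudi default: the estimated number of index
# rows above which the engine's scheduling overhead is assumed to pay off.
IN_MEMORY_ROW_THRESHOLD = 100_000


def choose_read_mode(num_files: int, num_columns: int,
                     threshold: int = IN_MEMORY_ROW_THRESHOLD) -> ReadMode:
    """Pick a read mode from a rough estimate of the index size.

    Assumption: about one index row per (file, column) pair.
    """
    if num_files < 0 or num_columns < 0:
        raise ValueError("counts must be non-negative")
    estimated_rows = num_files * num_columns
    # Only go through the engine when the index is large enough that
    # parallel reading can outweigh the fixed scheduling cost.
    if estimated_rows > threshold:
        return ReadMode.ENGINE
    return ReadMode.IN_MEMORY
```

With these placeholder numbers, a table with a handful of files stays on the in-memory path, while one with 100s of 1000s of files and 100s of columns is routed to the engine.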
--
This message was sent by Atlassian Jira
(v8.20.7#820007)