[ 
https://issues.apache.org/jira/browse/HUDI-4250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-4250:
---------------------------------
    Labels: pull-request-available  (was: )

> Optimize Data Skipping to enable in-memory Column Stats Index 
> --------------------------------------------------------------
>
>                 Key: HUDI-4250
>                 URL: https://issues.apache.org/jira/browse/HUDI-4250
>             Project: Apache Hudi
>          Issue Type: Bug
>            Reporter: Alexey Kudinkin
>            Assignee: Alexey Kudinkin
>            Priority: Blocker
>              Labels: pull-request-available
>             Fix For: 0.12.0
>
>
> Executing on Spark carries a non-trivial amount of overhead, so it is only 
> worthwhile when parallelizing the execution promises a considerable 
> speed-up.
> For the Data Skipping sequence reading the Column Stats Index, that overhead 
> can be justified only for *very large* tables (100s of 1000s of files, 100s 
> of columns). 
> We therefore have to provide an alternative way of fetching the Column Stats 
> Index within the reading process itself, avoiding the penalty of scheduling 
> a more heavy-weight execution through the Spark engine.
> Together with HUDI-4202, this will considerably speed up Data Skipping, 
> which currently incurs an overhead of *at least* 1-2s even for tables with 
> a handful of files.
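
The trade-off described above can be sketched as a simple size-based choice between the two read paths. This is only an illustration under stated assumptions, not Hudi's implementation: the names `TableShape`, `choose_read_mode`, and both threshold constants are hypothetical, and the cut-off values merely echo the rough magnitudes quoted in the description.

```python
from dataclasses import dataclass

# Hypothetical cut-offs, for illustration only. The issue gives rough
# magnitudes ("100s of 1000s of files, 100s of columns"), not thresholds.
FILE_COUNT_THRESHOLD = 100_000
COLUMN_COUNT_THRESHOLD = 100


@dataclass(frozen=True)
class TableShape:
    """Rough size of the table whose Column Stats Index is being read."""
    file_count: int
    column_count: int


def choose_read_mode(shape: TableShape) -> str:
    """Return "engine" when the table is large enough that distributed
    execution can amortize its scheduling overhead, else "in-memory"."""
    is_very_large = (
        shape.file_count >= FILE_COUNT_THRESHOLD
        and shape.column_count >= COLUMN_COUNT_THRESHOLD
    )
    return "engine" if is_very_large else "in-memory"


if __name__ == "__main__":
    # A table with a handful of files stays within the reading process.
    print(choose_read_mode(TableShape(file_count=12, column_count=20)))
    # A very large table is handed to the distributed engine.
    print(choose_read_mode(TableShape(file_count=500_000, column_count=300)))
```

Requiring both dimensions to be large is an assumption of this sketch; a real cut-off would more likely be tuned against measured scheduling overhead.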



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
