[jira] [Created] (HUDI-4250) Optimize Data Skipping to enable in-memory Column Stats Index

Alexey Kudinkin (Jira) Mon, 13 Jun 2022 17:41:05 -0700

Alexey Kudinkin created HUDI-4250:
-------------------------------------

             Summary: Optimize Data Skipping to enable in-memory Column Stats Index
                 Key: HUDI-4250
                 URL: https://issues.apache.org/jira/browse/HUDI-4250
             Project: Apache Hudi
          Issue Type: Bug
            Reporter: Alexey Kudinkin
            Assignee: Alexey Kudinkin
             Fix For: 0.12.0



Executing on Spark carries a non-trivial amount of overhead, so it only pays 
off when parallelizing the execution has the potential for a considerable 
speed-up.

In the case of Data Skipping, reading the Column Stats Index through Spark can 
only be justified for *very large* tables (hundreds of thousands of files, 
hundreds of columns). 

As such, we have to provide an alternative way of fetching the Column Stats 
Index within the reading process itself, to avoid the penalty of scheduling a 
more heavyweight execution through the Spark engine.
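
To illustrate the idea, here is a minimal sketch (not Hudi's actual 
implementation) of choosing between an in-memory and an engine-based read of 
the index, followed by min/max pruning done entirely in process. All names 
(`ColumnStats`, `choose_read_mode`, `prune_files`) and the threshold value are 
hypothetical and chosen for illustration only; the issue does not specify how 
the cut-over would be decided.

```python
from dataclasses import dataclass
from typing import Iterable, List

# Hypothetical cut-over point (files x columns); not a real Hudi config value.
IN_MEMORY_MAX_STATS_ENTRIES = 100_000


@dataclass(frozen=True)
class ColumnStats:
    """Min/max statistics of one column within one data file."""
    file_name: str
    column: str
    min_value: int
    max_value: int


def choose_read_mode(num_files: int, num_columns: int,
                     threshold: int = IN_MEMORY_MAX_STATS_ENTRIES) -> str:
    """Pick "in-memory" for small indexes, "engine" (e.g. Spark) otherwise."""
    entries = num_files * num_columns
    return "in-memory" if entries <= threshold else "engine"


def prune_files(all_files: Iterable[str], stats: Iterable[ColumnStats],
                column: str, lower: int, upper: int) -> List[str]:
    """Drop only files whose stats prove no overlap with [lower, upper].

    Files without stats for the column are kept, since they cannot be
    safely skipped.
    """
    excluded = {
        s.file_name
        for s in stats
        if s.column == column and (s.max_value < lower or s.min_value > upper)
    }
    return [f for f in all_files if f not in excluded]
```

For a table with a handful of files, `choose_read_mode` would select the 
in-memory path, and `prune_files` can then filter candidate files without 
scheduling any distributed job.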

This, along with HUDI-4202, will considerably speed up Data Skipping, which 
currently has an overhead of *at least* 1-2s even for tables with a handful of 
files.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
