Alexey Kudinkin created HUDI-4250:
-------------------------------------

             Summary: Optimize Data Skipping to enable in-memory Column Stats Index
                 Key: HUDI-4250
                 URL: https://issues.apache.org/jira/browse/HUDI-4250
             Project: Apache Hudi
          Issue Type: Bug
            Reporter: Alexey Kudinkin
            Assignee: Alexey Kudinkin
             Fix For: 0.12.0


Executing on Spark carries a non-trivial amount of overhead, so it is only 
worthwhile when parallelizing the execution has the potential for a 
considerable speed-up.

In the case of the Data Skipping sequence reading the Column Stats Index, that 
overhead can only be justified for *very large* tables (100s of 1000s of files, 
100s of columns).

As such, we have to provide an alternative way of fetching the Column Stats 
Index within the reading process itself, to avoid the penalty of scheduling 
the more heavy-weight execution through the Spark engine.
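
To illustrate the idea, here is a minimal sketch of what picking between the two paths could look like. It is not Hudi code: the names `choose_processing_mode`, `candidate_files`, `ColumnStats` and the `IN_MEMORY_ROW_THRESHOLD` cutoff are hypothetical, and the assumption is that the index holds roughly one row per (file, column) pair with min/max values.

```python
from dataclasses import dataclass

# Hypothetical cutoff, not a Hudi default: below this many index rows,
# reading in-process is assumed cheaper than scheduling an engine job.
IN_MEMORY_ROW_THRESHOLD = 100_000


@dataclass(frozen=True)
class ColumnStats:
    """One (hypothetical) index row: value range of a column in a file."""
    file_name: str
    column: str
    min_value: int
    max_value: int


def choose_processing_mode(num_files: int, num_columns: int,
                           threshold: int = IN_MEMORY_ROW_THRESHOLD) -> str:
    """Pick 'in-memory' or 'engine' from the estimated index size."""
    # Assumption: roughly one index row per (file, indexed column) pair.
    estimated_rows = num_files * num_columns
    return "in-memory" if estimated_rows <= threshold else "engine"


def candidate_files(stats, column, lower, upper):
    """In-process data skipping: keep files whose [min, max] range for
    `column` overlaps the queried range [lower, upper]."""
    return sorted({
        s.file_name for s in stats
        if s.column == column and s.min_value <= upper and s.max_value >= lower
    })


stats = [
    ColumnStats("f1.parquet", "ts", 0, 99),
    ColumnStats("f2.parquet", "ts", 100, 199),
    ColumnStats("f3.parquet", "ts", 200, 299),
]
mode = choose_processing_mode(num_files=len(stats), num_columns=1)
selected = candidate_files(stats, "ts", lower=150, upper=250)
```

For a table with a handful of files the sketch stays in-process and prunes files with a plain filter; only when the estimated index size crosses the cutoff would it fall back to the engine.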

This, along with HUDI-4202, will considerably speed up Data Skipping, which 
currently has an overhead of *at least* 1-2s even for tables with a handful of 
files.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
