[jira] [Created] (HUDI-1157) Optimization whether to query Bootstrapped table using HoodieBootstrapRelation vs Sparks Parquet datasource

Udit Mehrotra (Jira) Thu, 06 Aug 2020 17:03:27 -0700

Udit Mehrotra created HUDI-1157:
-----------------------------------

             Summary: Optimization whether to query Bootstrapped table using 
HoodieBootstrapRelation vs Sparks Parquet datasource
                 Key: HUDI-1157
                 URL: https://issues.apache.org/jira/browse/HUDI-1157
             Project: Apache Hudi
          Issue Type: Sub-task
          Components: bootstrap
            Reporter: Udit Mehrotra



This has been discussed in 
[https://github.com/apache/hudi/pull/1702#discussion_r466317612]

As of now, when querying through the *DataSource*, we check whether the table 
has been bootstrapped by looking for the *bootstrap base path* in the 
*hoodie.properties* file, and based on that we query the table using either 
*HoodieBootstrapRelation* or the *Spark Parquet Data Source*. However, there 
could be a scenario where all the files in the originally bootstrapped table 
have either been *upserted/deleted*, and thus have been fully bootstrapped, 
with their data moved over to the target hoodie table. For such tables, we can 
start querying them using the *Spark Parquet Data Source*, which will be faster 
with all of Spark's optimizations.
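
To make the current behaviour concrete, here is a minimal sketch of that 
decision in Python. It is illustrative only and is not the Hudi code path: the 
helper names are hypothetical, the property key `hoodie.bootstrap.base.path` is 
an assumption about how the bootstrap base path is stored in 
*hoodie.properties*, and the returned strings are just labels for the two query 
paths.

```python
# Hypothetical sketch of the current check: the relation is chosen purely
# from the presence of a bootstrap base path in hoodie.properties.

# Assumed key name; verify against the table config of your Hudi version.
BOOTSTRAP_BASE_PATH_KEY = "hoodie.bootstrap.base.path"


def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and '#' comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def select_relation(props):
    """Return a label for the query path chosen by the current logic."""
    if props.get(BOOTSTRAP_BASE_PATH_KEY):
        return "HoodieBootstrapRelation"
    return "SparkParquetDataSource"
```

With this logic a table keeps using *HoodieBootstrapRelation* for as long as the 
property is set, regardless of how many of its files still depend on the 
bootstrap source.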

So, basically, we need a way to check whether all of the files have been fully 
bootstrapped and moved over to the target location.
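
One possible shape for such a check is sketched below in Python. This is only 
an assumption about how it could work, not a proposed implementation: 
`BaseFile`, `bootstrap_source_path`, `is_fully_bootstrapped` and 
`select_relation_with_check` are hypothetical names, and the sketch assumes 
that the latest base file of each file group can tell us whether it still 
points at a file in the bootstrap source location.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class BaseFile:
    """Hypothetical view of the latest base file of one file group."""
    path: str
    # Path of the source file this base file still depends on, or None once
    # the data has been rewritten into the target table (assumption).
    bootstrap_source_path: Optional[str] = None


def is_fully_bootstrapped(latest_base_files: Iterable[BaseFile]) -> bool:
    """True when no latest base file still references the bootstrap source."""
    return all(f.bootstrap_source_path is None for f in latest_base_files)


def select_relation_with_check(has_bootstrap_base_path: bool,
                               latest_base_files: Iterable[BaseFile]) -> str:
    """Use the bootstrap relation only while some file still needs it."""
    if has_bootstrap_base_path and not is_fully_bootstrapped(latest_base_files):
        return "HoodieBootstrapRelation"
    return "SparkParquetDataSource"
```

The sketch scans every latest base file, which may itself be costly on large 
tables; where that information would come from cheaply is left open here.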



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
