[jira] [Updated] (HUDI-1879) Spark DataSource tables/HoodieFileIndex issues for Merge On Read

Udit Mehrotra (Jira) Thu, 06 May 2021 21:30:11 -0700


     [ 
https://issues.apache.org/jira/browse/HUDI-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Udit Mehrotra updated HUDI-1879:
--------------------------------
    Description: 
The *Read as DataSource Tables* and *HoodieFileIndex* implementations that 
went in with [https://github.com/apache/hudi/pull/2283] and 
[https://github.com/apache/hudi/pull/2651] have introduced a couple of major 
regressions for *Merge on Read* tables:
 * *_ro tables return snapshot results*: Since the Hudi DataSource is now used 
directly to query both *_ro* and *_rt* MOR tables, it cannot distinguish read 
optimized tables from real time tables, because it has no way to check the 
*table name*. In both cases *QUERY_TYPE_OPT_KEY* defaults to *snapshot*, so 
*MergeOnReadSnapshotRelation* is used for the query and snapshot results are 
always returned.
 * *Partition pruning does not work for realtime queries*: 
*MergeOnReadSnapshotRelation* uses *allFiles* directly, so it always fetches 
all the files without any partition pruning. This is a regression for Spark 
SQL real time queries, because partition pruning previously worked via the 
InputFormat for these queries. It will therefore hurt the performance of rt 
queries.
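
The two regressions described in the list above can be illustrated with a 
small toy model. This is only a sketch under stated assumptions, not Hudi's 
actual code: the helper functions and the file listing are hypothetical, the 
relation is represented by a plain string, and the option key is assumed to 
be the string behind QUERY_TYPE_OPT_KEY.

```python
# Toy model of the two regressions; this is not Hudi's actual code.

# Assumed to be the option key behind QUERY_TYPE_OPT_KEY.
QUERY_TYPE_KEY = "hoodie.datasource.query.type"
DEFAULT_QUERY_TYPE = "snapshot"


def choose_relation(options):
    """Pick a relation from the read options alone (hypothetical helper).

    The table name (_ro or _rt) is deliberately not an input, mirroring the
    issue: the DataSource cannot see which table was queried.
    """
    query_type = options.get(QUERY_TYPE_KEY, DEFAULT_QUERY_TYPE)
    if query_type == "snapshot":
        return "MergeOnReadSnapshotRelation"
    if query_type == "read_optimized":
        return "read optimized relation (base files only)"
    raise ValueError(f"unsupported query type: {query_type}")


def partition_of(path):
    """Return the partition directory of a relative file path."""
    return path.rsplit("/", 1)[0]


def files_without_pruning(all_files, wanted_partition):
    """Mimic the regression: the partition filter is ignored."""
    return list(all_files)


def files_with_pruning(all_files, wanted_partition):
    """Expected behaviour: keep only files under the wanted partition."""
    return [f for f in all_files if partition_of(f) == wanted_partition]


# Made-up file listing for a table partitioned by date.
ALL_FILES = [
    "2021/05/05/base_1.parquet",
    "2021/05/05/.log_1",
    "2021/05/06/base_2.parquet",
]

if __name__ == "__main__":
    # Neither table passes a query type, so both fall back to snapshot.
    print(choose_relation({}))
    # An explicit option is the only signal this toy model can act on.
    print(choose_relation({QUERY_TYPE_KEY: "read_optimized"}))
    # The unpruned listing returns every file, whatever the filter says.
    print(len(files_without_pruning(ALL_FILES, "2021/05/06")))
    print(files_with_pruning(ALL_FILES, "2021/05/06"))
```

In this model an empty options map stands for both the *_ro* and the *_rt* 
table, which is why both end up on the snapshot path, and the unpruned 
listing scans every partition even when the query filters on a single one.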

  was:
*HoodieFileIndex* implementation that went in 
[https://github.com/apache/hudi/pull/2651] has introduced a couple of major 
regressions for *Merge on Read* tables:
 * *_ro* *tables returning Snapshot results*: Since we are directly using 
DataSource now to query both


> Spark DataSource tables/HoodieFileIndex issues for Merge On Read
> ----------------------------------------------------------------
>
>                 Key: HUDI-1879
>                 URL: https://issues.apache.org/jira/browse/HUDI-1879
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: Spark Integration
>            Reporter: Udit Mehrotra
>            Priority: Blocker
>              Labels: sev:critical
>
> The *Read as DataSource Tables* and *HoodieFileIndex* implementations that 
> went in with [https://github.com/apache/hudi/pull/2283] and 
> [https://github.com/apache/hudi/pull/2651] have introduced a couple of major 
> regressions for *Merge on Read* tables:
>  * *_ro tables return snapshot results*: Since the Hudi DataSource is now 
> used directly to query both *_ro* and *_rt* MOR tables, it cannot 
> distinguish read optimized tables from real time tables, because it has no 
> way to check the *table name*. In both cases *QUERY_TYPE_OPT_KEY* defaults 
> to *snapshot*, so *MergeOnReadSnapshotRelation* is used for the query and 
> snapshot results are always returned.
>  * *Partition pruning does not work for realtime queries*: 
> *MergeOnReadSnapshotRelation* uses *allFiles* directly, so it always fetches 
> all the files without any partition pruning. This is a regression for Spark 
> SQL real time queries, because partition pruning previously worked via the 
> InputFormat for these queries. It will therefore hurt the performance of rt 
> queries.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
