[jira] [Comment Edited] (HUDI-1371) Implement Spark datasource by fetching file listing from metadata table

Vinoth Chandar (Jira) Thu, 03 Dec 2020 15:25:07 -0800


    [ 
https://issues.apache.org/jira/browse/HUDI-1371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17243574#comment-17243574
 ]


Vinoth Chandar edited comment on HUDI-1371 at 12/3/20, 11:24 PM:
-----------------------------------------------------------------

||Engine||Table Type||Listing Mechanism||
|Spark SQL on Hive|COW|a) With path filter: parallel listing (we are good here) + 
native vectorized reader.
 Open: how do we keep Spark from listing again? 
 b) With convertMetastoreParquet=false: Spark calls `ipf.getSplits()` to 
list the files and uses Hive's parquet record reader.|
|Spark SQL on Hive|MOR|With convertMetastoreParquet=false: Spark calls 
`ipf.getSplits()` to list the files and uses Hudi's record reader.|
|Spark Datasource|COW|We wrap the Parquet data source, which does parallel listing 
and uses a path filter to filter down further. 
 Open: can we pass in a HadoopFsRelation to avoid the listing inside the parquet 
data source?|
|Spark Datasource|MOR|MergeOnReadSnapshotRelation (some feature gaps, possibly 
unknown issues).|
|Presto|COW|PathFilter + BackgroundHiveSplitLoader (Presto's own listing mechanism)|
|Presto|MOR| |
|Hive|COW| |
|Hive |MOR| |
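
Several rows above rely on a path filter to narrow a raw listing down to the files a snapshot query should read. The sketch below illustrates that idea in plain Java and is not Hudi's actual path filter implementation. It assumes base files are named `<fileId>_<writeToken>_<instantTime>.parquet` and that instant times are fixed-width timestamps, so string order matches time order. The class and method names (`LatestBaseFileFilter`, `latestPerFileGroup`) are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: a simplified stand-in for a path filter, not Hudi code.
final class LatestBaseFileFilter {

    // Assumed file name layout: <fileId>_<writeToken>_<instantTime>.parquet
    static List<String> latestPerFileGroup(List<String> listedFiles) {
        Map<String, String> latestFileById = new HashMap<>();
        Map<String, String> latestInstantById = new HashMap<>();
        for (String name : listedFiles) {
            if (!name.endsWith(".parquet")) {
                continue; // ignore anything that is not a base file
            }
            String stem = name.substring(0, name.length() - ".parquet".length());
            String[] parts = stem.split("_");
            if (parts.length != 3) {
                continue; // name does not match the assumed layout
            }
            String fileId = parts[0];
            String instant = parts[2];
            String seen = latestInstantById.get(fileId);
            // Assumption: fixed-width instants, so lexicographic order is time order.
            if (seen == null || instant.compareTo(seen) > 0) {
                latestInstantById.put(fileId, instant);
                latestFileById.put(fileId, name);
            }
        }
        List<String> result = new ArrayList<>(latestFileById.values());
        Collections.sort(result); // deterministic order for callers
        return result;
    }

    public static void main(String[] args) {
        List<String> listing = new ArrayList<>();
        listing.add("f1_1-0-1_20201203100000.parquet");
        listing.add("f1_1-0-1_20201203110000.parquet");
        listing.add("f2_1-0-1_20201203100000.parquet");
        System.out.println(latestPerFileGroup(listing));
    }
}
```

Note that the filter only runs after the engine has already listed every file, which is why the listing itself stays the cost to eliminate.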


was (Author: vc):
||Engine||Table Type||Listing Mechanism||
|Spark SQL on Hive|COW|a) With path filter: parallel listing (we are good here) + 
native vectorized reader.
 Open: how do we keep Spark from listing again? 
 b) With convertMetastoreParquet=false: Spark calls `ipf.getSplits()` to 
list the files and uses Hive's parquet record reader.|
|Spark SQL on Hive|MOR|With convertMetastoreParquet=false: Spark calls 
`ipf.getSplits()` to list the files and uses Hudi's record reader.|
|Spark Datasource|COW|We wrap the Parquet data source, which does parallel listing 
and uses a path filter to filter down further. 
 Open: can we pass in a HadoopFsRelation to avoid the listing inside the parquet 
data source?|
|Spark Datasource|MOR| MergeOnReadSnapshotRelation|
|Presto|COW| |
|Presto|MOR| |
|Hive|COW| |
|Hive |MOR| |

> Implement Spark datasource by fetching file listing from metadata table
> -----------------------------------------------------------------------
>
>                 Key: HUDI-1371
>                 URL: https://issues.apache.org/jira/browse/HUDI-1371
>             Project: Apache Hudi
>          Issue Type: Sub-task
>          Components: Spark Integration
>            Reporter: Vinoth Chandar
>            Assignee: Udit Mehrotra
>            Priority: Blocker
>             Fix For: 0.7.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)
