[jira] [Updated] (HUDI-1307) spark datasource load path format is confused for snapshot and increment read mode

liwei (Jira) Tue, 29 Sep 2020 23:35:23 -0700


     [ 
https://issues.apache.org/jira/browse/HUDI-1307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


liwei updated HUDI-1307:
------------------------
    Description: 
When reading a Hudi table through the Spark datasource:

1. Snapshot mode

val readHudi = spark.read.format("org.apache.hudi").load(basePath + "/*");

The path must end with "/*", otherwise the load fails. 
org.apache.hudi.DefaultSource.createRelation() uses fs.globStatus(), so without 
"/*" the glob does not return the .hoodie and default directories:

val globPaths = HoodieSparkUtils.checkAndGlobPathIfNecessary(allPaths, fs)
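
To illustrate why the trailing "/*" changes what a glob returns, here is a minimal sketch. It uses java.nio as a stand-in for Hadoop's fs.globStatus(), so details may differ from what Hudi actually sees. GlobSketch and childrenMatching are hypothetical names, not part of Hudi. Assumes Scala 2.13 or later.

```scala
import java.nio.file.{Files, Path}
import scala.jdk.CollectionConverters._

object GlobSketch {
  // Hypothetical helper, not part of Hudi or Hadoop.
  // Returns the sorted names of the entries directly under `dir`
  // whose file name matches `glob`.
  def childrenMatching(dir: Path, glob: String): List[String] = {
    val stream = Files.newDirectoryStream(dir, glob)
    try stream.asScala.map(_.getFileName.toString).toList.sorted
    finally stream.close()
  }
}
```

Globbing "*" inside a table directory lists its children, such as .hoodie and default, while globbing the directory's own name from its parent matches only the directory itself.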

 

2. Incremental mode

Both basePath and basePath + "/*" work, because in 
org.apache.hudi.DefaultSource, DataSourceUtils.getTablePath supports both 
formats.

val incViewDF = spark.read.format("org.apache.hudi").
 option(QUERY_TYPE_OPT_KEY, QUERY_TYPE_INCREMENTAL_OPT_VAL).
 option(BEGIN_INSTANTTIME_OPT_KEY, beginTime).
 option(END_INSTANTTIME_OPT_KEY, endTime).
 load(basePath)

 

val incViewDF = spark.read.format("org.apache.hudi").
 option(QUERY_TYPE_OPT_KEY, QUERY_TYPE_INCREMENTAL_OPT_VAL).
 option(BEGIN_INSTANTTIME_OPT_KEY, beginTime).
 option(END_INSTANTTIME_OPT_KEY, endTime).
 load(basePath + "/*")
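
One way to make both path forms equivalent is to normalize the path before using it. The sketch below is only an illustration of that idea. It is not how DataSourceUtils.getTablePath is implemented, and BasePathSketch and normalizeBasePath are hypothetical names.

```scala
object BasePathSketch {
  // Hypothetical helper, not part of Hudi.
  // Drops trailing glob segments ("/*", "/*/*", ...) and trailing slashes,
  // so that basePath and basePath + "/*" name the same table root.
  def normalizeBasePath(path: String): String = {
    @annotation.tailrec
    def strip(p: String): String =
      if (p.length > 2 && p.endsWith("/*")) strip(p.dropRight(2))
      else if (p.length > 1 && p.endsWith("/")) strip(p.dropRight(1))
      else p
    strip(path)
  }
}
```

With a helper like this, "/tmp/hudi_table", "/tmp/hudi_table/*" and "/tmp/hudi_table/*/*" all resolve to the same string, so a reader could accept either form in every query mode.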

 

Because incremental mode and snapshot mode behave differently, users get 
confused. Loading with basePath + "/*" or basePath + "/*/*" is also confusing. 
I know this is to support partitions, but I think this API would be clearer 
for users:

val partition = "year = '2019'"

spark.read.format("hudi").load(path).where(partition)

 



> spark datasource load path format is confused for snapshot and increment read 
> mode
> ----------------------------------------------------------------------------------
>
>                 Key: HUDI-1307
>                 URL: https://issues.apache.org/jira/browse/HUDI-1307
>             Project: Apache Hudi
>          Issue Type: Improvement
>          Components: Spark Integration
>            Reporter: liwei
>            Assignee: liwei
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
