[
https://issues.apache.org/jira/browse/HUDI-1307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
liwei updated HUDI-1307:
------------------------
Description:
When reading a Hudi table through the Spark datasource, the load path is
handled differently in each query mode.
1. Snapshot mode
val readHudi = spark.read.format("org.apache.hudi").load(basePath + "/*")
The "/*" suffix is required, otherwise the read fails. createRelation() in
org.apache.hudi.DefaultSource resolves the path with fs.globStatus(), and
without "/*" the glob does not return the .hoodie and default directories:
val globPaths = HoodieSparkUtils.checkAndGlobPathIfNecessary(allPaths, fs)
2. Incremental mode
Both basePath and basePath + "/*" work, because org.apache.hudi.DefaultSource
uses DataSourceUtils.getTablePath, which supports both formats:
val incViewDF = spark.read.format("org.apache.hudi").
option(QUERY_TYPE_OPT_KEY, QUERY_TYPE_INCREMENTAL_OPT_VAL).
option(BEGIN_INSTANTTIME_OPT_KEY, beginTime).
option(END_INSTANTTIME_OPT_KEY, endTime).
load(basePath)
val incViewDF = spark.read.format("org.apache.hudi").
option(QUERY_TYPE_OPT_KEY, QUERY_TYPE_INCREMENTAL_OPT_VAL).
option(BEGIN_INSTANTTIME_OPT_KEY, beginTime).
option(END_INSTANTTIME_OPT_KEY, endTime).
load(basePath + "/*")
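One plausible way for a reader to accept both path forms is to drop any trailing glob segments and then walk up the directory tree until a folder containing .hoodie is found. The sketch below illustrates that idea with java.nio only. It is not Hudi's DataSourceUtils.getTablePath: the object TablePathSketch and the helper resolveTablePath are hypothetical names, and POSIX-style "/" separators are assumed.

```scala
import java.nio.file.{Files, Path, Paths}

object TablePathSketch {
  // Hypothetical helper, NOT Hudi's implementation.
  // Accepts either a table base path or a globbed path such as basePath + "/*"
  // and returns the first ancestor directory that contains a ".hoodie" folder.
  def resolveTablePath(userPath: String): Option[Path] = {
    // Drop trailing segments that are empty or contain a glob wildcard.
    val segments = userPath.split("/").toList
    val withoutGlobs =
      segments.reverse.dropWhile(s => s.isEmpty || s.contains("*")).reverse
    val start = Paths.get(withoutGlobs.mkString("/"))

    // Walk from the cleaned path towards the root, stopping at the table root.
    Iterator.iterate(start)(_.getParent)
      .takeWhile(_ != null)
      .find(p => Files.isDirectory(p.resolve(".hoodie")))
  }
}
```

With a helper along these lines, basePath, basePath + "/*" and a partition directory under basePath would all resolve to the same table root, which is the consistency this issue asks for in snapshot mode.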
Because incremental mode and snapshot mode behave differently, users get
confused. Loading with basePath + "/*" or "/***/*" is also confusing. I
understand the glob is there to support partitions, but I think the following
API would be clearer for users:
val partition = "year = '2019'"
spark.read.format("hudi").load(path).where(partition)
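The same load-then-filter pattern already works for plain partitioned Parquet in Spark, which may help show what the proposal would look like in practice. The sketch below uses the "parquet" format as a stand-in, since whether format("hudi") accepts a bare base path is exactly what this issue is about. The object PartitionFilterSketch, its helpers, and the sample data are made up for illustration, and spark-sql is assumed to be on the classpath.

```scala
import java.nio.file.Files
import org.apache.spark.sql.{DataFrame, SparkSession}

object PartitionFilterSketch {
  // Writes a tiny table partitioned by "year" and returns its base path.
  def writeSample(spark: SparkSession): String = {
    import spark.implicits._
    val path = Files.createTempDirectory("partitioned_tbl").resolve("data").toString
    Seq(("a", 2018), ("b", 2019), ("c", 2019))
      .toDF("id", "year")
      .write
      .partitionBy("year")
      .parquet(path)
    path
  }

  // Loads the base path with no glob suffix and selects partitions with a filter.
  def readPartition(spark: SparkSession, path: String, partition: String): DataFrame =
    spark.read.format("parquet").load(path).where(partition)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("partition-filter-sketch")
      .getOrCreate()
    val path = writeSample(spark)
    readPartition(spark, path, "year = '2019'").show()
    spark.stop()
  }
}
```

Here the caller never needs to know how many directory levels the partitioning adds, because the partition column is discovered from the directory layout and the filter is expressed on that column.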
> Spark datasource load path format is confusing for snapshot and incremental
> read modes
> ----------------------------------------------------------------------------------
>
> Key: HUDI-1307
> URL: https://issues.apache.org/jira/browse/HUDI-1307
> Project: Apache Hudi
> Issue Type: Improvement
> Components: Spark Integration
> Reporter: liwei
> Assignee: liwei
> Priority: Major
>