[
https://issues.apache.org/jira/browse/HUDI-1307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17208600#comment-17208600
]
Vu Ho edited comment on HUDI-1307 at 10/6/20, 9:24 AM:
-------------------------------------------------------
Hi [~309637554] and [~vinoth],
I'm able to query a Hudi table (SNAPSHOT_QUERY) using just the base path. The
trick is to use Hive-style partitioning, i.e. part=value directory names, so
that Spark can discover the partitions automatically.
Using plain values as partition paths (e.g. 2020/12/12) has a few drawbacks.
First, globbing the files can be quite expensive. Second, Spark cannot infer
the partition schema, so queries will not use partition filters.
{code:java}
val cow = spark.read.format("hudi").option(QUERY_TYPE_OPT_KEY,
QUERY_TYPE_SNAPSHOT_OPT_VAL).load(cowTablePathLocal + "/*/*/*/*")
cow.createOrReplaceTempView("cow")
spark.sql("select * from cow where year = 2020").explain
== Physical Plan ==
*(1) Project [_hoodie_commit_time#238, _hoodie_commit_seqno#239,
_hoodie_record_key#240, _hoodie_partition_path#241, _hoodie_file_name#242,
timestamp#243, value#244L, year#245, month#246, day#247, ts#248, dt#249, id#250]
+- *(1) Filter (isnotnull(year#245) && (year#245 = 2020))
+- *(1) FileScan parquet
[_hoodie_commit_time#238,_hoodie_commit_seqno#239,_hoodie_record_key#240,_hoodie_partition_path#241,_hoodie_file_name#242,timestamp#243,value#244L,year#245,month#246,day#247,ts#248,dt#249,id#250]
Batched: true, Format: Parquet, Location:
InMemoryFileIndex[file:/tmp/hoodie/hoodie_streaming_cow/.hoodie/.aux/.bootstrap/.partitions,
file..., PartitionFilters: [], PushedFilters: [IsNotNull(year),
EqualTo(year,2020)], ReadSchema:
struct<_hoodie_commit_time:string,_hoodie_commit_seqno:string,_hoodie_record_key:string,_hoodie_p..
{code}
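For reference, Hive-style partition paths can be produced at write time via the writer config. This is a minimal sketch, not taken from this issue: the config keys are the standard Hudi datasource write options, while the DataFrame {{df}} and the record-key/partition columns ({{id}}, {{year}}, {{month}}, {{day}}, matching the schema in the plan above) are assumed for illustration.
{code:java}
// Sketch: write a table with Hive-style (part=value) partition directories,
// so Spark can later discover partitions straight from the base path.
df.write.format("hudi").
  option("hoodie.table.name", "hoodie_streaming_cow").
  option("hoodie.datasource.write.recordkey.field", "id").
  option("hoodie.datasource.write.partitionpath.field", "year,month,day").
  // this option switches partition dirs from "2020/12/12" to "year=2020/month=12/day=12"
  option("hoodie.datasource.write.hive_style_partitioning", "true").
  mode("append").
  save(cowTablePathLocal)
{code}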
was (Author: vho):
Hi [~309637554] and [~vinoth],
I'm able to query a Hudi table (SNAPSHOT_QUERY) using just the base path. The
trick is to use Hive-style partitioning, i.e. part=value directory names, so
that Spark can discover the partitions automatically.
Using plain values as partition paths (e.g. 2020/12/12) has a few drawbacks.
First, globbing the files can be quite expensive. Second, Spark cannot infer
the partition schema, so queries will not use partition filters.
> Spark datasource load path format is confusing for snapshot and incremental
> read modes
> ----------------------------------------------------------------------------------
>
> Key: HUDI-1307
> URL: https://issues.apache.org/jira/browse/HUDI-1307
> Project: Apache Hudi
> Issue Type: Improvement
> Components: Spark Integration
> Reporter: liwei
> Assignee: liwei
> Priority: Major
>
> When reading a Hudi table through the Spark datasource:
> 1. Snapshot mode
> {code:java}
> val readHudi = spark.read.format("org.apache.hudi").load(basePath + "/*")
> {code}
> The "/*" must be appended, otherwise the load fails: in
> org.apache.hudi.DefaultSource, createRelation() uses fs.globStatus(), and
> without "/*" it will not match the .hoodie and default directories:
> {code:java}
> val globPaths = HoodieSparkUtils.checkAndGlobPathIfNecessary(allPaths,
> fs){code}
>
> 2. Incremental mode
> Both basePath and basePath + "/*" work. This is because in
> org.apache.hudi.DefaultSource, DataSourceUtils.getTablePath supports both
> formats.
> {code:java}
> val incViewDF = spark.read.format("org.apache.hudi").
> option(QUERY_TYPE_OPT_KEY, QUERY_TYPE_INCREMENTAL_OPT_VAL).
> option(BEGIN_INSTANTTIME_OPT_KEY, beginTime).
> option(END_INSTANTTIME_OPT_KEY, endTime).
> load(basePath){code}
>
> {code:java}
> val incViewDF = spark.read.format("org.apache.hudi").
> option(QUERY_TYPE_OPT_KEY, QUERY_TYPE_INCREMENTAL_OPT_VAL).
> option(BEGIN_INSTANTTIME_OPT_KEY, beginTime).
> option(END_INSTANTTIME_OPT_KEY, endTime).
> load(basePath + "/*")
> {code}
>
> Since incremental mode and snapshot mode do not behave the same way, users
> get confused. Having to load with basePath + "/*" or "/*/*/*" is also
> confusing. I know this is needed to support partitions,
> but I think an API like the following would be clearer for users:
>
> {code:java}
> val partition = "year = '2019'"
> spark.read.format("hudi").load(path).where(partition)
> {code}
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)