[ 
https://issues.apache.org/jira/browse/HUDI-1307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17208600#comment-17208600
 ] 

Vu Ho edited comment on HUDI-1307 at 10/6/20, 9:24 AM:
-------------------------------------------------------

Hi [~309637554] and [~vinoth],

I'm able to query a Hudi table (SNAPSHOT_QUERY) using just the base path. The 
trick is to use Hive-style partitioning, i.e. part=value directory names, so 
that Spark can discover the partitions automatically.

Using plain values as partition paths (e.g. 2020/12/12) has a few drawbacks. 
First, globbing the files can be quite expensive. Second, Spark cannot infer 
the partition schema, so queries will not use partition filters.
{code:java}
val cow = spark.read.format("hudi").option(QUERY_TYPE_OPT_KEY, 
QUERY_TYPE_SNAPSHOT_OPT_VAL).load(cowTablePathLocal + "/*/*/*/*")
cow.createOrReplaceTempView("cow")
spark.sql("select * from cow where year = 2020").explain
== Physical Plan ==
*(1) Project [_hoodie_commit_time#238, _hoodie_commit_seqno#239, 
_hoodie_record_key#240, _hoodie_partition_path#241, _hoodie_file_name#242, 
timestamp#243, value#244L, year#245, month#246, day#247, ts#248, dt#249, id#250]
+- *(1) Filter (isnotnull(year#245) && (year#245 = 2020))
   +- *(1) FileScan parquet 
[_hoodie_commit_time#238,_hoodie_commit_seqno#239,_hoodie_record_key#240,_hoodie_partition_path#241,_hoodie_file_name#242,timestamp#243,value#244L,year#245,month#246,day#247,ts#248,dt#249,id#250]
 Batched: true, Format: Parquet, Location: 
InMemoryFileIndex[file:/tmp/hoodie/hoodie_streaming_cow/.hoodie/.aux/.bootstrap/.partitions,
 file..., PartitionFilters: [], PushedFilters: [IsNotNull(year), 
EqualTo(year,2020)], ReadSchema: 
struct<_hoodie_commit_time:string,_hoodie_commit_seqno:string,_hoodie_record_key:string,_hoodie_p..
 {code}
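For reference, the Hive-style part=value layout described above can be produced at write time. This is only a sketch, assuming Hudi's {{hoodie.datasource.write.hive_style_partitioning}} write option (available in recent releases; verify the key against your Hudi version) and a hypothetical DataFrame {{df}} with an {{id}} key column and {{year}}/{{month}}/{{day}} partition columns:
{code:java}
// Sketch (assumptions: df has an "id" record key and year/month/day
// partition columns; the hive_style_partitioning option exists in
// your Hudi version).
import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.config.HoodieWriteConfig.TABLE_NAME
import org.apache.spark.sql.SaveMode

df.write.format("hudi").
  option(RECORDKEY_FIELD_OPT_KEY, "id").
  option(PARTITIONPATH_FIELD_OPT_KEY, "year,month,day").
  // emit year=2020/month=12/day=12 directories instead of 2020/12/12
  option("hoodie.datasource.write.hive_style_partitioning", "true").
  option(TABLE_NAME, "cow").
  mode(SaveMode.Append).
  save(cowTablePathLocal)
{code}
With this layout the table can then be loaded from the base path alone and Spark's partition discovery fills in the partition columns.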



> spark datasource load path format is confused for snapshot and increment read 
> mode
> ----------------------------------------------------------------------------------
>
>                 Key: HUDI-1307
>                 URL: https://issues.apache.org/jira/browse/HUDI-1307
>             Project: Apache Hudi
>          Issue Type: Improvement
>          Components: Spark Integration
>            Reporter: liwei
>            Assignee: liwei
>            Priority: Major
>
> When reading a Hudi table through the Spark datasource:
> 1. snapshot mode
> {code:java}
> val readHudi = spark.read.format("org.apache.hudi").load(basePath + "/*")
> {code}
> The "/*" must be appended, otherwise the read fails: 
> org.apache.hudi.DefaultSource.createRelation() uses fs.globStatus(), and 
> without "/*" the glob does not pick up the .hoodie and default directories:
> {code:java}
> val globPaths = HoodieSparkUtils.checkAndGlobPathIfNecessary(allPaths, fs)
> {code}
>  
> 2. incremental mode
> Both basePath and basePath + "/*" work, because 
> DataSourceUtils.getTablePath in org.apache.hudi.DefaultSource supports 
> both formats.
> {code:java}
>  val incViewDF = spark.read.format("org.apache.hudi").
>  option(QUERY_TYPE_OPT_KEY, QUERY_TYPE_INCREMENTAL_OPT_VAL).
>  option(BEGIN_INSTANTTIME_OPT_KEY, beginTime).
>  option(END_INSTANTTIME_OPT_KEY, endTime).
>  load(basePath){code}
>  
> {code:java}
>  val incViewDF = spark.read.format("org.apache.hudi").
>  option(QUERY_TYPE_OPT_KEY, QUERY_TYPE_INCREMENTAL_OPT_VAL).
>  option(BEGIN_INSTANTTIME_OPT_KEY, beginTime).
>  option(END_INSTANTTIME_OPT_KEY, endTime).
>  load(basePath + "/*")
>  {code}
>  
> Since incremental mode and snapshot mode do not accept the same path 
> format, users get confused. Passing basePath + "/*" or "/*/*/*" to load() 
> is also confusing. I know this is done to support partitions, but I think 
> this API would be clearer for users:
>  
> {code:java}
> val partition = "year = '2019'"
> spark.read.format("hudi").load(path).where(partition)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
