[
https://issues.apache.org/jira/browse/HUDI-1392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
steven zhang updated HUDI-1392:
-------------------------------
Description:
Steps to reproduce the issue:
Set hoodie.datasource.write.hive_style_partitioning to true, then run:
spark.read().format("org.apache.hudi").option("mergeSchema", true).option("basePath", tablePath).load(tablePath + (nonPartitionedTable ? "/*" : "/*")).createOrReplaceTempView(hudiTable);
spark.sql("select * from hudiTable where date>'20200807'").explain();
The plan prints an empty filter list: PartitionFilters: []
The reason is:
step 1. Spark resolves the data source and passes a CaseInsensitiveMap to createRelation
(https://github.com/apache/spark/blob/v2.4.4/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala
L317):
case (dataSource: RelationProvider, None) =>
  dataSource.createRelation(sparkSession.sqlContext, caseInsensitiveOptions)
  // caseInsensitiveOptions is of type CaseInsensitiveMap
step 2. Hudi creates the relation:
org.apache.hudi.DefaultSource#createRelation(sqlContext: SQLContext, optParams: Map[String, String], schema: StructType): BaseRelation = {
// optParams is a CaseInsensitiveMap here, but the result of "Map ++" is a plain Map,
// so parameters is no longer case-insensitive
val parameters = Map(QUERY_TYPE_OPT_KEY -> DEFAULT_QUERY_TYPE_OPT_VAL) ++ translateViewTypesToQueryTypes(optParams)
step 3. Hudi switches to a parquet relation. When a copy-on-write (COW) table is queried, it calls
getBaseFileOnlyView(sqlContext, parameters, schema, readPaths, isBootstrappedTable, globPaths, metaClient)
which creates a new DataSource and relation instance with:
DataSource.apply(sparkSession = sqlContext.sparkSession, paths = extraReadPaths, userSpecifiedSchema = Option(schema), className = "parquet", options = optParams).resolveRelation()
step 4. Spark fetches basePath to infer the partition info
(https://github.com/apache/spark/blob/v2.4.4/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileIndex.scala
L196):
// parameters comes from DataSource#options (a plain Map)
parameters.get(BASE_PATH_PARAM)
So parameters.get(BASE_PATH_PARAM) calls Map#get, not CaseInsensitiveMap#get. The map stores the key as "basepath", so get("basePath") returns None.
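The lookup failure in steps 1 to 4 can be reproduced without Spark. The following is a minimal sketch, not Spark's or Hudi's code: CaseInsensitiveOptions is a hypothetical stand-in that, like the CaseInsensitiveMap described above, stores lower-cased keys and lower-cases the key on lookup, and mergeIntoPlainMap is a hypothetical helper that mimics "Map(default) ++ optParams". The "query.type" key and its value are placeholders.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

class BasePathLookupSketch {

    /** Hypothetical stand-in for a case-insensitive options map: keys are stored lower-cased. */
    static final class CaseInsensitiveOptions {
        private final Map<String, String> lowerCased = new HashMap<>();

        CaseInsensitiveOptions(Map<String, String> original) {
            original.forEach((k, v) -> lowerCased.put(k.toLowerCase(Locale.ROOT), v));
        }

        /** Case-insensitive lookup: the requested key is lower-cased first. */
        String get(String key) {
            return lowerCased.get(key.toLowerCase(Locale.ROOT));
        }

        /** Iterating the entries exposes the lower-cased keys, not the original ones. */
        Map<String, String> entries() {
            return lowerCased;
        }
    }

    /** Mimics "Map(default) ++ caseInsensitiveOptions": the result is a plain map. */
    static Map<String, String> mergeIntoPlainMap(Map<String, String> defaults,
                                                 CaseInsensitiveOptions opts) {
        Map<String, String> merged = new HashMap<>(defaults);
        merged.putAll(opts.entries());
        return merged;
    }

    public static void main(String[] args) {
        Map<String, String> user = new HashMap<>();
        user.put("basePath", "/tmp/hudi_table");
        CaseInsensitiveOptions opts = new CaseInsensitiveOptions(user);

        Map<String, String> defaults = new HashMap<>();
        defaults.put("query.type", "snapshot"); // placeholder key and value

        Map<String, String> parameters = mergeIntoPlainMap(defaults, opts);

        System.out.println(opts.get("basePath"));       // found: lookup is case-insensitive
        System.out.println(parameters.get("basePath")); // null: the plain map only has "basepath"
        System.out.println(parameters.get("basepath")); // found under the lower-cased key
    }
}
```

The second lookup corresponds to parameters.get(BASE_PATH_PARAM) in step 4: the value is still in the map, but only under the lower-cased key.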
This is a Spark bug (fixed in version 3.0.1,
https://issues.apache.org/jira/browse/SPARK-32368); Hudi currently uses Spark
v2.4.4.
To avoid this Spark issue, a simple solution is to not convert the type of
the input optParams (Spark has already made it a CaseInsensitiveMap) in
org.apache.hudi.DefaultSource#createRelation(sqlContext: SQLContext,optParams:
Map[String, String]…
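One way to picture "not converting the type" is to add the default onto the case-insensitive map instead of merging that map into a plain one. The sketch below only illustrates the idea and is not the actual Hudi change: CaseInsensitiveOptions, withDefault and resolveBasePath are hypothetical names, and "query.type" is a placeholder key.

```java
import java.util.AbstractMap;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

class KeepCaseInsensitiveSketch {

    /** Hypothetical case-insensitive options map that can be passed wherever a Map is expected. */
    static final class CaseInsensitiveOptions extends AbstractMap<String, String> {
        private final Map<String, String> lowerCased = new HashMap<>();

        CaseInsensitiveOptions(Map<String, String> original) {
            original.forEach((k, v) -> lowerCased.put(k.toLowerCase(Locale.ROOT), v));
        }

        @Override
        public String get(Object key) {
            return key instanceof String
                ? lowerCased.get(((String) key).toLowerCase(Locale.ROOT))
                : null;
        }

        @Override
        public boolean containsKey(Object key) {
            return key instanceof String
                && lowerCased.containsKey(((String) key).toLowerCase(Locale.ROOT));
        }

        @Override
        public Set<Map.Entry<String, String>> entrySet() {
            return lowerCased.entrySet();
        }

        /** Returns a new case-insensitive map; the default is added only if the key is absent. */
        CaseInsensitiveOptions withDefault(String key, String value) {
            Map<String, String> copy = new HashMap<>(lowerCased);
            copy.putIfAbsent(key.toLowerCase(Locale.ROOT), value);
            return new CaseInsensitiveOptions(copy);
        }
    }

    /** Hypothetical downstream consumer that only sees a Map, like the lookup in step 4. */
    static String resolveBasePath(Map<String, String> parameters) {
        return parameters.get("basePath");
    }

    public static void main(String[] args) {
        Map<String, String> user = new HashMap<>();
        user.put("basePath", "/tmp/hudi_table");

        Map<String, String> parameters =
            new CaseInsensitiveOptions(user).withDefault("query.type", "snapshot");

        // The consumer asks for "basePath" and still finds the value stored as "basepath".
        System.out.println(resolveBasePath(parameters));
    }
}
```

Because the map keeps its own get, the downstream lookup no longer depends on the consumer handling key case itself.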
was:
Steps to reproduce the issue:
Set hoodie.datasource.write.hive_style_partitioning to true, then run:
spark.read().format("org.apache.hudi").option("mergeSchema", true).option("basePath", tablePath).load(tablePath + (nonPartitionedTable ? "/*" : "/*")).createOrReplaceTempView(hudiTable);
spark.sql("select * from hudiTable where date>'20200807'").explain();
The plan prints an empty filter list: PartitionFilters: []
The cause of this issue is that org.apache.hudi.DefaultSource#createRelation is called by
dataSource.createRelation(sparkSession.sqlContext, caseInsensitiveOptions)
([https://github.com/apache/spark/blob/v2.4.4/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala]
L318),
so the input optParams is a CaseInsensitiveMap. Hudi attaches additional
parameters such as
val parameters = Map(QUERY_TYPE_OPT_KEY -> DEFAULT_QUERY_TYPE_OPT_VAL) ++ translateViewTypesToQueryTypes(optParams)
which converts parameters to a plain Map instead of a CaseInsensitiveMap.
When the parquet data source infers the partition info, it fetches the basePath value through
parameters.get(BASE_PATH_PARAM)
([https://github.com/apache/spark/blob/v2.4.4/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileIndex.scala]
L196). That get does not call CaseInsensitiveMap#get; it just calls
Map#get("basePath") and returns None, so no partition info is inferred.
I also found that Spark 2.4.7 and above
(https://issues.apache.org/jira/browse/SPARK-32364) uses CaseInsensitiveMap
to fetch basePath, although its intention is not the same as this Hudi issue.
Lower Spark versions still have this issue.
So we need to use
val parameters = translateViewTypesToQueryTypes(optParams) ++ Map(QUERY_TYPE_OPT_KEY -> DEFAULT_QUERY_TYPE_OPT_VAL)
for two reasons: 1. lower Spark versions also have this issue; 2. the current
code converts the original type.
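The effect of swapping the operands can be sketched as follows, assuming (as the description implies) that adding entries to the case-insensitive map yields another case-insensitive map. CaseInsensitiveOptions and plusAll are hypothetical names, and "query.type" with its values are placeholders.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

class OperandOrderSketch {

    /** Hypothetical case-insensitive options map whose merge keeps its own type. */
    static final class CaseInsensitiveOptions {
        private final Map<String, String> lowerCased = new HashMap<>();

        CaseInsensitiveOptions(Map<String, String> original) {
            original.forEach((k, v) -> lowerCased.put(k.toLowerCase(Locale.ROOT), v));
        }

        String get(String key) {
            return lowerCased.get(key.toLowerCase(Locale.ROOT));
        }

        /** Mimics "caseInsensitive ++ plain": stays case-insensitive, right-hand entries win. */
        CaseInsensitiveOptions plusAll(Map<String, String> other) {
            Map<String, String> merged = new HashMap<>(lowerCased);
            other.forEach((k, v) -> merged.put(k.toLowerCase(Locale.ROOT), v));
            return new CaseInsensitiveOptions(merged);
        }
    }

    public static void main(String[] args) {
        Map<String, String> user = new HashMap<>();
        user.put("basePath", "/tmp/hudi_table");
        user.put("query.type", "incremental");  // placeholder key and value

        Map<String, String> defaults = new HashMap<>();
        defaults.put("query.type", "snapshot"); // placeholder default

        CaseInsensitiveOptions parameters = new CaseInsensitiveOptions(user).plusAll(defaults);

        System.out.println(parameters.get("basePath"));   // still found, lookup is case-insensitive
        System.out.println(parameters.get("query.type")); // the default replaced the user value
    }
}
```

One thing to check with this ordering: in the sketch, as with Scala's ++, the right-hand operand wins on key collisions, so the default would replace an explicitly supplied query type.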
> lose partition info when using spark parameter "basePath"
> ----------------------------------------------------------
>
> Key: HUDI-1392
> URL: https://issues.apache.org/jira/browse/HUDI-1392
> Project: Apache Hudi
> Issue Type: Bug
> Components: Spark Integration
> Reporter: steven zhang
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.6.1