[jira] [Updated] (HUDI-1392) lose partition info when using spark parameter "basePath"

steven zhang (Jira) Tue, 24 Nov 2020 21:57:04 -0800


     [ 
https://issues.apache.org/jira/browse/HUDI-1392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


steven zhang updated HUDI-1392:
-------------------------------
    Description: 
Steps to reproduce:

        set hoodie.datasource.write.hive_style_partitioning -> true

        spark.read().format("org.apache.hudi")
            .option("mergeSchema", true)
            .option("basePath", tablePath)
            .load(tablePath + (nonPartitionedTable ? "/*" : "/*"))
            .createOrReplaceTempView(hudiTable);

        spark.sql("select * from hudiTable where date>'20200807'").explain();

        The plan prints PartitionFilters: [], i.e. the filter on date is not
        used as a partition filter.

The reason is as follows.

step 1. Spark resolves the data source
(https://github.com/apache/spark/blob/v2.4.4/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala
 L317)

 

          // caseInsensitiveOptions is of type CaseInsensitiveMap
          case (dataSource: RelationProvider, None) =>
            dataSource.createRelation(sparkSession.sqlContext, caseInsensitiveOptions)

 

step 2. Hudi creates the relation

         org.apache.hudi.DefaultSource#createRelation(sqlContext: SQLContext, optParams: Map[String, String], schema: StructType): BaseRelation = {

 

         // optParams is a CaseInsensitiveMap here, but the result of
         // Map ++ is a plain Map, so parameters loses that type
         val parameters = Map(QUERY_TYPE_OPT_KEY -> DEFAULT_QUERY_TYPE_OPT_VAL) ++ translateViewTypesToQueryTypes(optParams)
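
A minimal Java sketch of the effect described in step 2, for illustration
only. CaseInsensitiveOptions is a hypothetical stand-in for Spark's
CaseInsensitiveMap (assumed here to store lower-cased keys and to lower-case
the key on lookup), and the "query.type" key and its value are placeholders,
not Hudi's real constants. Copying the entries into a plain map plays the role
of Map ++:

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

class OptionsTypeLoss {
    // Hypothetical stand-in for Spark's CaseInsensitiveMap.
    static final class CaseInsensitiveOptions {
        private final Map<String, String> lowerCased = new HashMap<>();

        CaseInsensitiveOptions(Map<String, String> original) {
            // Keys are stored lower-cased.
            for (Map.Entry<String, String> e : original.entrySet()) {
                lowerCased.put(e.getKey().toLowerCase(Locale.ROOT), e.getValue());
            }
        }

        // Lookups lower-case the requested key, so any casing matches.
        String get(String key) {
            return lowerCased.get(key.toLowerCase(Locale.ROOT));
        }

        // Iterating the options exposes the lower-cased keys.
        Map<String, String> entries() {
            return new HashMap<>(lowerCased);
        }
    }

    // Analogue of `Map(default) ++ optParams`: the result is a plain map
    // that holds the lower-cased keys but does exact-match lookups.
    static Map<String, String> mergeIntoPlainMap(CaseInsensitiveOptions opts) {
        Map<String, String> merged = new HashMap<>();
        merged.put("query.type", "default"); // placeholder key and value
        merged.putAll(opts.entries());
        return merged;
    }

    public static void main(String[] args) {
        Map<String, String> user = new HashMap<>();
        user.put("basePath", "/tmp/hudi_table");
        CaseInsensitiveOptions opts = new CaseInsensitiveOptions(user);
        Map<String, String> merged = mergeIntoPlainMap(opts);

        System.out.println(opts.get("basePath"));   // prints /tmp/hudi_table
        System.out.println(merged.get("basePath")); // prints null
        System.out.println(merged.get("basepath")); // prints /tmp/hudi_table
    }
}
```

The value is still present after the merge, but only under the lower-cased
key, which is why an exact-match lookup of "basePath" finds nothing.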

 

step 3. For a copy-on-write (COW) table, Hudi hands the query to a parquet
relation

         It calls getBaseFileOnlyView(sqlContext, parameters, schema, readPaths, isBootstrappedTable, globPaths, metaClient)

 

which creates a new DataSource and relation instance (inside this method the
parameters value passed above appears to be named optParams):
DataSource.apply(sparkSession = sqlContext.sparkSession, paths = extraReadPaths, userSpecifiedSchema = Option(schema), className = "parquet", options = optParams).resolveRelation()

 

step 4. Spark fetches basePath to infer partition info
(https://github.com/apache/spark/blob/v2.4.4/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileIndex.scala
 L196)

           // parameters comes from DataSource#options (a plain Map)

          parameters.get(BASE_PATH_PARAM)

          So parameters.get(BASE_PATH_PARAM) calls Map#get, not
CaseInsensitiveMap#get. The map stores the key as "basepath", so
get("basePath") returns None.

This is a Spark bug (fixed in version 3.0.1, see
https://issues.apache.org/jira/browse/SPARK-32368). Hudi currently uses Spark
v2.4.4.

To avoid this Spark issue, a simple solution is to not convert the type of the
input optParams (Spark has already made it a CaseInsensitiveMap) in
org.apache.hudi.DefaultSource#createRelation(sqlContext: SQLContext,optParams: 
Map[String, String]…
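
One way to picture the suggested fix, as a hedged Java sketch and not the
actual Hudi patch: the default is added to a map that stays case-insensitive
instead of being merged into a plain map. A TreeMap with
String.CASE_INSENSITIVE_ORDER stands in for CaseInsensitiveMap; withDefaults
and the "query.type" key and value are hypothetical names used only for this
illustration.

```java
import java.util.Map;
import java.util.TreeMap;

class KeepCaseInsensitive {
    // Adds a default while keeping the result case-insensitive.
    // User-supplied options are put last, so they override the default.
    static Map<String, String> withDefaults(Map<String, String> optParams) {
        Map<String, String> params = new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
        params.put("query.type", "default"); // placeholder key and value
        params.putAll(optParams);
        return params;
    }

    public static void main(String[] args) {
        Map<String, String> optParams = new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
        optParams.put("basepath", "/tmp/hudi_table"); // key stored lower-cased

        Map<String, String> params = withDefaults(optParams);
        System.out.println(params.get("basePath"));   // prints /tmp/hudi_table
        System.out.println(params.get("query.type")); // prints default
    }
}
```

Because the merged map keeps case-insensitive lookup, a later
get("basePath") still finds the value stored under "basepath".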

  

  was:
Reproduce the issue with below steps:

        set hoodie.datasource.write.hive_style_partitioning->true

        spark.read().format("org.apache.hudi").option("mergeSchema", 
true).option("basePath",tablePath).load(tablePath + (nonPartitionedTable ? "/*" 
: "/*")).createOrReplaceTempView(hudiTable);

        spark.sql("select * from hudiTable where date>'20200807'").explain();

        print PartitionFilters: []

The cause of this issue is that org.apache.hudi.DefaultSource#createRelation is
called by dataSource.createRelation(sparkSession.sqlContext,
caseInsensitiveOptions)
([https://github.com/apache/spark/blob/v2.4.4/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala]
 L318)

The input optParams is of type CaseInsensitiveMap. Hudi attaches additional
parameters, such as

val parameters = Map(QUERY_TYPE_OPT_KEY -> DEFAULT_QUERY_TYPE_OPT_VAL) ++ translateViewTypesToQueryTypes(optParams)

so the type of parameters becomes Map, not CaseInsensitiveMap.

When the parquet data source infers partition info, it fetches the basePath
value through parameters.get(BASE_PATH_PARAM) (
[https://github.com/apache/spark/blob/v2.4.4/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileIndex.scala]
 L196). This get does not call CaseInsensitiveMap#get; it calls
Map#get("basePath") and returns None, so no partition info is inferred.

I also found that Spark 2.4.7 and above (
https://issues.apache.org/jira/browse/SPARK-32364 ) uses CaseInsensitiveMap to
fetch basePath, although the intention of that change is not the same as this
Hudi issue. Lower Spark versions also have this issue.

So we need to use

val parameters = translateViewTypesToQueryTypes(optParams) ++ Map(QUERY_TYPE_OPT_KEY -> DEFAULT_QUERY_TYPE_OPT_VAL)

for two reasons: 1. lower Spark versions also have this issue; 2. the original
type is otherwise converted.

  


> lose partition info when using spark parameter "basePath" 
> ----------------------------------------------------------
>
>                 Key: HUDI-1392
>                 URL: https://issues.apache.org/jira/browse/HUDI-1392
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: Spark Integration
>            Reporter: steven zhang
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.6.1



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
