[jira] [Resolved] (HUDI-1392) lose partition info when using spark parameter "basePath"

Gary Li (Jira) Wed, 25 Nov 2020 07:52:07 -0800


     [ 
https://issues.apache.org/jira/browse/HUDI-1392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Gary Li resolved HUDI-1392.
---------------------------
    Resolution: Fixed

> lose partition info when using spark parameter "basePath" 
> ----------------------------------------------------------
>
>                 Key: HUDI-1392
>                 URL: https://issues.apache.org/jira/browse/HUDI-1392
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: Spark Integration
>            Reporter: steven zhang
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.6.1
>
>
> Steps to reproduce:
>         set hoodie.datasource.write.hive_style_partitioning -> true
>         spark.read().format("org.apache.hudi")
>             .option("mergeSchema", true)
>             .option("basePath", tablePath)
>             .load(tablePath + (nonPartitionedTable ? "/*" : "/*"))
>             .createOrReplaceTempView(hudiTable);
>         spark.sql("select * from hudiTable where date>'20200807'").explain();
>         The plan prints PartitionFilters: [], i.e. the filter on date is not treated as a partition filter.
> The reason is:
> step 1. Spark reads the datasource
> (https://github.com/apache/spark/blob/v2.4.4/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala
>  L317)
>  
>           case (dataSource: RelationProvider, None) =>
>             dataSource.createRelation(sparkSession.sqlContext, caseInsensitiveOptions)
>           // caseInsensitiveOptions is a CaseInsensitiveMap
>  
> step 2. Hudi creates the relation
>          org.apache.hudi.DefaultSource#createRelation(sqlContext: SQLContext,
>              optParams: Map[String, String], schema: StructType): BaseRelation = {
>  
>          // optParams is a CaseInsensitiveMap, but the result of Map ++ is a
>          // plain (case-sensitive) Map
>          val parameters = Map(QUERY_TYPE_OPT_KEY -> DEFAULT_QUERY_TYPE_OPT_VAL) ++
>              translateViewTypesToQueryTypes(optParams)
>  
> step 3. Hudi falls back to a parquet relation when the query reads a COW table
>          It then calls getBaseFileOnlyView(sqlContext, parameters, schema,
>              readPaths, isBootstrappedTable, globPaths, metaClient)
>  
> which creates a new DataSource and relation instance with:
>          DataSource.apply(sparkSession = sqlContext.sparkSession,
>              paths = extraReadPaths, userSpecifiedSchema = Option(schema),
>              className = "parquet", options = optParams).resolveRelation()
>  
> step 4. Spark fetches basePath to infer partition info
> (https://github.com/apache/spark/blob/v2.4.4/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileIndex.scala
>  L196)
>           // parameters comes from DataSource#options (a plain Map)
>           parameters.get(BASE_PATH_PARAM)
>           So parameters.get(BASE_PATH_PARAM) calls Map#get, not
> CaseInsensitiveMap#get. The map stores the key as "basepath", so looking up
> "basePath" returns None.
> This is a Spark bug (fixed in version 3.0.1,
> https://issues.apache.org/jira/browse/SPARK-32368). Hudi currently uses Spark
> v2.4.4.
> To avoid this Spark issue, a simple solution is to not convert the type of
> the input optParams (Spark has already made it a CaseInsensitiveMap) in
> org.apache.hudi.DefaultSource#createRelation(sqlContext: 
> SQLContext,optParams: Map[String, String]…
>   
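
The key-case loss described in steps 2 and 4 can be reproduced without Spark. The sketch below is only an illustration: LowerCasingOptions is a hypothetical, simplified stand-in for Spark 2.4.4's CaseInsensitiveMap (assumed here to ignore case on lookup but expose lower-cased keys on iteration, as the description implies), and the table path and the "example.default.key" entry are made up.

```scala
import java.util.Locale

// Hypothetical, simplified stand-in for Spark 2.4.4's CaseInsensitiveMap:
// lookups ignore case, but iteration exposes only lower-cased keys.
final class LowerCasingOptions(original: Map[String, String]) {
  private val lowerCased: Map[String, String] =
    original.map { case (k, v) => k.toLowerCase(Locale.ROOT) -> v }

  def get(key: String): Option[String] =
    lowerCased.get(key.toLowerCase(Locale.ROOT))

  def iterator: Iterator[(String, String)] = lowerCased.iterator
}

object BasePathLookupDemo {
  val BasePathParam = "basePath"

  // Options as the data source would receive them (path is made up).
  val optParams = new LowerCasingOptions(Map("basePath" -> "/tmp/hudi_table"))

  // Mirrors the `Map(...) ++ optParams` pattern from step 2: the result is a
  // plain, case-sensitive Map that holds the lower-cased keys.
  val converted: Map[String, String] =
    Map("example.default.key" -> "value") ++ optParams.iterator

  def main(args: Array[String]): Unit = {
    println(optParams.get(BasePathParam)) // Some(/tmp/hudi_table)
    println(converted.get(BasePathParam)) // None: the case-sensitive lookup misses
    println(converted.get("basepath"))    // Some(/tmp/hudi_table)
  }
}
```

Passing optParams through unchanged, as the proposed solution suggests, keeps the case-insensitive lookup and therefore still finds "basePath".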



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
