[GitHub] [hudi] yui2010 commented on pull request #2243: HUDI-1392 lose partition info when using spark parameter basePath

GitBox Mon, 23 Nov 2020 21:55:16 -0800


yui2010 commented on pull request #2243:
URL: https://github.com/apache/hudi/pull/2243#issuecomment-732670918



   
   Hi garyli1019,
   
   Let me try to describe the problem more clearly.
   
           set hoodie.datasource.write.hive_style_partitioning -> true
           spark.read().format("org.apache.hudi")
               .option("mergeSchema", true)
               .option("basePath", tablePath)
               .load(tablePath + (nonPartitionedTable ? "/" : "/"))
               .createOrReplaceTempView(hudiTable);
           spark.sql("select * from hudiTable where date>'20200807'").explain();
   
   Step 1. Spark reads the data source
   (https://github.com/apache/spark/blob/v2.4.4/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala, line 317)
   
             // caseInsensitiveOptions is of type CaseInsensitiveMap
             case (dataSource: RelationProvider, None) =>
               dataSource.createRelation(sparkSession.sqlContext, caseInsensitiveOptions)
   
   Step 2. Hudi creates the relation
            org.apache.hudi.DefaultSource#createRelation(sqlContext: SQLContext, optParams: Map[String, String], schema: StructType): BaseRelation = {
   
            // optParams is a CaseInsensitiveMap, but parameters becomes a
            // plain Map because it is built through Map ++
            val parameters = Map(QUERY_TYPE_OPT_KEY -> DEFAULT_QUERY_TYPE_OPT_VAL) ++ translateViewTypesToQueryTypes(optParams)
   
   Step 3. Hudi delegates to a parquet relation when we query a COW table
            It calls getBaseFileOnlyView(sqlContext, parameters, schema, readPaths, isBootstrappedTable, globPaths, metaClient),
            which creates a new DataSource and relation instance (inside that
            method, optParams is the plain Map passed in as parameters above):
   
            DataSource.apply(sparkSession = sqlContext.sparkSession, paths = extraReadPaths, userSpecifiedSchema = Option(schema), className = "parquet", options = optParams).resolveRelation()
   
   Step 4. Spark fetches basePath to infer the partition info
   (https://github.com/apache/spark/blob/v2.4.4/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileIndex.scala, line 196)
   
             // parameters comes from DataSource#options (a plain Map)
             parameters.get(BASE_PATH_PARAM)
   
             So parameters.get(BASE_PATH_PARAM) calls Map#get, not
             CaseInsensitiveMap#get. parameters stores the key as "basepath",
             so get("basePath") returns None.
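
The effect described in steps 2 and 4 can be reproduced without Spark. The sketch below is only an illustration: CaseInsensitiveOptions is a hypothetical stand-in that mimics the behaviour described above (case-insensitive lookups, lower-cased keys on iteration), and the option key and path values are illustrative assumptions.

```scala
import java.util.Locale

// Hypothetical stand-in for Spark's CaseInsensitiveMap (not the real class):
// lookups ignore case, while iteration exposes the lower-cased keys.
final class CaseInsensitiveOptions(original: Map[String, String])
    extends Iterable[(String, String)] {

  private val lowerCased: Map[String, String] =
    original.map { case (k, v) => k.toLowerCase(Locale.ROOT) -> v }

  def get(key: String): Option[String] =
    lowerCased.get(key.toLowerCase(Locale.ROOT))

  override def iterator: Iterator[(String, String)] = lowerCased.iterator
}

object MapConcatDemo {
  // What Spark hands to createRelation (illustrative path value).
  val optParams = new CaseInsensitiveOptions(Map("basePath" -> "/tmp/hudi_table"))

  // Same shape as step 2: plain Map ++ case-insensitive options.
  // The result is a plain Map holding the lower-cased key "basepath".
  val parameters: Map[String, String] =
    Map("hoodie.datasource.query.type" -> "snapshot") ++ optParams

  def main(args: Array[String]): Unit = {
    println(optParams.get("basePath"))  // Some(/tmp/hudi_table)
    println(parameters.get("basePath")) // None
    println(parameters.get("basepath")) // Some(/tmp/hudi_table)
  }
}
```

The case-sensitive lookup on the plain Map misses the key, which matches the None seen in step 4.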
   
   This is a Spark bug (fixed in version 3.0.1,
   https://issues.apache.org/jira/browse/SPARK-32368). Hudi currently uses
   Spark v2.4.4.
   To avoid this Spark issue, a simple solution is to not convert the type of
   the input optParams (Spark already makes it a CaseInsensitiveMap) in
   org.apache.hudi.DefaultSource#createRelation(sqlContext: SQLContext,optParams: 
Map[String, String]…
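
One way to picture the idea of keeping the incoming type is sketched below. This is not the actual patch in the pull request: CaseInsensitiveOpts and withDefault are hypothetical names, and the sketch only assumes that adding an entry to a case-insensitive wrapper yields another case-insensitive wrapper.

```scala
import java.util.Locale

// Hypothetical stand-in for a case-insensitive options map (not Spark's class).
final class CaseInsensitiveOpts(original: Map[String, String]) {
  private val lowerCased: Map[String, String] =
    original.map { case (k, v) => k.toLowerCase(Locale.ROOT) -> v }

  def get(key: String): Option[String] =
    lowerCased.get(key.toLowerCase(Locale.ROOT))

  def contains(key: String): Boolean =
    lowerCased.contains(key.toLowerCase(Locale.ROOT))

  // Adding an entry returns another case-insensitive wrapper.
  def +(kv: (String, String)): CaseInsensitiveOpts =
    new CaseInsensitiveOpts(lowerCased + kv)
}

object KeepCaseInsensitiveDemo {
  // Hypothetical helper: apply a default only if the caller did not set the key,
  // so user-supplied options still win, as with Map(default) ++ optParams.
  def withDefault(opts: CaseInsensitiveOpts, key: String, value: String): CaseInsensitiveOpts =
    if (opts.contains(key)) opts else opts + (key -> value)

  val optParams = new CaseInsensitiveOpts(Map("basePath" -> "/tmp/hudi_table"))

  // The default is added without falling back to a plain Map.
  val parameters: CaseInsensitiveOpts =
    withDefault(optParams, "hoodie.datasource.query.type", "snapshot")

  def main(args: Array[String]): Unit = {
    println(parameters.get("basePath"))                     // Some(/tmp/hudi_table)
    println(parameters.get("hoodie.datasource.query.type")) // Some(snapshot)
  }
}
```

Because the wrapper type is preserved, a later lookup of "basePath" still succeeds regardless of how the key is cased.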
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
