yui2010 commented on pull request #2243:
URL: https://github.com/apache/hudi/pull/2243#issuecomment-732670918
Hi garyli1019,
Let me try to describe it more clearly.
Set hoodie.datasource.write.hive_style_partitioning to true, then run:
spark.read().format("org.apache.hudi")
    .option("mergeSchema", true)
    .option("basePath", tablePath)
    .load(tablePath + (nonPartitionedTable ? "/" : "/"))
    .createOrReplaceTempView(hudiTable);
spark.sql("select * from hudiTable where date>'20200807'").explain();
Step 1. Spark resolves the data source
(https://github.com/apache/spark/blob/v2.4.4/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala
line 317):
case (dataSource: RelationProvider, None) =>
  dataSource.createRelation(sparkSession.sqlContext, caseInsensitiveOptions)
// **caseInsensitiveOptions is of type CaseInsensitiveMap**
Step 2. Hudi creates the relation:
org.apache.hudi.DefaultSource#createRelation(sqlContext: SQLContext,
    optParams: Map[String, String], schema: StructType): BaseRelation = {
// optParams arrives as a CaseInsensitiveMap, but **parameters is converted
// to a plain Map through Map ++**
val parameters = Map(QUERY_TYPE_OPT_KEY -> DEFAULT_QUERY_TYPE_OPT_VAL) ++
    translateViewTypesToQueryTypes(optParams)
Step 3. When we query a copy-on-write (COW) table, Hudi delegates to a parquet
relation. It calls
getBaseFileOnlyView(sqlContext, parameters, schema, readPaths,
    isBootstrappedTable, globPaths, metaClient)
which creates a new DataSource and relation instance with:
DataSource.apply(sparkSession = sqlContext.sparkSession,
    paths = extraReadPaths, userSpecifiedSchema = Option(schema),
    className = "parquet", options = optParams).resolveRelation()
Step 4. Spark looks up basePath to infer the partition info
(https://github.com/apache/spark/blob/v2.4.4/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileIndex.scala
line 196):
// parameters comes from DataSource#options (a plain Map at this point)
parameters.get(BASE_PATH_PARAM)
So parameters.get(BASE_PATH_PARAM) calls Map#get, not CaseInsensitiveMap#get.
The map stores the key as "basepath", so looking up "basePath" returns None.
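The lookup failure described above can be reproduced without Spark. The sketch below is only a model: CaseInsensitiveOptions is a hypothetical, simplified stand-in for Spark's CaseInsensitiveMap (not the real class), and the option key and path are illustrative values. It assumes, as the steps above describe, that the wrapper exposes lower-cased keys once it is copied into a plain Map.

```scala
import java.util.Locale

object OptionsCaseDemo {
  // Hypothetical, simplified stand-in for Spark's CaseInsensitiveMap:
  // lookups ignore case, but the entries it hands out have lower-cased keys.
  final class CaseInsensitiveOptions(original: Map[String, String]) {
    private val lowerCased: Map[String, String] =
      original.map { case (k, v) => k.toLowerCase(Locale.ROOT) -> v }

    def get(key: String): Option[String] =
      lowerCased.get(key.toLowerCase(Locale.ROOT))

    def toMap: Map[String, String] = lowerCased
  }

  // Illustrative user option, as passed via .option("basePath", tablePath).
  val userOptions =
    new CaseInsensitiveOptions(Map("basePath" -> "/tmp/hudi_table"))

  // Analogous to `Map(default) ++ optParams`: the result is a plain Map,
  // so lookups are case-sensitive from here on.
  val merged: Map[String, String] =
    Map("hoodie.datasource.query.type" -> "snapshot") ++ userOptions.toMap

  def main(args: Array[String]): Unit = {
    println(userOptions.get("basePath")) // Some(/tmp/hudi_table)
    println(merged.get("basePath"))      // None
    println(merged.get("basepath"))      // Some(/tmp/hudi_table)
  }
}
```

The value is still in the merged map, but only under the lower-cased key, which is why the case-sensitive lookup for "basePath" comes back empty.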
This is a Spark bug, fixed in version 3.0.1
(https://issues.apache.org/jira/browse/SPARK-32368). Hudi currently uses Spark
v2.4.4.
To avoid this Spark issue, a simple solution is to not convert the type of
the input optParams (Spark has already made it a CaseInsensitiveMap) in
org.apache.hudi.DefaultSource#createRelation(sqlContext: SQLContext,optParams:
Map[String, String]…
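One way to picture the suggestion is to add the default through the wrapper itself, so the result stays case-insensitive. This is a rough sketch under assumptions, not the actual change in the PR: CaseInsensitiveOptions is again a hypothetical stand-in for Spark's CaseInsensitiveMap, withDefault is a made-up helper, and the key and values are illustrative.

```scala
import java.util.Locale

object PreserveCaseInsensitivityDemo {
  // Hypothetical, simplified stand-in for Spark's CaseInsensitiveMap.
  final class CaseInsensitiveOptions(original: Map[String, String]) {
    private val lowerCased: Map[String, String] =
      original.map { case (k, v) => k.toLowerCase(Locale.ROOT) -> v }

    def get(key: String): Option[String] =
      lowerCased.get(key.toLowerCase(Locale.ROOT))

    def contains(key: String): Boolean = get(key).isDefined

    // Adding an entry returns the wrapper type again, not a plain Map.
    def +(kv: (String, String)): CaseInsensitiveOptions = {
      val withoutOld = original.filter { case (k, _) => !k.equalsIgnoreCase(kv._1) }
      new CaseInsensitiveOptions(withoutOld + kv)
    }
  }

  // Hypothetical helper: apply a default only when the user did not set the
  // key, mirroring `Map(default) ++ optParams` where user options win.
  def withDefault(opts: CaseInsensitiveOptions, key: String,
                  default: String): CaseInsensitiveOptions =
    if (opts.contains(key)) opts else opts + (key -> default)

  val userOptions =
    new CaseInsensitiveOptions(Map("basePath" -> "/tmp/hudi_table"))

  val parameters: CaseInsensitiveOptions =
    withDefault(userOptions, "hoodie.datasource.query.type", "snapshot")

  def main(args: Array[String]): Unit = {
    println(parameters.get("basePath"))                     // Some(/tmp/hudi_table)
    println(parameters.get("hoodie.datasource.query.type")) // Some(snapshot)
  }
}
```

Because parameters keeps the wrapper type, a later lookup of "basePath" succeeds regardless of how the key is cased.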