[
https://issues.apache.org/jira/browse/HUDI-1392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Gary Li resolved HUDI-1392.
---------------------------
Resolution: Fixed
> lose partition info when using spark parameter "basePath"
> ----------------------------------------------------------
>
> Key: HUDI-1392
> URL: https://issues.apache.org/jira/browse/HUDI-1392
> Project: Apache Hudi
> Issue Type: Bug
> Components: Spark Integration
> Reporter: steven zhang
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.6.1
>
>
> Steps to reproduce:
> set hoodie.datasource.write.hive_style_partitioning -> true
> spark.read().format("org.apache.hudi").option("mergeSchema",
> true).option("basePath",tablePath).load(tablePath + (nonPartitionedTable ?
> "/*" : "/*")).createOrReplaceTempView(hudiTable);
> spark.sql("select * from hudiTable where date>'20200807'").explain();
> The plan prints PartitionFilters: [], so no partition pruning takes place.
> The reason is:
> step 1. Spark resolves the data source
> (https://github.com/apache/spark/blob/v2.4.4/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala
> L 317)
>
> case (dataSource: RelationProvider, None) =>
> dataSource.createRelation(sparkSession.sqlContext, caseInsensitiveOptions)
> // caseInsensitiveOptions is of type CaseInsensitiveMap
>
> step 2. Hudi creates the relation
> org.apache.hudi.DefaultSource#createRelation(sqlContext:
> SQLContext,optParams: Map[String, String],schema: StructType): BaseRelation =
> {
>
> // optParams is a CaseInsensitiveMap, but Map ++ converts the
> result (parameters) to a plain Map
> val parameters = Map(QUERY_TYPE_OPT_KEY ->
> DEFAULT_QUERY_TYPE_OPT_VAL) ++ translateViewTypesToQueryTypes(optParams)
>
> step 3. When querying a COW table, Hudi delegates to a parquet relation
> by calling getBaseFileOnlyView(sqlContext, parameters,
> schema, readPaths, isBootstrappedTable, globPaths, metaClient)
>
> which creates a new DataSource and relation instance with:
> DataSource.apply(sparkSession = sqlContext.sparkSession,paths =
> extraReadPaths,userSpecifiedSchema = Option(schema),className =
> "parquet",options = optParams).resolveRelation()
>
> step 4. Spark fetches basePath to infer partition info
> (https://github.com/apache/spark/blob/v2.4.4/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileIndex.scala
> L196)
> // parameters comes from DataSource#options (a plain Map)
> parameters.get(BASE_PATH_PARAM)
> So parameters.get(BASE_PATH_PARAM) calls Map#get, not
> CaseInsensitiveMap#get. parameters stores the key "basepath", so
> get("basePath") returns None.
> This is a Spark bug (fixed in version 3.0.1,
> https://issues.apache.org/jira/browse/SPARK-32368); Hudi currently uses Spark
> v2.4.4.
> To avoid this Spark issue, a simple solution is to not convert
> the input optParams (Spark already makes it a CaseInsensitiveMap) in
> org.apache.hudi.DefaultSource#createRelation(sqlContext:
> SQLContext,optParams: Map[String, String]…
>
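
The lookup failure described in steps 2 and 4 can be illustrated with a small
standalone model. This is only a sketch under stated assumptions, not Spark or
Hudi code: CaseInsensitiveOptions and mergeIntoPlainMap are hypothetical
stand-ins for a case-insensitive options map (assumed to expose lower-cased
keys when iterated) and for the Map ++ merge, and the table path and the
default option are made up.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

final class BasePathLookupDemo {

    /** Hypothetical stand-in for a case-insensitive options map; keys are stored lower-cased. */
    static final class CaseInsensitiveOptions {
        private final Map<String, String> lowerCased = new HashMap<>();

        CaseInsensitiveOptions(Map<String, String> original) {
            original.forEach((k, v) -> lowerCased.put(k.toLowerCase(Locale.ROOT), v));
        }

        /** Case-insensitive lookup: the requested key is lower-cased first. */
        String get(String key) {
            return lowerCased.get(key.toLowerCase(Locale.ROOT));
        }

        /** What a copy into a plain map sees: only the lower-cased keys. */
        Map<String, String> entries() {
            return lowerCased;
        }
    }

    /** Models the merge in step 2: the result is a plain map with lower-cased keys. */
    static Map<String, String> mergeIntoPlainMap(Map<String, String> defaults,
                                                 CaseInsensitiveOptions opts) {
        Map<String, String> merged = new HashMap<>(defaults);
        merged.putAll(opts.entries());
        return merged;
    }

    public static void main(String[] args) {
        Map<String, String> userOptions = new HashMap<>();
        userOptions.put("basePath", "/tmp/hudi_table"); // hypothetical table path

        Map<String, String> defaults = new HashMap<>();
        defaults.put("example.query.type", "snapshot"); // hypothetical default option

        CaseInsensitiveOptions opts = new CaseInsensitiveOptions(userOptions);
        Map<String, String> merged = mergeIntoPlainMap(defaults, opts);

        System.out.println(opts.get("basePath"));   // found: lookup is case-insensitive
        System.out.println(merged.get("basePath")); // null: plain map only has "basepath"
        System.out.println(merged.get("basepath")); // found under the lower-cased key
    }
}
```

In this model, handing the original case-insensitive object to the downstream
reader instead of the merged plain map keeps the "basePath" lookup working,
which is the idea behind the proposed fix.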
--
This message was sent by Atlassian Jira
(v8.3.4#803005)