[
https://issues.apache.org/jira/browse/HUDI-1484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
pengzhiwei updated HUDI-1484:
-----------------------------
Summary: Missing Encode the partition path value when it contains "/"
(was: Missing Encode the partition path when the value contains "/")
> Missing Encode the partition path value when it contains "/"
> ------------------------------------------------------------
>
> Key: HUDI-1484
> URL: https://issues.apache.org/jira/browse/HUDI-1484
> Project: Apache Hudi
> Issue Type: Improvement
> Components: Writer Core
> Reporter: pengzhiwei
> Assignee: pengzhiwei
> Priority: Major
> Fix For: 0.7.0
>
>
> Currently Hudi does not encode the partition path when the partitionPath value
> contains "/". This leads to incorrect query results.
> The following code triggers a query exception:
> {code:java}
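> import org.apache.hudi.DataSourceWriteOptions._
> import org.apache.hudi.config.HoodieWriteConfig._
> import org.apache.spark.sql.SaveMode.Overwrite
> import spark.implicits._
> // Every row carries the same partition value, which contains "/".
> val df = (for (i <- 0 until 10) yield (i, s"a$i", 10 + i, 100 * i + 10000, "12/20/1"))
>   .toDF("id", "name", "price", "version", "pt")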
> df.write.format("hudi")
> .option(TABLE_NAME, tableName)
> .option(RECORDKEY_FIELD_OPT_KEY, "id")
> .option(PRECOMBINE_FIELD_OPT_KEY, "version")
> // Assumed from the description (pt is the partition column); not set in the snippet as posted.
> .option(PARTITIONPATH_FIELD_OPT_KEY, "pt")
> .option(OPERATION_OPT_KEY, UPSERT_OPERATION_OPT_VAL)
> .option(INSERT_PARALLELISM, "8")
> .option(UPSERT_PARALLELISM, "8")
> .mode(Overwrite)
> .save(basePath)
> val readDf = spark.read.format("hudi")
> .load(basePath + "/*/*")
> readDf.createOrReplaceTempView("r1")
> spark.sql(s"select count(1) from r1").show()
> {code}
> The following exception is thrown:
> {code:java}
> Exception in thread "main" org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.;
>   at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:190)
>   at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:190)
>   at scala.Option.getOrElse(Option.scala:121)
>   at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:189)
>   at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:417)
> {code}
> The generated output directory for pt = '12/20/1' is a multi-level directory,
> *${basePath}/12/20/1*. However, *pt* is a single-level partition, so the
> directory depth no longer matches the partition depth, which causes the
> query exception.
> I tested a similar case using Parquet in spark-sql with pt = '12/20/1';
> the generated output directory is *${basePath}/12%2F20%2F1*, in which the
> partition path has been encoded.
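
To illustrate the encoding described above, here is a minimal Java sketch of percent-escaping a partition value so that it always maps to a single directory name. The class name `PartitionPathEscaper`, its methods, and the set of characters treated as unsafe are hypothetical and chosen for illustration only; this is not Hudi's or Spark's actual implementation, just one way the behavior could look.

```java
final class PartitionPathEscaper {
    // Hypothetical set of characters assumed unsafe inside one directory name.
    private static final String UNSAFE = "/\\:=%\"#*?<>|";

    // Replace each unsafe or control character with %XX (uppercase hex).
    static String escape(String value) {
        StringBuilder sb = new StringBuilder(value.length());
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            if (c < 0x20 || c == 0x7F || UNSAFE.indexOf(c) >= 0) {
                sb.append('%').append(String.format("%02X", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    // Reverse of escape: turn each valid %XX sequence back into its character.
    static String unescape(String path) {
        StringBuilder sb = new StringBuilder(path.length());
        int i = 0;
        while (i < path.length()) {
            char c = path.charAt(i);
            if (c == '%' && i + 2 < path.length()) {
                int hi = Character.digit(path.charAt(i + 1), 16);
                int lo = Character.digit(path.charAt(i + 2), 16);
                if (hi >= 0 && lo >= 0) {
                    sb.append((char) (hi * 16 + lo));
                    i += 3;
                    continue;
                }
            }
            // Not a valid escape sequence: keep the character as-is.
            sb.append(c);
            i++;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String encoded = escape("12/20/1");
        System.out.println(encoded);           // prints 12%2F20%2F1
        System.out.println(unescape(encoded)); // prints 12/20/1
    }
}
```

With escaping like this applied on the write path, pt = '12/20/1' would land in one directory level under the base path, and the reader would need to unescape the directory name to recover the original value. Escaping "%" itself is what keeps the round trip unambiguous.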
--
This message was sent by Atlassian Jira
(v8.3.4#803005)