pengzhiwei created HUDI-1484:
--------------------------------
Summary: Missing Encode the partition path when the value contains
"/"
Key: HUDI-1484
URL: https://issues.apache.org/jira/browse/HUDI-1484
Project: Apache Hudi
Issue Type: Improvement
Components: Writer Core
Reporter: pengzhiwei
Assignee: pengzhiwei
Fix For: 0.7.0
Currently Hudi does not encode the partition path when the partitionPath value
contains "/". This leads to incorrect query results.
The following code triggers a query exception:
{code:java}
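// Assumes `spark`, `tableName` and `basePath` are already defined.
import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.config.HoodieWriteConfig._
import org.apache.spark.sql.SaveMode.Overwrite
import spark.implicits._

// Ten rows that all share a partition value containing "/".
val df = (for (i <- 0 until 10) yield (i, s"a$i", 10 + i, 100 * i + 10000, "12/20/1"))
  .toDF("id", "name", "price", "version", "pt")

df.write.format("hudi")
  .option(TABLE_NAME, tableName)
  .option(RECORDKEY_FIELD_OPT_KEY, "id")
  // "pt" is the partition column, as the description below implies.
  .option(PARTITIONPATH_FIELD_OPT_KEY, "pt")
  .option(PRECOMBINE_FIELD_OPT_KEY, "version")
  .option(OPERATION_OPT_KEY, UPSERT_OPERATION_OPT_VAL)
  .option(INSERT_PARALLELISM, "8")
  .option(UPSERT_PARALLELISM, "8")
  .mode(Overwrite)
  .save(basePath)

// Read back with a single-level partition glob, then count the rows.
val readDf = spark.read.format("hudi")
  .load(basePath + "/*/*")
readDf.createOrReplaceTempView("r1")
spark.sql("select count(1) from r1").show()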
{code}
The following exception is thrown:
{code:java}
Exception in thread "main" org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.;
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:190)
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:190)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:189)
  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:417)
{code}
The generated output directory for pt = '12/20/1' is a multi-level directory,
*${basePath}/12/20/1*. However, *pt* is a single-level partition, which causes
the query exception.
I tested a similar case using Parquet in spark-sql with pt = '12/20/1'; the
generated output directory is ${basePath}/12%2F20%2F1, i.e. the partition path
is encoded.
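For illustration, the kind of escaping described above could look roughly like the sketch below. It is not Hudi's or Spark's actual implementation: the object name, the function names, and the set of escaped characters are all assumptions made for this example.

```scala
object PartitionPathEscapeSketch {
  // Hypothetical, deliberately small set of characters to escape.
  // A real implementation would likely cover a longer list.
  private val charsToEscape: Set[Char] = Set('/', '%', ':', '=', '\\')

  /** Percent-encodes each character in `charsToEscape` as %XX (uppercase hex). */
  def escapePartitionValue(value: String): String =
    value.flatMap { c =>
      if (charsToEscape.contains(c)) f"%%${c.toInt}%02X" else c.toString
    }

  /**
   * Reverses escapePartitionValue by decoding every %XX sequence.
   * Throws NumberFormatException if the two characters after '%' are not hex.
   */
  def unescapePartitionValue(path: String): String = {
    val sb = new StringBuilder
    var i = 0
    while (i < path.length) {
      val c = path.charAt(i)
      if (c == '%' && i + 2 < path.length) {
        sb.append(Integer.parseInt(path.substring(i + 1, i + 3), 16).toChar)
        i += 3
      } else {
        sb.append(c)
        i += 1
      }
    }
    sb.toString
  }
}
```

With this sketch, escapePartitionValue("12/20/1") yields a single path segment, so the value maps to one directory level under ${basePath} instead of three, and unescapePartitionValue recovers the original value on read.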