[
https://issues.apache.org/jira/browse/HUDI-1484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
pengzhiwei updated HUDI-1484:
-----------------------------
Description:
Currently Hudi URL-encodes the partition value when
URL_ENCODE_PARTITIONING_OPT_KEY is set to true. However, HiveSyncTool does not
decode the partition value when it syncs the partition to Hive, so Hive encodes
the value a second time. This leads to an exception when querying with Hive SQL
or Spark SQL.
For example, the partition *"2020/12/20"* is encoded to *"2020%2F12%2F20"* by
Hudi. When HiveSyncTool syncs *"2020%2F12%2F20"* to Hive, Hive encodes it again
to *"2020%252F12%252F20"*. As a result, a query such as "select xx from tbl
where dt = '2020/12/20'" fails with an exception.
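The double encoding described above can be reproduced with a minimal Java sketch. It uses java.net.URLEncoder and java.net.URLDecoder as stand-ins; the actual encoding routines inside Hudi and Hive are not shown in this report and may differ in detail. The class PartitionEncodingDemo and its helper methods are hypothetical names used only for illustration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Hypothetical demo class; not part of Hudi or Hive.
final class PartitionEncodingDemo {

    // Stand-in for an encoding step (assumption: plain URL encoding, UTF-8).
    static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always supported
        }
    }

    // Stand-in for the decode step that the sync path is missing.
    static String decode(String value) {
        try {
            return URLDecoder.decode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String raw = "2020/12/20";

        String encodedOnce = encode(raw);           // value written by the first encoder
        String encodedTwice = encode(encodedOnce);  // value after a second encoding pass
        String decodedFirst = encode(decode(encodedOnce)); // decode before re-encoding

        System.out.println(encodedOnce);   // 2020%2F12%2F20
        System.out.println(encodedTwice);  // 2020%252F12%252F20
        System.out.println(decodedFirst);  // 2020%2F12%2F20
    }
}
```

The second pass turns every "%" into "%25", so the stored value no longer matches what a filter on '2020/12/20' resolves to. Decoding before the value reaches the second encoder keeps a single level of encoding.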
was:
Currently Hudi does not encode the partition path when the partitionPath value
contains "/". This leads to incorrect query results.
The following code triggers the query exception:
{code:java}
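import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.config.HoodieWriteConfig._
import org.apache.spark.sql.SaveMode.Overwrite
import spark.implicits._

// Ten rows that all share the partition value "12/20/1", which contains "/".
val df = (for (i <- 0 until 10) yield (i, s"a$i", 10 + i, 100 * i + 10000, "12/20/1"))
  .toDF("id", "name", "price", "version", "pt")

df.write.format("hudi")
  .option(TABLE_NAME, tableName)
  .option(RECORDKEY_FIELD_OPT_KEY, "id")
  .option(PRECOMBINE_FIELD_OPT_KEY, "version")
  // Partition on pt, the column this report describes as the partition.
  .option(PARTITIONPATH_FIELD_OPT_KEY, "pt")
  .option(OPERATION_OPT_KEY, UPSERT_OPERATION_OPT_VAL)
  .option(INSERT_PARALLELISM, "8")
  .option(UPSERT_PARALLELISM, "8")
  .mode(Overwrite)
  .save(basePath)

// Read back, expecting a single partition level under basePath.
val readDf = spark.read.format("hudi")
  .load(basePath + "/*/*")
readDf.createOrReplaceTempView("r1")
spark.sql("select count(1) from r1").show()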
{code}
The following exception is thrown:
{code:java}
Exception in thread "main" org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.;
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:190)
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:190)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:189)
  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:417)
{code}
The output directory generated for pt = '12/20/1' is the multi-level directory
*${basePath}/12/20/1*. However, *pt* is a single-level partition, which causes
the query exception.
I have tested a similar case using Parquet in spark-sql with pt = '12/20/1';
the generated output directory is *${basePath}/12%2F20%2F1*, in which the
partition path is encoded.
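To make the directory-depth problem concrete, the sketch below counts how many directory levels a partition value produces with and without encoding. It is only an illustration: java.net.URLEncoder stands in for whatever encoding the writer applies, and PartitionDepthDemo and directoryLevels are hypothetical names.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical demo class; not part of Hudi or Spark.
final class PartitionDepthDemo {

    // Number of directory levels the value occupies when used as a relative path.
    static int directoryLevels(String partitionPath) {
        return partitionPath.split("/").length;
    }

    // Stand-in for partition path encoding (assumption: URL encoding, UTF-8).
    static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always supported
        }
    }

    public static void main(String[] args) {
        String pt = "12/20/1";

        // Raw value: each "/" starts a new directory level.
        System.out.println(pt + " -> " + directoryLevels(pt) + " levels");

        // Encoded value: "/" becomes "%2F", so the value stays in one level.
        String encoded = encode(pt);
        System.out.println(encoded + " -> " + directoryLevels(encoded) + " level");
    }
}
```

With the raw value the data files sit three directories below the base path, so a load pattern of basePath + "/*/*", which allows for one partition level, does not reach them.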
> Escape the partition value in HiveSyncTool
> -------------------------------------------
>
> Key: HUDI-1484
> URL: https://issues.apache.org/jira/browse/HUDI-1484
> Project: Apache Hudi
> Issue Type: Bug
> Components: Writer Core
> Reporter: pengzhiwei
> Assignee: pengzhiwei
> Priority: Major
> Fix For: 0.7.0
>
>
> Currently Hudi URL-encodes the partition value when
> URL_ENCODE_PARTITIONING_OPT_KEY is set to true. However, HiveSyncTool does not
> decode the partition value when it syncs the partition to Hive, so Hive encodes
> the value a second time. This leads to an exception when querying with Hive SQL
> or Spark SQL.
> For example, the partition *"2020/12/20"* is encoded to *"2020%2F12%2F20"* by
> Hudi. When HiveSyncTool syncs *"2020%2F12%2F20"* to Hive, Hive encodes it again
> to *"2020%252F12%252F20"*. As a result, a query such as "select xx from tbl
> where dt = '2020/12/20'" fails with an exception.
>