[jira] [Updated] (HUDI-1484) Escape the partition value in HiveSyncTool

pengzhiwei (Jira) Mon, 21 Dec 2020 18:35:09 -0800


     [ 
https://issues.apache.org/jira/browse/HUDI-1484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


pengzhiwei updated HUDI-1484:
-----------------------------
    Description: 
Currently Hudi encodes the partition value when 
URL_ENCODE_PARTITIONING_OPT_KEY is set to true. However, HiveSyncTool does not 
decode the partition value when it syncs partitions to Hive, so Hive encodes 
the partition value a second time, which leads to an exception when querying 
with Hive SQL or Spark SQL.

For example, the partition *"2020/12/20"* is encoded to *"2020%2F12%2F20"* by 
Hudi. When HiveSyncTool syncs *"2020%2F12%2F20"* to Hive, Hive encodes it again 
to *"2020%252F12%252F20"*. This results in a query exception for "select xx from 
tbl where dt = '2020/12/20'".
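
The double encoding described above can be reproduced outside Hudi and Hive. 
The following is only a minimal sketch: it uses java.net.URLEncoder as a 
stand-in for both Hudi's URL encoding and Hive's partition escaping, which is 
an assumption, and the object and method names are hypothetical. It also shows 
the direction of the fix, decoding the value before it is handed to Hive.

```scala
import java.net.{URLDecoder, URLEncoder}
import java.nio.charset.StandardCharsets

// Hypothetical helper object; not part of Hudi or Hive.
object PartitionEncodingSketch {
  private val Utf8 = StandardCharsets.UTF_8.name()

  // Stand-in for the encoding applied by Hudi and, again, by Hive (assumption).
  def encode(value: String): String = URLEncoder.encode(value, Utf8)

  def decode(value: String): String = URLDecoder.decode(value, Utf8)

  def main(args: Array[String]): Unit = {
    val raw = "2020/12/20"
    val encodedOnce = encode(raw)           // value written by Hudi
    val encodedTwice = encode(encodedOnce)  // value after syncing without decoding
    val syncedAfterDecode = encode(decode(encodedOnce)) // value if decoded before sync

    println(encodedOnce)       // 2020%2F12%2F20
    println(encodedTwice)      // 2020%252F12%252F20
    println(syncedAfterDecode) // 2020%2F12%2F20
  }
}
```

With the value decoded before the sync, the partition is encoded only once, so 
a filter such as dt = '2020/12/20' resolves to the value that was actually 
stored.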

 

  was:
Currently Hudi does not encode the partition path when the partitionPath value 
contains "/". This leads to incorrect query results. 

Here is code that reproduces the query exception:
{code:java}
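// Imports assumed for the Hudi version in use at the time of this report.
import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.config.HoodieWriteConfig._
import org.apache.spark.sql.SaveMode.Overwrite
import spark.implicits._

// Ten rows that all share the partition value "12/20/1", which contains "/".
val df = (for (i <- 0 until 10) yield (i, s"a$i", 10 + i, 100 * i + 10000, "12/20/1"))
  .toDF("id", "name", "price", "version", "pt")

df.write.format("hudi")
  .option(TABLE_NAME, tableName)
  .option(RECORDKEY_FIELD_OPT_KEY, "id")
  .option(PRECOMBINE_FIELD_OPT_KEY, "version")
  // Assumed: the original snippet omits this option, but the report treats pt
  // as the partition column.
  .option(PARTITIONPATH_FIELD_OPT_KEY, "pt")
  .option(OPERATION_OPT_KEY, UPSERT_OPERATION_OPT_VAL)
  .option(INSERT_PARALLELISM, "8")
  .option(UPSERT_PARALLELISM, "8")
  .mode(Overwrite)
  .save(basePath)

// Read back with a glob that expects a single partition level under basePath.
val readDf = spark.read.format("hudi")
  .load(basePath + "/*/*")
readDf.createOrReplaceTempView("r1")
spark.sql("select count(1) from r1").show()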
{code}
The following exception is thrown:
{code:java}
Exception in thread "main" org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.;
 at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:190)
 at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:190)
 at scala.Option.getOrElse(Option.scala:121)
 at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:189)
 at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:417)
{code}
The output directory generated for pt = '12/20/1' is the multi-level directory 
*${basePath}/12/20/1*. However, *pt* is a single-level partition, which results 
in the query exception.

I tested a similar case using Parquet in spark-sql with pt = '12/20/1'; 
the generated output directory is *${basePath}/12%2F20%2F1*, in which 
the partition path is encoded.
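
The difference between the two layouts can be illustrated by counting the 
directory levels that a partition value adds under the base path. This is a 
rough sketch with hypothetical names, and it again assumes java.net.URLEncoder 
as the encoding; it does not call any Hudi or Spark API.

```scala
import java.net.URLEncoder
import java.nio.charset.StandardCharsets

// Hypothetical helper object; not part of Hudi or Spark.
object PartitionDirSketch {
  // Number of directory levels the partition value creates under basePath.
  def levels(partitionValue: String): Int =
    partitionValue.split("/").count(_.nonEmpty)

  // URL encoding assumed as the way the partition value is escaped.
  def encode(value: String): String =
    URLEncoder.encode(value, StandardCharsets.UTF_8.name())

  def main(args: Array[String]): Unit = {
    val pt = "12/20/1"
    println(levels(pt))         // 3: basePath/12/20/1
    println(levels(encode(pt))) // 1: basePath/12%2F20%2F1
  }
}
```

A reader that expects one partition level, such as the basePath + "/*/*" glob 
in the snippet above, presumably finds only directories rather than data files 
in the three-level layout, which is consistent with the schema inference error 
shown earlier.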


> Escape the partition value  in HiveSyncTool
> -------------------------------------------
>
>                 Key: HUDI-1484
>                 URL: https://issues.apache.org/jira/browse/HUDI-1484
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: Writer Core
>            Reporter: pengzhiwei
>            Assignee: pengzhiwei
>            Priority: Major
>             Fix For: 0.7.0
>
>
> Currently Hudi encodes the partition value when 
> URL_ENCODE_PARTITIONING_OPT_KEY is set to true. However, HiveSyncTool does not 
> decode the partition value when it syncs partitions to Hive, so Hive encodes 
> the partition value a second time, which leads to an exception when querying 
> with Hive SQL or Spark SQL.
> For example, the partition *"2020/12/20"* is encoded to *"2020%2F12%2F20"* by 
> Hudi. When HiveSyncTool syncs *"2020%2F12%2F20"* to Hive, Hive encodes it again 
> to *"2020%252F12%252F20"*. This results in a query exception for "select xx 
> from tbl where dt = '2020/12/20'".
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
