[ 
https://issues.apache.org/jira/browse/HUDI-1484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

pengzhiwei updated HUDI-1484:
-----------------------------
    Description: 
Currently, Hudi does not encode the partition path when the partitionPath 
value contains "/". This leads to incorrect query results. 

The following code triggers a query exception:
{code:java}
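import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.config.HoodieWriteConfig._
import org.apache.spark.sql.SaveMode.Overwrite
import spark.implicits._

// tableName and basePath are defined elsewhere.
// Every row uses the partition value "12/20/1", which contains "/".
val df = (for (i <- 0 until 10) yield (i, s"a$i", 10 + i, 100 * i + 10000, "12/20/1"))
  .toDF("id", "name", "price", "version", "pt")

df.write.format("hudi")
  .option(TABLE_NAME, tableName)
  .option(RECORDKEY_FIELD_OPT_KEY, "id")
  .option(PRECOMBINE_FIELD_OPT_KEY, "version")
  .option(PARTITIONPATH_FIELD_OPT_KEY, "pt") // pt is the single-level partition field
  .option(OPERATION_OPT_KEY, UPSERT_OPERATION_OPT_VAL)
  .option(INSERT_PARALLELISM, "8")
  .option(UPSERT_PARALLELISM, "8")
  .mode(Overwrite)
  .save(basePath)

// The glob matches one partition directory level plus the files inside it.
val readDf = spark.read.format("hudi")
  .load(basePath + "/*/*")
readDf.createOrReplaceTempView("r1")
spark.sql("select count(1) from r1").show()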
{code}
The following exception is thrown:
{code:java}
Exception in thread "main" org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.;
 at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:190)
 at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:190)
 at scala.Option.getOrElse(Option.scala:121)
 at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:189)
 at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:417)
{code}
For pt = '12/20/1', the generated output directory is the multi-level 
directory *${basePath}/12/20/1*. However, *pt* is a single-level partition, 
which causes the query exception.

I tested a similar case using parquet in spark-sql with pt = '12/20/1'. The 
generated output directory is *${basePath}/12%2F20%2F1*, in which the 
partition path is encoded.
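
As a rough illustration of the encoding described above, the following Scala 
sketch percent-encodes a partition value so that it maps to a single directory 
name. The object name PartitionPathEscape, its escape and unescape helpers, and 
the set of characters treated as unsafe are all assumptions made for this 
example; this is not Hudi's actual implementation or the fix for this issue.

```scala
// Hypothetical helper, for illustration only.
object PartitionPathEscape {

  // Assumed set of characters that are unsafe inside one path segment.
  // '%' is included so that escaping can be reversed unambiguously.
  private val unsafeChars: Set[Char] =
    Set('/', '%', ':', '=', '\\', '#', '?', '*', '"')

  // Replace every unsafe or control character with %XX (two uppercase hex digits).
  def escape(value: String): String = {
    val sb = new StringBuilder
    value.foreach { c =>
      if (unsafeChars.contains(c) || c < ' ') {
        sb.append('%').append("%02X".format(c.toInt))
      } else {
        sb.append(c)
      }
    }
    sb.toString
  }

  // Reverse escape(): turn each %XX sequence back into its character.
  // A '%' that is not followed by two hex digits is kept as-is.
  def unescape(path: String): String = {
    val sb = new StringBuilder
    var i = 0
    while (i < path.length) {
      val hasTwoMore = i + 2 < path.length
      val hi = if (hasTwoMore) Character.digit(path.charAt(i + 1), 16) else -1
      val lo = if (hasTwoMore) Character.digit(path.charAt(i + 2), 16) else -1
      if (path.charAt(i) == '%' && hi >= 0 && lo >= 0) {
        sb.append((hi * 16 + lo).toChar)
        i += 3
      } else {
        sb.append(path.charAt(i))
        i += 1
      }
    }
    sb.toString
  }
}

// PartitionPathEscape.escape("12/20/1") returns "12%2F20%2F1"
```

With this kind of encoding, pt = '12/20/1' stays a single path segment 
(12%2F20%2F1), so the partition depth on disk matches the single-level 
partition that the reader expects.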

  was:
Currently, Hudi does not encode the partition path when the partitionPath 
value contains "/". This leads to incorrect query results. 

The following code triggers a query exception:
{code:java}
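import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.config.HoodieWriteConfig._
import org.apache.spark.sql.SaveMode.Overwrite
import spark.implicits._

// tableName and basePath are defined elsewhere.
// Every row uses the partition value "12/20/1", which contains "/".
val df = (for (i <- 0 until 10) yield (i, s"a$i", 10 + i, 100 * i + 10000, "12/20/1"))
  .toDF("id", "name", "price", "version", "pt")

df.write.format("hudi")
  .option(TABLE_NAME, tableName)
  .option(RECORDKEY_FIELD_OPT_KEY, "id")
  .option(PRECOMBINE_FIELD_OPT_KEY, "version")
  .option(PARTITIONPATH_FIELD_OPT_KEY, "pt") // pt is the single-level partition field
  .option(OPERATION_OPT_KEY, UPSERT_OPERATION_OPT_VAL)
  .option(INSERT_PARALLELISM, "8")
  .option(UPSERT_PARALLELISM, "8")
  .mode(Overwrite)
  .save(basePath)

// The glob matches one partition directory level plus the files inside it.
val readDf = spark.read.format("hudi")
  .load(basePath + "/*/*")
readDf.createOrReplaceTempView("r1")
spark.sql("select count(1) from r1").show()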
{code}
The following exception is thrown:
{code:java}
Exception in thread "main" org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.;
 at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:190)
 at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:190)
 at scala.Option.getOrElse(Option.scala:121)
 at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:189)
 at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:417)
{code}
For pt = '12/20/1', the generated output directory is the multi-level 
directory *${basePath}/12/20/1*. However, *pt* is a single-stage directory, 
which causes the query exception.

I tested a similar case using parquet in spark-sql with pt = '12/20/1'. The 
generated output directory is *${basePath}/12%2F20%2F1*, in which the 
partition path is encoded.


> Missing Encode the partition path when the value contains "/"
> -------------------------------------------------------------
>
>                 Key: HUDI-1484
>                 URL: https://issues.apache.org/jira/browse/HUDI-1484
>             Project: Apache Hudi
>          Issue Type: Improvement
>          Components: Writer Core
>            Reporter: pengzhiwei
>            Assignee: pengzhiwei
>            Priority: Major
>             Fix For: 0.7.0
>
>
> Currently, Hudi does not encode the partition path when the partitionPath 
> value contains "/". This leads to incorrect query results. 
> The following code triggers a query exception:
> {code:java}
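> import org.apache.hudi.DataSourceWriteOptions._
> import org.apache.hudi.config.HoodieWriteConfig._
> import org.apache.spark.sql.SaveMode.Overwrite
> import spark.implicits._
> // tableName and basePath are defined elsewhere.
> // Every row uses the partition value "12/20/1", which contains "/".
> val df = (for (i <- 0 until 10) yield (i, s"a$i", 10 + i, 100 * i + 10000, "12/20/1"))
>   .toDF("id", "name", "price", "version", "pt")
> df.write.format("hudi")
>   .option(TABLE_NAME, tableName)
>   .option(RECORDKEY_FIELD_OPT_KEY, "id")
>   .option(PRECOMBINE_FIELD_OPT_KEY, "version")
>   .option(PARTITIONPATH_FIELD_OPT_KEY, "pt") // pt is the single-level partition field
>   .option(OPERATION_OPT_KEY, UPSERT_OPERATION_OPT_VAL)
>   .option(INSERT_PARALLELISM, "8")
>   .option(UPSERT_PARALLELISM, "8")
>   .mode(Overwrite)
>   .save(basePath)
> // The glob matches one partition directory level plus the files inside it.
> val readDf = spark.read.format("hudi")
>   .load(basePath + "/*/*")
> readDf.createOrReplaceTempView("r1")
> spark.sql("select count(1) from r1").show()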
> {code}
> The following exception is thrown:
> {code:java}
> Exception in thread "main" org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.;
>  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:190)
>  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:190)
>  at scala.Option.getOrElse(Option.scala:121)
>  at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:189)
>  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:417)
> {code}
> For pt = '12/20/1', the generated output directory is the multi-level 
> directory *${basePath}/12/20/1*. However, *pt* is a single-level partition, 
> which causes the query exception.
> I tested a similar case using parquet in spark-sql with pt = '12/20/1'. The 
> generated output directory is *${basePath}/12%2F20%2F1*, in which the 
> partition path is encoded.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
