[
https://issues.apache.org/jira/browse/HUDI-1484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
pengzhiwei updated HUDI-1484:
-----------------------------
Description:
Currently Hudi does not encode the partition path when the partitionPath value
contains "/". This leads to incorrect query results.
The following code triggers a query exception:
{code:java}
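import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.config.HoodieWriteConfig._
import org.apache.spark.sql.SaveMode.Overwrite
import spark.implicits._

// tableName and basePath are assumed to be defined by the caller.
// Every row carries the same partition value, which contains "/".
val df = (for (i <- 0 until 10) yield (i, s"a$i", 10 + i, 100 * i + 10000, "12/20/1"))
  .toDF("id", "name", "price", "version", "pt")

df.write.format("hudi")
  .option(TABLE_NAME, tableName)
  .option(RECORDKEY_FIELD_OPT_KEY, "id")
  // Assumption: pt is the partition field; the original snippet omitted this option.
  .option(PARTITIONPATH_FIELD_OPT_KEY, "pt")
  .option(PRECOMBINE_FIELD_OPT_KEY, "version")
  .option(OPERATION_OPT_KEY, UPSERT_OPERATION_OPT_VAL)
  .option(INSERT_PARALLELISM, "8")
  .option(UPSERT_PARALLELISM, "8")
  .mode(Overwrite)
  .save(basePath)

// The glob expects one partition directory level followed by the data files.
val readDf = spark.read.format("hudi")
  .load(basePath + "/*/*")
readDf.createOrReplaceTempView("r1")
spark.sql("select count(1) from r1").show()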
{code}
The following exception is thrown:
{code:java}
Exception in thread "main" org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.;
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:190)
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:190)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:189)
  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:417)
{code}
The output directory generated for pt = '12/20/1' is the multi-level directory
*${basePath}/12/20/1*. However, *pt* is a single-level partition, so the load
path used by the query does not reach the data files, which results in the
query exception.
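The mismatch can be illustrated by counting how many directory levels a partition value produces. This is a minimal sketch for illustration only; the class and method names are hypothetical and are not part of Hudi.

```java
import java.util.Arrays;

// Hypothetical helper, not part of Hudi: shows how many directory levels
// an unencoded partition value creates under the base path.
final class PartitionDepth {

    // Counts the non-empty path segments of a partition value split on "/".
    static int depth(String partitionValue) {
        return (int) Arrays.stream(partitionValue.split("/"))
                .filter(segment -> !segment.isEmpty())
                .count();
    }

    public static void main(String[] args) {
        // A value containing "/" spans several directory levels.
        System.out.println(depth("12/20/1"));     // prints 3
        // A value without "/" stays at one level, which is what the
        // glob basePath + "/*/*" in the example expects.
        System.out.println(depth("2020-12-20"));  // prints 1
    }
}
```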
I have tested a similar case using Parquet in spark-sql with pt = '12/20/1';
the generated output directory is *${basePath}/12%2F20%2F1*, in which the
partition path is encoded.
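The encoded form shown above can be reproduced with the JDK's URL encoder. This is only a sketch of the idea, assuming plain percent-encoding is acceptable; it is not necessarily what Hudi or Spark do internally, and the class and method names are hypothetical. Note that `URLEncoder` also rewrites other characters (for example, a space becomes `+`), so a real fix may need a narrower escaping rule.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Hypothetical helper, not part of Hudi: percent-encodes a partition value
// so that it maps to a single directory name.
final class PartitionPathEncoding {

    // Encodes a partition value; "/" becomes "%2F".
    static String encode(String partitionValue) {
        try {
            return URLEncoder.encode(partitionValue, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    // Restores the original partition value from a directory name.
    static String decode(String directoryName) {
        try {
            return URLDecoder.decode(directoryName, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    public static void main(String[] args) {
        String encoded = encode("12/20/1");
        System.out.println(encoded);          // prints 12%2F20%2F1
        System.out.println(decode(encoded));  // prints 12/20/1
    }
}
```

With the value encoded this way, the partition occupies exactly one directory level under the base path, so a single-level glob can reach the data files.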
> Missing Encode the partition path when the value contains "/"
> -------------------------------------------------------------
>
> Key: HUDI-1484
> URL: https://issues.apache.org/jira/browse/HUDI-1484
> Project: Apache Hudi
> Issue Type: Improvement
> Components: Writer Core
> Reporter: pengzhiwei
> Assignee: pengzhiwei
> Priority: Major
> Fix For: 0.7.0
--
This message was sent by Atlassian Jira
(v8.3.4#803005)