pengzhiwei created HUDI-1484:
--------------------------------

             Summary: Missing Encode the partition path when the value contains 
"/"
                 Key: HUDI-1484
                 URL: https://issues.apache.org/jira/browse/HUDI-1484
             Project: Apache Hudi
          Issue Type: Improvement
          Components: Writer Core
            Reporter: pengzhiwei
            Assignee: pengzhiwei
             Fix For: 0.7.0


Currently, Hudi does not encode the partition path when the partitionPath value 
contains "/". This leads to incorrect query results. 

The following code triggers a query exception:
{code:java}
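import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.config.HoodieWriteConfig._
import org.apache.spark.sql.SaveMode._
import spark.implicits._

// tableName and basePath are assumed to be defined by the caller.
// Every row carries a partition value that contains "/".
val df = (for (i <- 0 until 10) yield (i, s"a$i", 10 + i, 100 * i + 10000, "12/20/1"))
  .toDF("id", "name", "price", "version", "pt")

df.write.format("hudi")
  .option(TABLE_NAME, tableName)
  .option(RECORDKEY_FIELD_OPT_KEY, "id")
  .option(PRECOMBINE_FIELD_OPT_KEY, "version")
  // Assumption: the report implies that pt is the partition field.
  .option(PARTITIONPATH_FIELD_OPT_KEY, "pt")
  .option(OPERATION_OPT_KEY, UPSERT_OPERATION_OPT_VAL)
  .option(INSERT_PARALLELISM, "8")
  .option(UPSERT_PARALLELISM, "8")
  .mode(Overwrite)
  .save(basePath)

// Read back with a glob that expects a single partition level.
val readDf = spark.read.format("hudi")
  .load(basePath + "/*/*")
readDf.createOrReplaceTempView("r1")
spark.sql("select count(1) from r1").show()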
{code}
The following exception is thrown:
{code:java}
Exception in thread "main" org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.;
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:190)
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:190)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:189)
  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:417)
{code}
The generated output directory for pt = '12/20/1' is the multi-level directory 
*${basePath}/12/20/1*. However, *pt* is a single-level partition, which causes 
the query exception.

I have tested a similar case using Parquet in Spark SQL with pt = '12/20/1'. 
The generated output directory is ${basePath}/12%2F20%2F1, so the partition 
path is encoded there.
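
As a rough illustration of the encoding described above, the sketch below 
percent-encodes a partition value so that "/" cannot introduce extra directory 
levels. The PartitionPathEncoding object and its encode/decode helpers are 
hypothetical names used only for this example; this is not Hudi's or Spark's 
actual implementation, and it is written to be pasted into spark-shell or a 
Scala REPL.
{code:java}
import java.net.{URLDecoder, URLEncoder}
import java.nio.charset.StandardCharsets

// Hypothetical helper for illustration only.
object PartitionPathEncoding {
  private val Utf8 = StandardCharsets.UTF_8.name()

  // Percent-encode the partition value so it maps to a single directory name.
  def encode(partitionValue: String): String =
    URLEncoder.encode(partitionValue, Utf8)

  // Recover the original partition value from an encoded directory name.
  def decode(dirName: String): String =
    URLDecoder.decode(dirName, Utf8)
}

val encoded = PartitionPathEncoding.encode("12/20/1")
println(encoded)                               // 12%2F20%2F1
println(PartitionPathEncoding.decode(encoded)) // 12/20/1
{code}
Note that URLEncoder follows form-encoding rules (for example, a space becomes 
"+"), so an actual fix may need a different escaping scheme to stay compatible 
with the directory names that Spark produces.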



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
