wombatu-kun opened a new issue, #11856:
URL: https://github.com/apache/hudi/issues/11856
**Describe the problem you faced**
When Spark reads a `COW` table with `hiveStylePartitioning` via glob paths
(the `path` contains `*`, or `hoodie.datasource.read.paths` is set) and the
partition values contain `/`, the loaded partition values are incomplete:
only the first path segment is read as the column value.
If the table is MOR, OR the partitioning is not hive style, OR no wildcard is
used while loading, OR the partition values don't contain slashes, everything
works correctly.
Do you have any ideas how to fix this? Or maybe Hudi has nothing to do with
it and something is wrong on the Spark side?
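The likely mechanism (an assumption based on the truncated output below, not a confirmed diagnosis) is that glob-based partition inference splits the relative path on `/` before parsing `key=value` pairs, so a hive-style path whose value itself contains slashes is broken into several directory segments and only the first one is recognized. A minimal plain-Scala illustration of that splitting:

```scala
// Hypothetical illustration of slash-based path splitting (assumption:
// the reader splits the relative partition path on '/' first, then parses
// each segment as key=value).
val relPath = "dat=2021/01/01"

// Splitting on '/' yields three segments; only the first looks like key=value.
val segments = relPath.split("/")        // Array("dat=2021", "01", "01")

// Parsing the first segment recovers only the year, matching the bad output.
val value = segments.head.split("=")(1)  // "2021"

println(segments.mkString(", "))
println(value)
```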
**To Reproduce**
1. Create a COW table with `hive_style_partitioning`.
2. Insert some records with `/` in the partition column values (e.g. `2024/08/29`).
3. Load data from the table using a wildcard (`load(s"$basePath/*/*/*")`).
4. Show or print the loaded data.
Here is the test case (`extends HoodieSparkSqlTestBase`):
```scala
test("Test partition incomplete") {
  withTempDir { tmp =>
    import org.apache.spark.sql.functions.col

    val tbName = "wk_date"
    val basePath = s"$tmp/$tbName"
    val columns = Seq("id", "driver", "precomb", "dat")
    val data = Seq(
      (1, "driver-A", 6, "2021/01/01"),
      (2, "driver-B", 7, "2021/01/02"),
      (3, "driver-C", 8, "2021/03/01"))
    val inserts = spark.createDataFrame(data).toDF(columns: _*)
    val hudiOptions = Map(
      "hoodie.table.name" -> tbName,
      "hoodie.datasource.write.table.type" -> "COPY_ON_WRITE",
      "hoodie.datasource.write.recordkey.field" -> "id",
      "hoodie.datasource.write.precombine.field" -> "precomb",
      "hoodie.datasource.write.partitionpath.field" -> "dat",
      "hoodie.datasource.write.hive_style_partitioning" -> "true"
    )
    inserts.write.format("hudi")
      .options(hudiOptions)
      .mode(org.apache.spark.sql.SaveMode.Overwrite)
      .save(basePath)

    val df = spark.read.format("hudi").load(s"$basePath/*/*/*")
    df.select((Seq("_hoodie_partition_path") ++ columns).map(col): _*).show()
  }
}
```
And here is its output (only the year of the `dat` values instead of the full
existing dates):
```
+----------------------+---+--------+-------+----+
|_hoodie_partition_path| id| driver|precomb| dat|
+----------------------+---+--------+-------+----+
| dat=2021/03/01| 3|driver-C| 8|2021|
| dat=2021/01/02| 2|driver-B| 7|2021|
| dat=2021/01/01| 1|driver-A| 6|2021|
+----------------------+---+--------+-------+----+
```
**Expected behavior**
Values in the partition column of the loaded data must match what was written
(2021/03/01, 2021/01/02, 2021/01/01).
**Environment Description**
* Hudi version : 0.15 and actual master
* Spark version : 3.3, 3.5
* Hive version :
* Hadoop version :
* Storage (HDFS/S3/GCS..) :
* Running on Docker? (yes/no) : no
**Additional context**
I've found two Jira tickets that may be related to this issue:
https://issues.apache.org/jira/browse/HUDI-7484 - Fix partitioning style
when partition is inferred from partitionBy
https://issues.apache.org/jira/browse/HUDI-7724 - Deprecate usage of
`glob.path` to simplify read path in Spark
But neither contains any details on how to solve them.
**Stacktrace**
No errors, just incorrect values in partition column.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]