eshu opened a new issue, #10754:
URL: https://github.com/apache/hudi/issues/10754

   When the partition column contains the slash character ("/"), Hudi may write the data incorrectly or fail to read it back.
   
   Test (I use some helpers to write and read Hudi data; they write the data to the local FS and read it back):
   ```scala
   class HudiPartitionPathTest extends AnyFlatSpec with Matchers with TestHudi {
     "Partition paths" should "be generated properly" in {
       val data = rows(
         (1, "one", "partition with space", ts"2024-02-26 08:25:05"),
         (2, "two", "partition with space", ts"2024-02-26 08:25:05"),
         (3, "three", "partition-with-dashes", ts"2024-02-26 08:25:05"),
         (4, "four", "partition-with-dashes", ts"2024-02-26 08:25:05"),
         (5, "five", "partition=", ts"2024-02-26 08:25:05"),
         (6, "six", "partition=", ts"2024-02-26 08:25:05"),
         (7, "seven", "partition%", ts"2024-02-26 08:25:05"),
         (8, "eight", "partition%", ts"2024-02-26 08:25:05"),
         (9, "nine", "partition/", ts"2024-02-26 08:25:05"),
         (10, "ten", "partition/", ts"2024-02-26 08:25:05"),
         (11, "eleven", "partition/slaanesh", ts"2024-02-26 08:25:05"),
         (12, "twelve", "partition/slaanesh", ts"2024-02-26 08:25:05")
       )
       val path = createHudiDataset(
         getClass.getName,
         data,
         schema("id" -> "int", "value" -> "string", "partition" -> "string")
       )
       val fsPartitionPaths = allFSPartitionPaths(new File(path), path.length + 1)
       println(fsPartitionPaths mkString "\n")
       val df = readHudiDataset(path)
       df show false
       val partitionPaths = df.select("_hoodie_partition_path").dropDuplicates.collect().map(_.getString(0)).toSet
       fsPartitionPaths shouldEqual partitionPaths
     }
   
     private val filter: FilenameFilter = (_, name) => !name.startsWith(".")
   
     def allFSPartitionPaths(dir: File, prefixLength: Int): Set[String] =
       (dir.listFiles(filter) foldLeft Set.empty[String]) { (paths, file) =>
         if (file.isFile) paths + file.getParent.substring(prefixLength)
         else paths | allFSPartitionPaths(file, prefixLength)
       }
   }
   ```
   
   The output is
   ```
   daas_date=partition
   daas_date=partition-with-dashes
   daas_date=partition with space
   daas_date=partition%
   daas_date=partition/slaanesh
   daas_date=partition=
   
   +-------------------+---------------------+------------------+-------------------------------+-----------------------------------------------------------------------+---+-----+---------------------+-------------------+
   |_hoodie_commit_time|_hoodie_commit_seqno |_hoodie_record_key|_hoodie_partition_path         |_hoodie_file_name                                                      |id |value|daas_date            |daas_internal_ts   |
   +-------------------+---------------------+------------------+-------------------------------+-----------------------------------------------------------------------+---+-----+---------------------+-------------------+
   |20240226102735752  |20240226102735752_5_0|9                 |daas_date=partition/           |607b4c16-93c8-4a1f-9530-f4b6be57bc9c-0_5-4-14_20240226102735752.parquet|9  |nine |partition/           |2024-02-26 08:25:05|
   |20240226102735752  |20240226102735752_5_1|10                |daas_date=partition/           |607b4c16-93c8-4a1f-9530-f4b6be57bc9c-0_5-4-14_20240226102735752.parquet|10 |ten  |partition/           |2024-02-26 08:25:05|
   |20240226102735752  |20240226102735752_1_0|3                 |daas_date=partition-with-dashes|641c2e87-276f-48eb-9a1c-eac63fed00e2-0_1-4-10_20240226102735752.parquet|3  |three|partition-with-dashes|2024-02-26 08:25:05|
   |20240226102735752  |20240226102735752_1_1|4                 |daas_date=partition-with-dashes|641c2e87-276f-48eb-9a1c-eac63fed00e2-0_1-4-10_20240226102735752.parquet|4  |four |partition-with-dashes|2024-02-26 08:25:05|
   |20240226102735752  |20240226102735752_2_0|1                 |daas_date=partition with space |b94de450-3d40-490b-bbc7-7b1d15e5edef-0_2-4-11_20240226102735752.parquet|1  |one  |partition with space |2024-02-26 08:25:05|
   |20240226102735752  |20240226102735752_2_1|2                 |daas_date=partition with space |b94de450-3d40-490b-bbc7-7b1d15e5edef-0_2-4-11_20240226102735752.parquet|2  |two  |partition with space |2024-02-26 08:25:05|
   |20240226102735752  |20240226102735752_0_0|7                 |daas_date=partition%           |2101cdfe-74b3-4268-8504-c36ab3b59f89-0_0-4-9_20240226102735752.parquet |7  |seven|partition%           |2024-02-26 08:25:05|
   |20240226102735752  |20240226102735752_0_1|8                 |daas_date=partition%           |2101cdfe-74b3-4268-8504-c36ab3b59f89-0_0-4-9_20240226102735752.parquet |8  |eight|partition%           |2024-02-26 08:25:05|
   |20240226102735752  |20240226102735752_4_0|5                 |daas_date=partition=           |7f548d14-a985-42ba-82d2-fca4784c8906-0_4-4-13_20240226102735752.parquet|5  |five |partition=           |2024-02-26 08:25:05|
   |20240226102735752  |20240226102735752_4_1|6                 |daas_date=partition=           |7f548d14-a985-42ba-82d2-fca4784c8906-0_4-4-13_20240226102735752.parquet|6  |six  |partition=           |2024-02-26 08:25:05|
   +-------------------+---------------------+------------------+-------------------------------+-----------------------------------------------------------------------+---+-----+---------------------+-------------------+
   ```
   As you can see, rows 11 and 12 were not read, and "partition" and "partition/" have the same path on the file system (I am not sure about the impact, but there could be performance issues).
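   
   A minimal sketch (plain JVM, no Hudi involved) of why "partition" and "partition/" collide: `java.io.File` normalizes away a trailing separator, so both values resolve to the same directory on the local FS.
   ```scala
   import java.io.File
   
   object TrailingSlashDemo extends App {
     // The trailing "/" in the partition value becomes a trailing
     // path separator, which File normalizes away.
     val withSlash    = new File("/tmp/table/daas_date=partition/")
     val withoutSlash = new File("/tmp/table/daas_date=partition")
     println(withSlash.getPath == withoutSlash.getPath) // true
   }
   ```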
   
   Maybe it would be good to quote (percent-encode) such characters in partition paths?
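   
   A sketch of the idea (my own helpers, not Hudi's API): percent-encode the partition value before it is used as a directory name, and decode it when reading back, so distinct values stay distinct on disk.
   ```scala
   import java.net.{URLDecoder, URLEncoder}
   
   // Hypothetical helpers illustrating the suggestion.
   def encodePartitionValue(v: String): String =
     URLEncoder.encode(v, "UTF-8") // "/" -> "%2F", "=" -> "%3D", " " -> "+"
   
   def decodePartitionValue(v: String): String =
     URLDecoder.decode(v, "UTF-8")
   
   // encodePartitionValue("partition/")         == "partition%2F"
   // encodePartitionValue("partition/slaanesh") == "partition%2Fslaanesh"
   ```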
   
   **Environment Description**
   
   * Hudi version :
   0.13.1
   
   * Storage (HDFS/S3/GCS..):
   Local FS
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
