eshu opened a new issue, #10754:
URL: https://github.com/apache/hudi/issues/10754
When the partition column contains the slash character ("/"), Hudi may
write the data incorrectly or fail to read it back.
Test (I use some helpers to write and read Hudi data; they write the data
to the local FS and read it back):
```scala
import java.io.{File, FilenameFilter}

import org.scalatest.flatspec.AnyFlatSpec
import org.scalatest.matchers.should.Matchers

// rows, ts"...", schema, createHudiDataset and readHudiDataset are the
// helpers mentioned above (presumably provided by the TestHudi trait, not shown).
class HudiPartitionPathTest extends AnyFlatSpec with Matchers with TestHudi {
  "Partition paths" should "be generated properly" in {
    val data = rows(
      (1, "one", "partition with space", ts"2024-02-26 08:25:05"),
      (2, "two", "partition with space", ts"2024-02-26 08:25:05"),
      (3, "three", "partition-with-dashes", ts"2024-02-26 08:25:05"),
      (4, "four", "partition-with-dashes", ts"2024-02-26 08:25:05"),
      (5, "five", "partition=", ts"2024-02-26 08:25:05"),
      (6, "six", "partition=", ts"2024-02-26 08:25:05"),
      (7, "seven", "partition%", ts"2024-02-26 08:25:05"),
      (8, "eight", "partition%", ts"2024-02-26 08:25:05"),
      (9, "nine", "partition/", ts"2024-02-26 08:25:05"),
      (10, "ten", "partition/", ts"2024-02-26 08:25:05"),
      (11, "eleven", "partition/slaanesh", ts"2024-02-26 08:25:05"),
      (12, "twelve", "partition/slaanesh", ts"2024-02-26 08:25:05")
    )
    val path = createHudiDataset(
      getClass.getName,
      data,
      schema("id" -> "int", "value" -> "string", "partition" -> "string")
    )
    // Partition paths as they exist on disk, relative to the table base path.
    val fsPartitionPaths = allFSPartitionPaths(new File(path), path.length + 1)
    println(fsPartitionPaths mkString "\n")
    val df = readHudiDataset(path)
    df show false
    // Partition paths as recorded in the Hudi metadata column.
    val partitionPaths =
      df.select("_hoodie_partition_path").dropDuplicates.collect().map(_.getString(0)).toSet
    fsPartitionPaths shouldEqual partitionPaths
  }

  // Skip hidden entries such as the .hoodie directory.
  private val filter: FilenameFilter = (_, name) => !name.startsWith(".")

  // Recursively collect the parent directory of every data file.
  def allFSPartitionPaths(dir: File, prefixLength: Int): Set[String] =
    (dir.listFiles(filter) foldLeft Set.empty[String]) { (paths, file) =>
      if (file.isFile) paths + file.getParent.substring(prefixLength)
      else paths | allFSPartitionPaths(file, prefixLength)
    }
}
```
The output is
```
daas_date=partition
daas_date=partition-with-dashes
daas_date=partition with space
daas_date=partition%
daas_date=partition/slaanesh
daas_date=partition=
+-------------------+---------------------+------------------+-------------------------------+-----------------------------------------------------------------------+---+-----+---------------------+-------------------+
|_hoodie_commit_time|_hoodie_commit_seqno |_hoodie_record_key|_hoodie_partition_path         |_hoodie_file_name                                                      |id |value|daas_date            |daas_internal_ts   |
+-------------------+---------------------+------------------+-------------------------------+-----------------------------------------------------------------------+---+-----+---------------------+-------------------+
|20240226102735752  |20240226102735752_5_0|9                 |daas_date=partition/           |607b4c16-93c8-4a1f-9530-f4b6be57bc9c-0_5-4-14_20240226102735752.parquet|9  |nine |partition/           |2024-02-26 08:25:05|
|20240226102735752  |20240226102735752_5_1|10                |daas_date=partition/           |607b4c16-93c8-4a1f-9530-f4b6be57bc9c-0_5-4-14_20240226102735752.parquet|10 |ten  |partition/           |2024-02-26 08:25:05|
|20240226102735752  |20240226102735752_1_0|3                 |daas_date=partition-with-dashes|641c2e87-276f-48eb-9a1c-eac63fed00e2-0_1-4-10_20240226102735752.parquet|3  |three|partition-with-dashes|2024-02-26 08:25:05|
|20240226102735752  |20240226102735752_1_1|4                 |daas_date=partition-with-dashes|641c2e87-276f-48eb-9a1c-eac63fed00e2-0_1-4-10_20240226102735752.parquet|4  |four |partition-with-dashes|2024-02-26 08:25:05|
|20240226102735752  |20240226102735752_2_0|1                 |daas_date=partition with space |b94de450-3d40-490b-bbc7-7b1d15e5edef-0_2-4-11_20240226102735752.parquet|1  |one  |partition with space |2024-02-26 08:25:05|
|20240226102735752  |20240226102735752_2_1|2                 |daas_date=partition with space |b94de450-3d40-490b-bbc7-7b1d15e5edef-0_2-4-11_20240226102735752.parquet|2  |two  |partition with space |2024-02-26 08:25:05|
|20240226102735752  |20240226102735752_0_0|7                 |daas_date=partition%           |2101cdfe-74b3-4268-8504-c36ab3b59f89-0_0-4-9_20240226102735752.parquet |7  |seven|partition%           |2024-02-26 08:25:05|
|20240226102735752  |20240226102735752_0_1|8                 |daas_date=partition%           |2101cdfe-74b3-4268-8504-c36ab3b59f89-0_0-4-9_20240226102735752.parquet |8  |eight|partition%           |2024-02-26 08:25:05|
|20240226102735752  |20240226102735752_4_0|5                 |daas_date=partition=           |7f548d14-a985-42ba-82d2-fca4784c8906-0_4-4-13_20240226102735752.parquet|5  |five |partition=           |2024-02-26 08:25:05|
|20240226102735752  |20240226102735752_4_1|6                 |daas_date=partition=           |7f548d14-a985-42ba-82d2-fca4784c8906-0_4-4-13_20240226102735752.parquet|6  |six  |partition=           |2024-02-26 08:25:05|
+-------------------+---------------------+------------------+-------------------------------+-----------------------------------------------------------------------+---+-----+---------------------+-------------------+
```
As you can see, rows 11 and 12 (partition value "partition/slaanesh") were
written to a nested directory but were not read back, and "partition" and
"partition/" map to the same path on the file system (I am not sure about
the impact, but there could be performance issues).
Perhaps some characters in partition paths should be escaped?
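
To illustrate the kind of escaping meant here, below is a minimal sketch that URL-encodes a partition value before it becomes a directory name. `PartitionPathEscaping` and its methods are hypothetical names made up for this example, not Hudi APIs, and URL encoding is only one possible scheme. Hudi does have a write option `hoodie.datasource.write.partitionpath.urlencode`, but I have not verified whether it resolves this case on 0.13.1.

```scala
import java.net.{URLDecoder, URLEncoder}
import java.nio.charset.StandardCharsets.UTF_8

// Hypothetical helper for illustration only; not part of Hudi.
object PartitionPathEscaping {
  // Encode a raw partition value so it is safe as a single path segment:
  // "/" becomes "%2F", "=" becomes "%3D", "%" becomes "%25", " " becomes "+".
  def escape(value: String): String =
    URLEncoder.encode(value, UTF_8.name())

  // Recover the raw partition value from an escaped path segment.
  def unescape(segment: String): String =
    URLDecoder.decode(segment, UTF_8.name())

  // Build a Hive-style "column=value" segment from a raw value.
  def partitionPath(column: String, value: String): String =
    s"$column=${escape(value)}"
}
```

With this scheme, `partitionPath("daas_date", "partition/slaanesh")` yields `daas_date=partition%2Fslaanesh`, a single directory level, and "partition" and "partition/" no longer collapse to the same path (`daas_date=partition` versus `daas_date=partition%2F`).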
**Environment Description**
* Hudi version : 0.13.1
* Storage (HDFS/S3/GCS..) : Local FS