wombatu-kun opened a new issue, #11856:
URL: https://github.com/apache/hudi/issues/11856
**Describe the problem you faced**
When Spark reads a `COW` table with `hiveStylePartitioning` via glob paths
(the `path` contains `*`, or `hoodie.datasource.read.paths` is set) and the
partition values contain `/`, the loaded partition values are incomplete:
only the first path segment is read as the column value.
If the table is MOR, OR the partitioning is not hive style, OR no wildcard is
used while loading, OR the partition values don't contain slashes, everything
works correctly.
Do you have any ideas how to fix this? Or maybe Hudi has nothing to do with
it and something is wrong on the Spark side?
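The likely mechanism (an assumption based on the truncated output below, not a confirmed diagnosis) is that glob-based partition inference splits the relative path on `/` before parsing `key=value` pairs, so a hive-style path whose value itself contains slashes is broken into several directory segments and only the first one is recognized. A minimal plain-Scala illustration of that splitting:

```scala
// Hypothetical illustration of slash-based path splitting (assumption:
// the reader splits the relative partition path on '/' first, then parses
// each segment as key=value).
val relPath = "dat=2021/01/01"

// Splitting on '/' yields three segments; only the first looks like key=value.
val segments = relPath.split("/")        // Array("dat=2021", "01", "01")

// Parsing the first segment recovers only the year, matching the bad output.
val value = segments.head.split("=")(1)  // "2021"

println(segments.mkString(", "))
println(value)
```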
**To Reproduce**
1. Create a COW table with `hive_style_partitioning`.
2. Insert some records with `/` in the partition column values (e.g. `2024/08/29`).
3. Load data from the table using a wildcard (`load(s"$basePath/*/*/*")`).
4. Show or print the loaded data.
Here is the test case (`extends HoodieSparkSqlTestBase`):
```scala
test("Test partition incomplete") {
  withTempDir { tmp =>
    import org.apache.spark.sql.functions.col

    val tbName = "wk_date"
    val basePath = s"$tmp/$tbName"
    val columns = Seq("id", "driver", "precomb", "dat")
    val data = Seq(
      (1, "driver-A", 6, "2021/01/01"),
      (2, "driver-B", 7, "2021/01/02"),
      (3, "driver-C", 8, "2021/03/01"))
    val inserts = spark.createDataFrame(data).toDF(columns: _*)
    val hudiOptions = Map(
      "hoodie.table.name" -> tbName,
      "hoodie.datasource.write.table.type" -> "COPY_ON_WRITE",
      "hoodie.datasource.write.recordkey.field" -> "id",
      "hoodie.datasource.write.precombine.field" -> "precomb",
      "hoodie.datasource.write.partitionpath.field" -> "dat",
      "hoodie.datasource.write.hive_style_partitioning" -> "true"
    )
    inserts.write.format("hudi")
      .options(hudiOptions)
      .mode(org.apache.spark.sql.SaveMode.Overwrite)
      .save(basePath)

    val df = spark.read.format("hudi").load(s"$basePath/*/*/*")
    df.select((Seq("_hoodie_partition_path") ++ columns).map(col): _*).show()
  }
}
```
And here is its output (only the year of the `dat` values instead of the full
existing dates):
```
+----------------------+---+--------+-------+----+
|_hoodie_partition_path| id| driver|precomb| dat|
+----------------------+---+--------+-------+----+
| dat=2021/03/01| 3|driver-C| 8|2021|
| dat=2021/01/02| 2|driver-B| 7|2021|
| dat=2021/01/01| 1|driver-A| 6|2021|
+----------------------+---+--------+-------+----+
```
**Expected behavior**
Values in the partition column of the loaded data must match what was written
(2021/03/01, 2021/01/02, 2021/01/01).
**Environment Description**
* Hudi version : 0.15 and actual master
* Spark version : 3.3, 3.5
* Hive version :
* Hadoop version :
* Storage (HDFS/S3/GCS..) :
* Running on Docker? (yes/no) : no
**Additional context**
I've found two Jira tickets that may be related to this issue:
https://issues.apache.org/jira/browse/HUDI-7484 - Fix partitioning style
when partition is inferred from partitionBy
https://issues.apache.org/jira/browse/HUDI-7724 - Deprecate usage of
`glob.path` to simplify read path in Spark
But neither contains any details on how to solve them.
**Stacktrace**
No errors, just incorrect values in partition column.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]