haggy commented on issue #5211:
URL: https://github.com/apache/hudi/issues/5211#issuecomment-1208430291
> The prior wildcard pattern is not meant to be used for a subset of
> partitions in a Hudi table. Instead of globbing selective paths, you can still
> load the Hudi table with the base path and use `.filter` with the partition
> values so Spark does the partition pruning without scanning the whole table.
@yihua I was not able to reproduce this behavior in our data lake tables.
Our tables are partitioned by `year/month/day/hour`. When loading the table
from its base path and filtering on partition columns, Hudi first performs a
full file listing of the entire table for metadata (our table is huge, so the
listing never finishes). The only way I was able to run a query that prunes
partitions in Hudi 0.10.1 was the following (PySpark):
```python
read_paths = [
    f"{datalake_url}/year=2022/month=7/*/*",
]
hudi_opts = {
    "hoodie.datasource.query.type": "snapshot",
    # Disable the Hudi file index so the explicit read paths are used
    "hoodie.file.index.enable": False,
    "hoodie.datasource.read.paths": ",".join(read_paths),
}
df = (
    spark.read
    .format("hudi")
    .options(**hudi_opts)
    .load()
)
df.count()
```
This resulted in loading only the partitions for `year=2022` and `month=7`.
In Hudi `0.9.x` we were able to use Hive-style partition glob paths, but that
went away in `0.10.x` due to changes in glob-path parsing:
`f"{datalake_url}/year=2022/month=7/day=*/hour=*"`
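For contrast, the base-path plus `.filter` approach suggested above would look
roughly like the sketch below. The helper name `load_pruned` and the literal
filter values are illustrative; the column names follow the
`year/month/day/hour` partitioning described in this thread.

```python
# Sketch of the recommended approach: load from the base path and let
# Spark prune partitions via .filter (assumes partition columns
# year/month as in this thread; adjust for your table).
def load_pruned(spark, base_path):
    """Load a Hudi table and restrict it to year=2022/month=7 via filters."""
    return (
        spark.read
        .format("hudi")
        .load(base_path)  # base path only, no globs
        .filter("year = 2022 AND month = 7")
    )
```

In our case this variant triggered the full-table listing described above, so
it is shown here only to make the comparison concrete.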
@kartik18 Could you try a couple of things to confirm whether you see the
same behavior?
## Load DF with file index disabled
```python
read_paths = [
    f"{datalake_url}/cluster=abc/*",
    f"{datalake_url}/cluster=def/*",
]
hudi_opts = {
    "hoodie.datasource.query.type": "snapshot",
    # File index disabled: Hudi should honor the explicit read paths
    "hoodie.file.index.enable": False,
    "hoodie.datasource.read.paths": ",".join(read_paths),
}
df = (
    spark.read
    .format("hudi")
    .options(**hudi_opts)
    .load()
)
df.count()
```
## Load DF with file index enabled
```python
read_paths = [
    f"{datalake_url}/cluster=abc/*",
    f"{datalake_url}/cluster=def/*",
]
hudi_opts = {
    # File index enabled (default): only the read-paths option differs
    "hoodie.datasource.query.type": "snapshot",
    "hoodie.datasource.read.paths": ",".join(read_paths),
}
df = (
    spark.read
    .format("hudi")
    .options(**hudi_opts)
    .load()
)
df.count()
```
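To tell the two runs apart, it may help to look at which files Spark actually
plans to scan rather than just the row count. The helper below is a
hypothetical convenience (not part of Hudi); it only uses the standard
`DataFrame.inputFiles()` API.

```python
# Hypothetical helper to compare the two experiments above: report how
# many files Spark will read for a DataFrame, so you can see whether the
# read-paths restriction (or partition pruning) took effect.
def summarize_scan(df):
    """Return the input-file count and a few sample paths for `df`."""
    files = df.inputFiles()  # files Spark plans to scan for this DataFrame
    return len(files), files[:5]
```

Running `summarize_scan(df)` after each variant and comparing the counts
should show whether the file-index setting changes what gets listed.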