AlexisBRENON commented on issue #14619:
URL: https://github.com/apache/arrow/issues/14619#issuecomment-2824815924
I ended up with this kind of solution:
```python
import datetime

import pandas as pd
from pyarrow import compute as pc, parquet as pq


def load_partitions(
    root_folder: str,
    data_interval_start: datetime.datetime,
    data_interval_end: datetime.datetime,
) -> pd.DataFrame:
    # Rebuild each partition's timestamp from its hive key columns, e.g.
    # year=2024/month=01/day=05/hour=10 -> "2024-01-05-10-+0000".
    partition_timestamp = pc.strptime(  # type: ignore
        pc.binary_join_element_wise(  # type: ignore
            pc.field("year"),
            pc.field("month"),
            pc.field("day"),
            pc.field("hour"),
            pc.scalar("+0000"),
            "-",
        ),
        format="%Y-%m-%d-%H-%z",
        unit="us",
    )
    # Load any partition overlapping the requested interval. All intervals
    # are half-open ([start, end[). A partition interval is [P, P+1H[.
    # It overlaps the requested interval [dis, die[ if:
    #   cond1: it is not before -> !(P+1H <= dis)
    #   cond2: it is not after  -> !(die <= P)
    # So: overlap = cond1 && cond2
    #             = !(P+1H <= dis) && !(die <= P)
    #             = !(P+1H <= dis || die <= P)  (De Morgan's law)
    #             = !(P <= dis - 1H || die <= P)
    # https://stackoverflow.com/a/325964
    partition_filters = ~(
        (partition_timestamp <= data_interval_start - datetime.timedelta(hours=1))
        | (data_interval_end <= partition_timestamp)
    )
    return (
        pq.ParquetDataset(root_folder, filters=partition_filters)
        .read_pandas()
        .to_pandas()
    )
```
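For illustration, here is how that helper might be called; the `load_partitions` name and the bucket path are placeholders, and the interval arguments follow the half-open convention from the comments above:
```python
import datetime

# Hypothetical invocation: read every hourly partition overlapping the
# half-open window [10:00, 12:00[ on 2024-01-01 (UTC).
df = load_partitions(
    "s3://my-bucket/my-table",  # placeholder root folder
    data_interval_start=datetime.datetime(2024, 1, 1, 10, tzinfo=datetime.timezone.utc),
    data_interval_end=datetime.datetime(2024, 1, 1, 12, tzinfo=datetime.timezone.utc),
)
```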
However, I remembered a performance issue on cloud blob stores (too many list operations or something like that) and finally rolled back to a path-generation routine:
```python
import datetime

import pandas as pd


def iter_partitions(root_folder, data_interval_start, data_interval_end):
    # Enumerate the hourly partition paths covering [start, end[ directly,
    # without listing the blob store.
    for ts in pd.date_range(
        data_interval_start,
        data_interval_end,
        freq="1H",
        inclusive="left",
    ):
        partition = f"year={ts.year}/month={ts.month:02d}/day={ts.day:02d}/hour={ts.hour:02d}/"
        yield pd.read_parquet(f"{root_folder}/{partition}")
```
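For completeness, a minimal sketch of consuming that generator, assuming it is wrapped as `iter_partitions` as above:
```python
import pandas as pd

# Stitch the per-partition frames back together; ignore_index resets the
# row index since each partition was read independently.
df = pd.concat(
    iter_partitions(root_folder, data_interval_start, data_interval_end),
    ignore_index=True,
)
```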
But this seems less generic and fails to handle other partition columns...