AlexisBRENON commented on issue #14619: URL: https://github.com/apache/arrow/issues/14619#issuecomment-2824815924
I ended up with this kind of solution:

```python
import datetime

import pandas as pd
from pyarrow import compute as pc, parquet as pq

# data_interval_start, data_interval_end (datetime.datetime) and
# root_folder (str) are provided by the enclosing function.

partition_timestamp = pc.strptime(  # type: ignore
    pc.binary_join_element_wise(  # type: ignore
        pc.field("year"),
        pc.field("month"),
        pc.field("day"),
        pc.field("hour"),
        pc.scalar("+0000"),
        "-",
    ),
    format="%Y-%m-%d-%H-%z",
    unit="us",
)
# Load any partition overlapping the requested interval. All intervals are half-open ([start, end[).
# A partition interval is [P, P+1H[. It overlaps the requested interval [dis, die[ if:
#   cond1: it is not before -> !(P+1H <= dis)
#   cond2: it is not after  -> !(die <= P)
# So: overlap = cond1 && cond2
#             = !(P+1H <= dis) && !(die <= P)
#             = !(P+1H <= dis || die <= P)    (De Morgan's law)
#             = !(P <= dis - 1H || die <= P)
# https://stackoverflow.com/a/325964
partition_filters = ~(
    (partition_timestamp <= data_interval_start - datetime.timedelta(hours=1))
    | (data_interval_end <= partition_timestamp)
)
return pd.DataFrame(
    pq.ParquetDataset(
        root_folder,
        filters=partition_filters,
    )
    .read_pandas()
    .to_pandas()
)
```

However, I remember a performance issue on cloud blob stores (too many list operations or something like that), so I finally rolled back to a path generation routine:

```python
for ts in pd.date_range(
    data_interval_start,
    data_interval_end,
    freq="1H",
    inclusive="left",
).tolist():
    partition = f"year={ts.year}/month={ts.month:02d}/day={ts.day:02d}/hour={ts.hour:02d}/"
    yield pd.read_parquet(
        f"{root_folder}/{partition}",
    )
```

But this seems less generic and fails to handle other partition columns...
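As a side note, the half-open overlap predicate derived in the first snippet can be sanity-checked with plain `datetime` objects before baking it into an Arrow expression. This is just an illustrative sketch; the interval values are made up:

```python
import datetime

HOUR = datetime.timedelta(hours=1)

def partition_overlaps(p: datetime.datetime,
                       dis: datetime.datetime,
                       die: datetime.datetime) -> bool:
    """True if the partition interval [p, p+1H[ overlaps [dis, die[."""
    return not (p + HOUR <= dis or die <= p)

# Hypothetical requested interval: [10:30, 12:00[
dis = datetime.datetime(2022, 11, 8, 10, 30)
die = datetime.datetime(2022, 11, 8, 12, 0)

assert not partition_overlaps(datetime.datetime(2022, 11, 8, 9), dis, die)   # [09:00, 10:00[ ends before 10:30
assert partition_overlaps(datetime.datetime(2022, 11, 8, 10), dis, die)      # [10:00, 11:00[ overlaps
assert partition_overlaps(datetime.datetime(2022, 11, 8, 11), dis, die)      # [11:00, 12:00[ overlaps
assert not partition_overlaps(datetime.datetime(2022, 11, 8, 12), dis, die)  # [12:00, 13:00[ starts at die
```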
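And assuming the yield-based routine above is wrapped in a function (here a hypothetical `read_partitions`, yielding one DataFrame per hourly partition), the pieces compose naturally with `pd.concat`:

```python
import pandas as pd

# read_partitions is a hypothetical wrapper around the yield-based
# routine above; it yields one DataFrame per hourly partition.
df = pd.concat(
    read_partitions(root_folder, data_interval_start, data_interval_end),
    ignore_index=True,
)
```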