AlexisBRENON commented on issue #14619:
URL: https://github.com/apache/arrow/issues/14619#issuecomment-2824815924
I ended up with this kind of solution:
```python
import datetime

import pandas as pd
from pyarrow import compute as pc, parquet as pq


def load_partitions(
    root_folder: str,
    data_interval_start: datetime.datetime,
    data_interval_end: datetime.datetime,
) -> pd.DataFrame:
    # Rebuild each partition's timestamp from its hive key columns, e.g.
    # year=2024/month=01/day=05/hour=10 -> "2024-01-05-10-+0000".
    partition_timestamp = pc.strptime(  # type: ignore
        pc.binary_join_element_wise(  # type: ignore
            pc.field("year"),
            pc.field("month"),
            pc.field("day"),
            pc.field("hour"),
            pc.scalar("+0000"),
            "-",
        ),
        format="%Y-%m-%d-%H-%z",
        unit="us",
    )
    # Load any partition overlapping the requested interval. All intervals
    # are half-open ([start, end[). A partition interval is [P, P+1H[.
    # It overlaps the requested interval [dis, die[ if:
    #   cond1: it is not before -> !(P+1H <= dis)
    #   cond2: it is not after  -> !(die <= P)
    # So: overlap = cond1 && cond2
    #             = !(P+1H <= dis) && !(die <= P)
    #             = !(P+1H <= dis || die <= P)  (De Morgan's law)
    #             = !(P <= dis - 1H || die <= P)
    # https://stackoverflow.com/a/325964
    partition_filters = ~(
        (partition_timestamp <= data_interval_start - datetime.timedelta(hours=1))
        | (data_interval_end <= partition_timestamp)
    )
    return (
        pq.ParquetDataset(root_folder, filters=partition_filters)
        .read_pandas()
        .to_pandas()
    )
```
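For illustration, here is how that helper might be called; the `load_partitions` name and the bucket path are placeholders, and the interval arguments follow the half-open convention from the comments above:
```python
import datetime

# Hypothetical invocation: read every hourly partition overlapping the
# half-open window [10:00, 12:00[ on 2024-01-01 (UTC).
df = load_partitions(
    "s3://my-bucket/my-table",  # placeholder root folder
    data_interval_start=datetime.datetime(2024, 1, 1, 10, tzinfo=datetime.timezone.utc),
    data_interval_end=datetime.datetime(2024, 1, 1, 12, tzinfo=datetime.timezone.utc),
)
```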
However, I remembered a performance issue on cloud blob stores (too many list operations or something like that) and finally rolled back to a path-generation routine:
```python
import datetime

import pandas as pd


def iter_partitions(root_folder, data_interval_start, data_interval_end):
    # Enumerate the hourly partition paths covering [start, end[ directly,
    # without listing the blob store.
    for ts in pd.date_range(
        data_interval_start,
        data_interval_end,
        freq="1H",
        inclusive="left",
    ):
        partition = f"year={ts.year}/month={ts.month:02d}/day={ts.day:02d}/hour={ts.hour:02d}/"
        yield pd.read_parquet(f"{root_folder}/{partition}")
```
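For completeness, a minimal sketch of consuming that generator, assuming it is wrapped as `iter_partitions` as above:
```python
import pandas as pd

# Stitch the per-partition frames back together; ignore_index resets the
# row index since each partition was read independently.
df = pd.concat(
    iter_partitions(root_folder, data_interval_start, data_interval_end),
    ignore_index=True,
)
```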
But this seems less generic and fails to handle other partition columns...