CaselIT commented on issue #39079:
URL: https://github.com/apache/arrow/issues/39079#issuecomment-1936098178
Thanks for the reply!
Sure, but my use case is more like:
- some code processes all partitions at once and saves the data;
- some batch processing is launched on each partition. This reads a single key once, does its thing on that data, and then terminates.
I think something similar is a very common use case, where it's infeasible
to read the dataset only once and reuse it many times.
If that's an option, opening the parquet file only once in this row-group-per-partition scheme also gives much better results:
```py
import pyarrow.parquet
import polars as pl

with timectx("load partitions from row group"):
    for key in keys:
        index = key_to_index[key,]
        # re-opens the file (and re-reads the footer) on every iteration
        with pyarrow.parquet.ParquetFile("row_group") as pf:
            pl.from_arrow(pf.read_row_group(index))
```
```
load partitions from row group 6425.217500000144 ms
```
Moving the file open outside the `for` loop:
```py
with timectx("load partitions from row group - single file open"):
    # open once; the footer metadata is parsed a single time and reused
    with pyarrow.parquet.ParquetFile("row_group") as pf:
        for key in keys:
            index = key_to_index[key,]
            pl.from_arrow(pf.read_row_group(index))
```
```
load partitions from row group - single file open 408.84460000961553 ms
```
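The timing labels above come from a `timectx` helper that isn't shown; a minimal stdlib stand-in that prints the same `label … ms` format might look like:

```python
import time
from contextlib import contextmanager

# Minimal stand-in for the timectx helper used above (assumption: the
# original prints the label followed by elapsed wall-clock milliseconds).
@contextmanager
def timectx(label):
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000
        print(f"{label} {elapsed_ms} ms")
```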