CaselIT commented on issue #39079:
URL: https://github.com/apache/arrow/issues/39079#issuecomment-1936106918
For reference, on my PC your suggestion is on par with re-opening
the row group file on each iteration of the loop. It gives me the following times:
```py
import pyarrow.compute as pc
import pyarrow.dataset
import polars as pl

# timectx (a timing context manager) and keys are defined earlier in the thread.
with timectx("load partitions using read_table - read dataset once"):
    write_to_dataset_dataset = pyarrow.dataset.dataset(
        "write_to_dataset/",
        partitioning=pyarrow.dataset.partitioning(flavor="hive"),
    )
    for key in keys:
        pl.from_arrow(
            write_to_dataset_dataset.scanner(
                filter=(pc.field("key") == key)
            ).to_table()
        )
```
```
load partitions using read_table - read dataset once 5504.083399995579 ms
```
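For comparison, here is a minimal sketch of the per-key variant referenced above (re-opening the hive-partitioned files on every iteration via `pyarrow.parquet.read_table` with a pushed-down filter); the exact code used earlier in the thread may differ:

```py
import pyarrow.parquet as pq
import polars as pl

# Hypothetical sketch: re-read the partitioned dataset for each key,
# letting read_table push the partition filter down on every call.
with timectx("load partitions using read_table - re-open each time"):
    for key in keys:
        pl.from_arrow(
            pq.read_table("write_to_dataset/", filters=[("key", "==", key)])
        )
```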
So I still think this scheme has significant advantages compared to hive partitioning.