ghuls commented on issue #39079:
URL: https://github.com/apache/arrow/issues/39079#issuecomment-1936041192
You are rescanning the whole subdirectory on every invocation of
`pyarrow.parquet.read_table`. Scanning the filesystem only once for Parquet
files with `pa.dataset.dataset` and then filtering the data on the key column
with a dataset scanner is much faster:
```python
In [14]: with timectx("scan for parquet files using dataset"):
    ...:     write_to_dataset_dataset = pa.dataset.dataset(
    ...:         "write_to_dataset/",
    ...:         partitioning=pa.dataset.partitioning(flavor="hive"),
    ...:     )
    ...:
scan for parquet files using dataset 40.98821000661701 ms

In [143]: with timectx("load partitions using dataset"):
     ...:     for key in keys:
     ...:         write_to_dataset_dataset.scanner(filter=(pc.field("key") == key)).to_table()
     ...:
load partitions using dataset 558.3812920376658 ms

In [144]: with timectx("load partitions using read_table"):
     ...:     for key in keys:
     ...:         pyarrow.parquet.read_table("write_to_dataset", filters=[("key", "==", key)])
     ...:
load partitions using read_table 5385.372799937613 ms
```
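
For reference, here is a self-contained sketch of the same comparison. It assumes a hive-partitioned dataset already written under `write_to_dataset/` with a `key` partition column; `timectx` and the `keys` list are hypothetical stand-ins for the helpers used in the session above and are not part of PyArrow:

```python
import time
from contextlib import contextmanager

import pyarrow.compute as pc
import pyarrow.dataset as ds
import pyarrow.parquet as pq


@contextmanager
def timectx(label):
    # Hypothetical timing helper: print elapsed wall-clock time in ms.
    start = time.perf_counter()
    yield
    print(label, (time.perf_counter() - start) * 1000, "ms")


keys = ["a", "b", "c"]  # assumed values of the "key" partition column

# Scan the filesystem once and reuse the dataset for every filtered read.
with timectx("scan for parquet files using dataset"):
    dataset = ds.dataset(
        "write_to_dataset/",
        partitioning=ds.partitioning(flavor="hive"),
    )

with timectx("load partitions using dataset"):
    for key in keys:
        dataset.scanner(filter=(pc.field("key") == key)).to_table()

# read_table rediscovers the files under the directory on every call,
# which is why this loop is much slower.
with timectx("load partitions using read_table"):
    for key in keys:
        pq.read_table("write_to_dataset", filters=[("key", "==", key)])
```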