Robert Gruener created ARROW-2656:
-------------------------------------
Summary: [Python] Improve ParquetManifest creation time for highly
Key: ARROW-2656
URL: https://issues.apache.org/jira/browse/ARROW-2656
Project: Apache Arrow
Issue Type: Improvement
Components: Python
Reporter: Robert Gruener
When a parquet dataset is highly partitioned, the time to call the constructor
for
[ParquetManifest|https://github.com/apache/arrow/blob/master/python/pyarrow/parquet.py#L588][
|https://github.com/apache/arrow/blob/master/python/pyarrow/parquet.py#L588]takes
a significant amount of time since it serially visits directories to find all
parquet files. In a dataset with thousands of partition values this can take
several minutes from a personal laptop.
A quick win to vastly improve this performance would be to use a ThreadPool to
have calls to {{_visit_level}} happen concurrently to prevent wasting a ton of
time waiting on I/O.
An even faster option could be to allow for optional indexing of dataset
metadata in something like the {{common_metadata}}. This could contain all
files in the manifest and their row_group information. This would also allow
for
[split_row_groups|https://github.com/apache/arrow/blob/master/python/pyarrow/parquet.py#L746]
to be implemented efficiently without needing to open every parquet file in
the dataset to retrieve the metadata which is quite time consuming for large
datasets. The main problem with the indexing approach are it requires
immutability of the dataset, which doesn't seem too unreasonable. This specific
implementation seems related to
https://issues.apache.org/jira/browse/ARROW-1983 however that only covers the
write portion.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)