[
https://issues.apache.org/jira/browse/ARROW-3244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Wes McKinney updated ARROW-3244:
--------------------------------
Summary: [Python] Multi-file parquet loading without scan (was: Multi-file
parquet loading without scan)
> [Python] Multi-file parquet loading without scan
> ------------------------------------------------
>
> Key: ARROW-3244
> URL: https://issues.apache.org/jira/browse/ARROW-3244
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Reporter: Martin Durant
> Priority: Major
> Labels: parquet
>
> A number of mechanism are possible to avoid having to access and read the
> parquet footers in a data set consisting of a number of files. In the case of
> a large number of data files (perhaps split with directory partitioning) and
> remote storage, this can be a significant overhead. This is significant from
> the point of view of Dask, which must have the metadata available in the
> client before setting up computational graphs.
>
> Here are some suggestions of what could be done.
>
> * some parquet writing frameworks include a `_metadata` file, which contains
> all the information from the footers of the various files. If this file is
> present, then this data can be read from one place, with a single file
> access. For a large number of files, parsing the thrift information may, by
> itself, be a non-negligible overhead≥
> * the schema (dtypes) can be found in a `_common_metadata`, or from any one
> of the data-files, then the schema could be assumed (perhaps at the user's
> option) to be the same for all of the files. However, the information about
> the directory partitioning would not be available. Although Dask may infer
> the information from the filenames, it would be preferable to go through the
> machinery with parquet-cpp, and view the whole data-set as a single object.
> Note that the files will still need to have the footer read to access the
> data, for the bytes offsets, but from Dask's point of view, this would be
> deferred to tasks running in parallel.
> (please forgive that some of this has already been mentioned elsewhere; this
> is one of the entries in the list at
> [https://github.com/dask/fastparquet/issues/374] as a feature that is useful
> in fastparquet)
>
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)