Joris Van den Bossche created ARROW-8061:
--------------------------------------------
Summary: [C++][Dataset] Ability to specify granularity of
ParquetFileFragment (support row groups)
Key: ARROW-8061
URL: https://issues.apache.org/jira/browse/ARROW-8061
Project: Apache Arrow
Issue Type: Improvement
Components: C++ - Dataset
Reporter: Joris Van den Bossche
Specifically for Parquet (not sure whether this is relevant for other file
formats as well; for IPC/Feather, potentially the record batch), it would be
useful to target row groups instead of files as fragments.
Quoting the original design documents: _"In datasets consisting of many
fragments, the dataset API must expose the granularity of fragments in a public
way to enable parallel processing, if desired."_
And a comment from Wes on that: _"a single Parquet file can "export" one or
more fragments based on settings. The default might be to split fragments based
on row group"_
Currently, the level at which fragments are defined (at least in the typical
partitioned Parquet dataset) is "1 file == 1 fragment".
Would it be possible or desirable to make this finer grained, so that you
could also opt to have one fragment per row group?
We could have a ParquetFragment that has this option, and a
ParquetFileFormat-specific option to say what the granularity of a fragment is
(file vs row group)?
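To make the proposal concrete, here is a minimal sketch in plain Python of what a granularity option could mean. It does not use the Arrow API: `Granularity`, `Fragment`, and `list_fragments` are hypothetical names invented for illustration, and the row-group counts per file are made-up inputs that a real implementation would read from the Parquet metadata.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Optional


class Granularity(Enum):
    """Hypothetical format-level option: what one fragment covers."""
    FILE = "file"
    ROW_GROUP = "row_group"


@dataclass(frozen=True)
class Fragment:
    """Hypothetical fragment: a file, optionally narrowed to one row group."""
    path: str
    row_group: Optional[int] = None  # None means the whole file


def list_fragments(row_groups_per_file: Dict[str, int],
                   granularity: Granularity) -> List[Fragment]:
    """Enumerate fragments for a set of files at the requested granularity."""
    fragments: List[Fragment] = []
    for path, num_row_groups in row_groups_per_file.items():
        if granularity is Granularity.FILE:
            # Current behaviour described in this issue: 1 file == 1 fragment.
            fragments.append(Fragment(path))
        else:
            # Proposed alternative: one fragment per row group.
            fragments.extend(Fragment(path, i) for i in range(num_row_groups))
    return fragments


# Made-up dataset: two files with 3 and 2 row groups respectively.
files = {"part-0.parquet": 3, "part-1.parquet": 2}
per_file = list_fragments(files, Granularity.FILE)
per_row_group = list_fragments(files, Granularity.ROW_GROUP)
```

With file granularity the example dataset yields two fragments; with row-group granularity it yields five, which gives a scanner more independent units to process in parallel.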
cc [~fsaintjacques] [~bkietz]
--
This message was sent by Atlassian Jira
(v8.3.4#803005)