[
https://issues.apache.org/jira/browse/ARROW-8061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Francois Saint-Jacques resolved ARROW-8061.
-------------------------------------------
Fix Version/s: 0.17.0
Resolution: Fixed
Issue resolved by pull request 6670
[https://github.com/apache/arrow/pull/6670]
> [C++][Dataset] Ability to specify granularity of ParquetFileFragment (support
> row groups)
> -----------------------------------------------------------------------------------------
>
> Key: ARROW-8061
> URL: https://issues.apache.org/jira/browse/ARROW-8061
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++ - Dataset
> Reporter: Joris Van den Bossche
> Assignee: Ben Kietzman
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.17.0
>
> Time Spent: 5.5h
> Remaining Estimate: 0h
>
> Specifically for Parquet (not sure if it will be relevant for other file
> formats as well; for IPC/Feather, potentially the record batch), it would be
> useful to target row groups instead of files as fragments.
> Quoting the original design documents: _"In datasets consisting of many
> fragments, the dataset API must expose the granularity of fragments in a
> public way to enable parallel processing, if desired."_
> And a comment from Wes on that: _"a single Parquet file can "export" one or
> more fragments based on settings. The default might be to split fragments
> based on row group"_
> Currently, the level on which fragments are defined (at least in the typical
> partitioned parquet dataset) is "1 file == 1 fragment".
> Would it be possible or desirable to make this more fine-grained, where you
> could also opt to have a fragment per row group?
> We could have a ParquetFragment that has this option, and a ParquetFileFormat
> specific option to say what the granularity of a fragment is (file vs row
> group)?
> cc [~fsaintjacques] [~bkietz]
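
The file-versus-row-group choice described above can be illustrated with a
small sketch. This is not the Arrow API and not the change made in the linked
pull request: Granularity, ParquetFileInfo, Fragment and make_fragments are
hypothetical names invented here, and the file names and row group counts are
made up. It only shows how a single option could switch between "1 file == 1
fragment" and one fragment per row group.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional, Tuple


class Granularity(Enum):
    """Hypothetical option: what a single fragment covers."""
    FILE = "file"
    ROW_GROUP = "row_group"


@dataclass(frozen=True)
class ParquetFileInfo:
    """Hypothetical stand-in for a Parquet file and its row group count."""
    path: str
    num_row_groups: int


@dataclass(frozen=True)
class Fragment:
    """Hypothetical fragment; row_groups=None means the whole file."""
    path: str
    row_groups: Optional[Tuple[int, ...]] = None


def make_fragments(files: List[ParquetFileInfo],
                   granularity: Granularity) -> List[Fragment]:
    """Produce one fragment per file, or one per row group."""
    fragments: List[Fragment] = []
    for f in files:
        if granularity is Granularity.FILE:
            fragments.append(Fragment(f.path))
        else:
            fragments.extend(
                Fragment(f.path, (i,)) for i in range(f.num_row_groups)
            )
    return fragments


# Made-up example data: two files with 3 and 2 row groups.
files = [
    ParquetFileInfo("part-0.parquet", 3),
    ParquetFileInfo("part-1.parquet", 2),
]
per_file = make_fragments(files, Granularity.FILE)
per_row_group = make_fragments(files, Granularity.ROW_GROUP)
print(len(per_file), len(per_row_group))
```

With row group granularity, each fragment is a smaller unit of work that a
scanner could hand to a separate task, which is the parallelism the design
document quote refers to.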