Joris Van den Bossche created ARROW-8061:
--------------------------------------------

             Summary: [C++][Dataset] Ability to specify granularity of 
ParquetFileFragment (support row groups)
                 Key: ARROW-8061
                 URL: https://issues.apache.org/jira/browse/ARROW-8061
             Project: Apache Arrow
          Issue Type: Improvement
          Components: C++ - Dataset
            Reporter: Joris Van den Bossche


Specifically for Parquet (not sure if it will be relevant for other file 
formats as well; for IPC/Feather it could potentially be the record batch), it 
would be useful to target row groups instead of files as fragments.

Quoting the original design documents: _"In datasets consisting of many 
fragments, the dataset API must expose the granularity of fragments in a public 
way to enable parallel processing, if desired. "._   
And a comment from Wes on that: _"a single Parquet file can "export" one or 
more fragments based on settings. The default might be to split fragments based 
on row group"_

Currently, the level on which fragments are defined (at least in the typical 
partitioned parquet dataset) is "1 file == 1 fragment".

Would it be possible or desirable to make this finer-grained, so that you 
could also opt to have one fragment per row group?   
We could have a ParquetFragment that has this option, and a 
ParquetFileFormat-specific option to say what the granularity of a fragment is 
(file vs row group)?
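
A rough sketch of how such a granularity option could behave, written in plain 
Python rather than against the C++ Dataset API. Every name here (Granularity, 
the fields of ParquetFragment, make_fragments) is hypothetical and only 
illustrates the idea; none of it is existing Arrow API, and the number of row 
groups is passed in by hand instead of being read from Parquet metadata.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Granularity(Enum):
    """Hypothetical format-level option: what one fragment covers."""
    FILE = "file"
    ROW_GROUP = "row_group"


@dataclass(frozen=True)
class ParquetFragment:
    """Hypothetical fragment: a file path plus an optional row group index."""
    path: str
    row_group: Optional[int] = None  # None means "the whole file"


def make_fragments(path: str, num_row_groups: int,
                   granularity: Granularity) -> List[ParquetFragment]:
    """Return the fragments one Parquet file would expose.

    num_row_groups is assumed to be known already (in practice it would
    come from the file's footer metadata).
    """
    if granularity is Granularity.FILE:
        # Current behaviour: 1 file == 1 fragment.
        return [ParquetFragment(path)]
    # Proposed behaviour: one fragment per row group.
    return [ParquetFragment(path, i) for i in range(num_row_groups)]


if __name__ == "__main__":
    for fragment in make_fragments("part-0.parquet", 3, Granularity.ROW_GROUP):
        print(fragment)
```

With Granularity.FILE the sketch reproduces the current "1 file == 1 fragment" 
behaviour, so the finer granularity would be strictly opt-in.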

cc [~fsaintjacques] [~bkietz]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)