Maarten Breddels created ARROW-10736:
----------------------------------------
Summary: [Python] feather/arrow row splitting and counting (Dataset API)
Key: ARROW-10736
URL: https://issues.apache.org/jira/browse/ARROW-10736
Project: Apache Arrow
Issue Type: Improvement
Components: C++, Python
Reporter: Maarten Breddels
For Parquet files, the Dataset API lets us access the row groups and count the
number of rows in each. I don't see an equivalent way to get the number of rows
from a dataset of Feather/Arrow IPC files: a scan without any columns does not
seem to be possible, and I can't find any method that returns the row count.
Also, if a file consists of chunked arrays, it is exposed as a single fragment,
and it is not possible to read only a portion of a file fragment (row slicing),
similar to what ParquetFileFragment.split_by_row_group allows for Parquet.
I don't know of any other way within Apache Arrow to work with Feather/Arrow
IPC files and read only portions of them (e.g. a particular column for rows i
to j).
Are these features available in some other way, are they already planned, or
are they perhaps difficult to implement?
cheers,
Maarten
--
This message was sent by Atlassian Jira
(v8.3.4#803005)