Maarten Breddels created ARROW-10736:
----------------------------------------

             Summary: [Python] feather/arrow row splitting and counting (Dataset API)
                 Key: ARROW-10736
                 URL: https://issues.apache.org/jira/browse/ARROW-10736
             Project: Apache Arrow
          Issue Type: Improvement
          Components: C++, Python
            Reporter: Maarten Breddels


For Parquet files read through the Dataset API, we can access the row groups 
and count the number of rows in each. I don't see an equivalent way to get the 
number of rows from a dataset of Feather/Arrow IPC files. For instance, a scan 
that selects no columns does not seem to be possible, and I cannot find any 
method that returns the row count.

Also, if a file consists of chunked arrays, it is exposed as a single fragment, 
and it is not possible to read only a portion of a file fragment (row slicing), 
similar to what ParquetFileFragment.split_by_row_group allows for Parquet.

I don't know of any other way within Apache Arrow to work with Feather/Arrow 
IPC files and read only portions of them (e.g. a particular column for rows i 
to j).

Is there another way to achieve this? If not, is it already planned, or is it 
difficult to implement?

cheers,

Maarten



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
