[ https://issues.apache.org/jira/browse/ARROW-10736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17239229#comment-17239229 ]
Joris Van den Bossche commented on ARROW-10736:
-----------------------------------------------
bq. Also, if a file consists of chunked arrays, it is exposed as 1 fragment,
and it is not possible to read only a portion of a filefragment (row slicing),
similar to how one could work with ParquetFileFragment.split_by_row_group.
Yeah, you can get an iterator of the record batches (instead of a full table),
e.g. with:
{code}
In [21]: fragment = list(dataset.get_fragments())[0]
In [22]: list(fragment.to_batches())
Out[22]:
[pyarrow.RecordBatch
a: int64
b: double,
pyarrow.RecordBatch
a: int64
b: double]
{code}
But I don't think it's currently possible to read _only_ a specific RecordBatch
(as is possible for Parquet with the "split_by_row_group" or "subset"
methods). I think it would be nice to add this for the IPC/Feather format as
well, since it should be technically possible.
> [Python] feather/arrow row splitting and counting (Dataset API)
> ---------------------------------------------------------------
>
> Key: ARROW-10736
> URL: https://issues.apache.org/jira/browse/ARROW-10736
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++, Python
> Reporter: Maarten Breddels
> Priority: Major
>
> For parquet files using the Dataset API, we have the option to access the row
> groups, and count the total number of rows within each. I don't see the
> option to get the number of rows from a dataset with feather/arrow ipc files.
> For instance, a scan without any columns is not possible it seems, nor any
> method to get the row count.
> Also, if a file consists of chunked arrays, it is exposed as 1 fragment, and
> it is not possible to read only a portion of a filefragment (row slicing),
> similar to how one could work with ParquetFileFragment.split_by_row_group.
> I don't know of any other way within Apache Arrow to work with feather/arrow
> ipc files and only read portions of it (e.g. a particular column for row i to
> j).
> Are these features possible any other way, or is this already planned,
> possibly difficult to implement?
> cheers,
> Maarten