[jira] [Commented] (ARROW-10736) [Python] feather/arrow row splitting and counting (Dataset API)

Joris Van den Bossche (Jira) Thu, 26 Nov 2020 04:30:37 -0800


    [ 
https://issues.apache.org/jira/browse/ARROW-10736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17239229#comment-17239229
 ]


Joris Van den Bossche commented on ARROW-10736:
-----------------------------------------------

bq. Also, if a file consists of chunked arrays, it is exposed as 1 fragment, 
and it is not possible to read only a portion of a filefragment (row slicing), 
similar to how one could work with ParquetFileFragment.split_by_row_group.

Yeah, you can get an iterator of the record batches (instead of a full table), 
e.g. with:

{code}
In [21]: fragment = list(dataset.get_fragments())[0]

In [22]: list(fragment.to_batches())
Out[22]: 
[pyarrow.RecordBatch
 a: int64
 b: double,
 pyarrow.RecordBatch
 a: int64
 b: double]
{code}

But I don't think it's currently possible to read _only_ a specific RecordBatch 
(as is possible for Parquet with the "split_by_row_group" or "subset" 
methods). I think it would be nice to add this for the IPC/Feather format as 
well, as it should be perfectly possible technically.

> [Python] feather/arrow row splitting and counting (Dataset API)
> ---------------------------------------------------------------
>
>                 Key: ARROW-10736
>                 URL: https://issues.apache.org/jira/browse/ARROW-10736
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++, Python
>            Reporter: Maarten Breddels
>            Priority: Major
>
> For parquet files using the Dataset API, we have the option to access the row 
> groups, and count the total number of rows within each. I don't see the 
> option to get the number of rows from a dataset with feather/arrow ipc files. 
> For instance, a scan without any columns is not possible it seems, nor any 
> method to get the row count.
> Also, if a file consists of chunked arrays, it is exposed as 1 fragment, and 
> it is not possible to read only a portion of a filefragment (row slicing), 
> similar to how one could work with ParquetFileFragment.split_by_row_group.
> I don't know of any other way within Apache Arrow to work with feather/arrow 
> ipc files and only read portions of it (e.g. a particular column for row i to 
> j).
> Are these features possible any other way, or is this already planned, 
> possibly difficult to implement?
> cheers,
> Maarten



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
