[jira] [Commented] (ARROW-10736) [Python] feather/arrow row splitting and counting (Dataset API)

Maarten Breddels (Jira) Thu, 26 Nov 2020 06:53:08 -0800


    [ https://issues.apache.org/jira/browse/ARROW-10736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17239305#comment-17239305 ]


Maarten Breddels commented on ARROW-10736:
------------------------------------------

Thanks! I tried a scan with an empty schema on the fragments, which did not 
work, but using columns did the trick. Combined with requiring that arrow or 
feather files are not too big, so that we only ever need to skip over whole 
fragments, this is workable for me for the moment.
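
To make the "skip over whole fragments" idea concrete, here is a minimal 
sketch in plain Python. It assumes the per-fragment row counts have already 
been collected (for example by scanning a single column of each fragment, as 
described above) and maps a global row range onto the fragments that need to 
be read. The function name plan_row_range and the example counts are 
hypothetical and are not part of the Arrow API.

```python
from bisect import bisect_right
from itertools import accumulate


def plan_row_range(fragment_rows, start, stop):
    """Map the global row range [start, stop) onto per-fragment slices.

    fragment_rows: list of row counts, one per fragment, in dataset order.
    Returns a list of (fragment_index, local_start, local_stop) tuples.
    Fragments that lie entirely outside the range are skipped.
    """
    # offsets[k] is the global index of the first row of fragment k;
    # offsets[-1] is the total row count of the dataset.
    offsets = [0, *accumulate(fragment_rows)]
    total = offsets[-1]

    # Clamp the requested range to the rows that actually exist.
    start = max(0, min(start, total))
    stop = max(start, min(stop, total))
    plan = []
    if start == stop:
        return plan

    # Binary search for the fragment that contains the first requested row.
    first = bisect_right(offsets, start) - 1
    for k in range(first, len(fragment_rows)):
        if offsets[k] >= stop:
            break  # every later fragment starts past the requested range
        local_start = max(start, offsets[k]) - offsets[k]
        local_stop = min(stop, offsets[k + 1]) - offsets[k]
        if local_stop > local_start:
            plan.append((k, local_start, local_stop))
    return plan


# Hypothetical row counts for three fragments (files).
rows = [100, 50, 200]
print(sum(rows))                       # total row count: 350
print(plan_row_range(rows, 120, 180))  # [(1, 20, 50), (2, 0, 30)]
```

Each tuple says which fragment to load and which local rows to keep after 
loading it, so fragment 0 is never read in the example. The fragments that 
are touched still have to be read in full for the requested columns, which 
is why this only works well when the individual files are not too big.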

Slicing of datasets, or a row start/end number for scan, might also get the 
job done. That would be interesting for CSV files as well.

> [Python] feather/arrow row splitting and counting (Dataset API)
> ---------------------------------------------------------------
>
>                 Key: ARROW-10736
>                 URL: https://issues.apache.org/jira/browse/ARROW-10736
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++, Python
>            Reporter: Maarten Breddels
>            Priority: Major
>
> For parquet files using the Dataset API, we have the option to access the row 
> groups, and count the total number of rows within each. I don't see the 
> option to get the number of rows from a dataset with feather/arrow ipc files. 
> For instance, a scan without any columns is not possible it seems, nor any 
> method to get the row count.
> Also, if a file consists of chunked arrays, it is exposed as a single 
> fragment, and it is not possible to read only a portion of a file fragment 
> (row slicing), similar to how one could work with 
> ParquetFileFragment.split_by_row_group.
> I don't know of any other way within Apache Arrow to work with feather/arrow 
> ipc files and only read portions of it (e.g. a particular column for row i to 
> j).
> Are these features possible in some other way? Or are they already planned, 
> or perhaps difficult to implement?
> cheers,
> Maarten



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
