[jira] [Commented] (ARROW-8118) [R] dim method for FileSystemDataset

Neal Richardson (Jira) Fri, 13 Mar 2020 15:46:53 -0700


    [ https://issues.apache.org/jira/browse/ARROW-8118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17059118#comment-17059118 ]


Neal Richardson commented on ARROW-8118:
----------------------------------------

Yeah, ncol is easy to get from the Dataset schema (and you can get it from 
arrow_dplyr_query too, incorporating any select statement). Your solution for 
nrow is clever, but you're right, that won't generalize to query objects, and 
looking ahead, you'd have to switch the behavior based on the FileFormat since 
Datasets can be based on other file types. 

I'd be happy to review a PR with nrow and ncol methods, and dim that wraps 
them, erroring gracefully where unsupported (i.e. if filtered). IMO the C++ 
library should be able to tell us nrow, and it may already have the relevant 
file/row group statistics in memory, but we can start with something that works 
in R and worry about pushing the work down to C++ later.
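
For illustration, the R-first approach could be sketched roughly as follows. This is a minimal sketch, not the arrow package's API: nrow_dataset, ncol_dataset, dim_dataset, count_parquet_rows, and the filtered and count_rows arguments are all hypothetical names. Only the x$files, x$schema$num_fields, and ParquetFileReader expressions are taken from the issue description. The count_rows argument is where the behavior would switch on the file format.

```r
# Hypothetical helpers for illustration; none of these exist in arrow.

# Row count for one Parquet file, using the expression from the issue
# description. This reads the whole table, so it is slow on large files.
count_parquet_rows <- function(path) {
  arrow::ParquetFileReader$create(path)$ReadTable()$num_rows
}

# Sum the per-file row counts. `count_rows` is the hook for other file
# formats: pass a different counting function for non-Parquet datasets.
nrow_dataset <- function(x, count_rows = count_parquet_rows) {
  sum(vapply(x$files, count_rows, numeric(1)))
}

# The column count comes straight from the schema.
ncol_dataset <- function(x) {
  x$schema$num_fields
}

# dim wraps nrow and ncol. `filtered` is an assumed flag that the caller
# sets when a filter has been applied; in that case per-file row counts
# would be wrong, so fail with a clear message instead.
dim_dataset <- function(x, filtered = FALSE, count_rows = count_parquet_rows) {
  if (filtered) {
    stop("nrow is not supported for filtered datasets", call. = FALSE)
  }
  c(nrow_dataset(x, count_rows), ncol_dataset(x))
}
```

With the dataset from the issue description, dim_dataset(x) should return the same values as the reporter's dim_arrow(x), and dim_dataset(x, filtered = TRUE) stops with an error rather than returning a misleading count.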

> [R] dim method for FileSystemDataset
> ------------------------------------
>
>                 Key: ARROW-8118
>                 URL: https://issues.apache.org/jira/browse/ARROW-8118
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: R
>            Reporter: Sam Albers
>            Priority: Minor
>              Labels: features
>
> I have been using this function enough that I wonder a) whether it would be 
> useful in the package and b) whether you think it is worth working on. The 
> basic problem is that if you have a hierarchical file structure that 
> accommodates using open_dataset, it is definitely useful to know the amount 
> of data you are dealing with. My idea is that 'FileSystemDataset' would have 
> dim, nrow and ncol methods. Here is how I've been using it:
> {code:java}
> library(arrow)
> x <- open_dataset("data/rivers-data/", partitioning = c("prov", "month"))
> dim_arrow <- function(x) {
>   rows <- sum(purrr::map_dbl(
>     x$files,
>     ~ ParquetFileReader$create(.x)$ReadTable()$num_rows
>   ))
>   cols <- x$schema$num_fields
>
>   c(rows, cols)
> }
> dim_arrow(x)
> #> [1] 426929 7
> {code}
>  
> Ideally this would work on arrow_dplyr_query objects as well but I haven't 
> quite figured out how that filters based on the partitioning variables.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
