[ https://issues.apache.org/jira/browse/ARROW-8118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ASF GitHub Bot updated ARROW-8118:
----------------------------------
    Labels: features pull-request-available  (was: features)

> [R] dim method for FileSystemDataset
> ------------------------------------
>
>                 Key: ARROW-8118
>                 URL: https://issues.apache.org/jira/browse/ARROW-8118
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: R
>            Reporter: Sam Albers
>            Priority: Minor
>              Labels: features, pull-request-available
>
> I have been using this function often enough that I wonder a) whether it
> would be useful in the package and b) whether this is something you think
> is worth working on. The basic problem is that if you have a hierarchical
> file structure that accommodates using open_dataset, it is definitely
> useful to know the amount of data you are dealing with. My idea is that
> 'FileSystemDataset' would have dim, nrow and ncol methods. Here is how I've
> been using it:
> {code:java}
> library(arrow)
>
> x <- open_dataset("data/rivers-data/", partitioning = c("prov", "month"))
>
> # Return c(rows, cols) for a dataset backed by Parquet files
> dim_arrow <- function(x) {
>   # Read each file as a table and sum the per-file row counts
>   rows <- sum(purrr::map_dbl(x$files,
>                              ~ParquetFileReader$create(.x)$ReadTable()$num_rows))
>   # The column count comes from the dataset schema
>   cols <- x$schema$num_fields
>
>   c(rows, cols)
> }
>
> dim_arrow(x)
> #> [1] 426929 7
> {code}
>
> Ideally this would work on arrow_dplyr_query objects as well, but I haven't
> quite figured out how such a query filters based on the partitioning
> variables.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
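
A possible variation on the dim_arrow helper quoted above: rather than calling ReadTable() on every file, the row count can be taken from each Parquet file's footer metadata, which avoids loading the data itself. This is only a sketch under stated assumptions, not the change made for ARROW-8118. It assumes an arrow release that provides write_dataset() and the num_rows field on ParquetFileReader; the name dim_from_metadata and the toy data are made up for illustration.

```r
library(arrow)

# Hypothetical helper: c(rows, cols) for a Parquet-backed dataset,
# using per-file metadata instead of reading each table into memory.
dim_from_metadata <- function(ds) {
  rows <- sum(vapply(
    ds$files,
    function(f) as.numeric(ParquetFileReader$create(f)$num_rows),
    numeric(1)
  ))
  # Column count comes from the dataset schema, partition columns included
  c(rows, ds$schema$num_fields)
}

# Toy partitioned dataset in a temporary directory (illustrative data only)
tmp <- tempfile()
toy <- data.frame(
  prov = c("BC", "BC", "AB"),
  flow = c(1.5, 2.0, 3.2),
  stringsAsFactors = FALSE
)
write_dataset(toy, tmp, partitioning = "prov")

# Hive-style "prov=..." directories are detected when the dataset is opened
ds <- open_dataset(tmp)
dims <- dim_from_metadata(ds)
print(dims)
```

As in the original helper, this covers only an unfiltered dataset; it does not address arrow_dplyr_query objects, where the row count depends on the filter.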