[
https://issues.apache.org/jira/browse/ARROW-8118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Neal Richardson updated ARROW-8118:
-----------------------------------
Summary: [R] dim method for FileSystemDataset (was: dim method for
FileSystemDataset)
> [R] dim method for FileSystemDataset
> ------------------------------------
>
> Key: ARROW-8118
> URL: https://issues.apache.org/jira/browse/ARROW-8118
> Project: Apache Arrow
> Issue Type: Improvement
> Components: R
> Reporter: Sam Albers
> Priority: Major
> Labels: features
>
> I have been using this function enough that I wonder a) whether it would be
> useful in the package and b) whether it is something you think is worth working on. The
> basic problem is that if you have a hierarchical file structure that
> accommodates using open_dataset, it is definitely useful to know the amount
> of data you are dealing with. My idea is that 'FileSystemDataset' would have
> dim, nrow and ncol methods. Here is how I've been using it:
> {code:java}
> library(arrow)
> x <- open_dataset("data/rivers-data/", partitioning = c("prov", "month"))
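> dim_arrow <- function(x) {
>   # Rows: read each Parquet file in the dataset and sum the row counts
>   rows <- sum(purrr::map_dbl(x$files,
>     ~ParquetFileReader$create(.x)$ReadTable()$num_rows))
>   # Columns: number of fields in the dataset schema
>   cols <- x$schema$num_fields
>
>   c(rows, cols)
> }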
> dim_arrow(x)
> #> [1] 426929 7
> {code}
>
> Ideally this would work on arrow_dplyr_query objects as well, but I haven't
> quite figured out how those filter based on the partitioning variables.
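
One way the proposed methods could look is sketched below. This is only an illustration of the idea in the description, not the package's implementation. It assumes that objects returned by open_dataset carry the S3 class "FileSystemDataset", and it reuses the calls from the snippet above (x$files, ParquetFileReader$create, ReadTable, num_rows, x$schema$num_fields). The helper name count_parquet_rows is hypothetical. Because base R defines nrow(x) as dim(x)[1L] and ncol(x) as dim(x)[2L], a single dim method should be enough to make all three work.

```r
library(arrow)

# Hypothetical helper (not part of arrow): row count of one Parquet file.
# Like the original snippet, this reads the whole file, so it can be slow
# on large datasets.
count_parquet_rows <- function(path) {
  ParquetFileReader$create(path)$ReadTable()$num_rows
}

# Sketch of an S3 dim method, assuming the dataset object has the
# S3 class "FileSystemDataset".
dim.FileSystemDataset <- function(x) {
  rows <- sum(vapply(x$files, count_parquet_rows, numeric(1)))
  cols <- x$schema$num_fields
  c(rows, cols)
}
```

With this method defined, dim(x), nrow(x) and ncol(x) can all be called directly on the dataset from the example above. It does not cover arrow_dplyr_query objects, since any filter on the partitioning variables would have to be applied before counting.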
--
This message was sent by Atlassian Jira
(v8.3.4#803005)