[ https://issues.apache.org/jira/browse/ARROW-15260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17516796#comment-17516796 ]
Dewey Dunnington commented on ARROW-15260:
------------------------------------------
It looks like we can access {{__filename}} from the {{Scanner}} too, although
what we can do with it is pretty limited. Note that in R you will have to use
backticks in something like dplyr (e.g., {{`__filename`}}), because syntactic
variable names in R can't start with {{_}}. In the dplyr interface we make a
pretty strong assumption that the schema names are the only names available in
the dataset. Maybe the best way would be to add a binding like
{{dataset_filename()}} that inserts the correct field reference (although C++
currently gives us an error if we try to use a field reference to
{{__filename}} in an {{Expression}}).
{code:R}
library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)
tf <- tempfile()
write_dataset(mtcars, tf, partitioning = "cyl")
ds <- open_dataset(tf)
# works!
scanner <- Scanner$create(
open_dataset(tf),
projection = c("__filename", names(ds))
)
as_tibble(scanner$ToTable())
#> # A tibble: 32 × 12
#> `__filename` mpg disp hp drat wt qsec vs am gear
carb
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
<dbl>
#> 1 /private/var/fol… 22.8 108 93 3.85 2.32 18.6 1 1 4
1
#> 2 /private/var/fol… 24.4 147. 62 3.69 3.19 20 1 0 4
2
#> 3 /private/var/fol… 22.8 141. 95 3.92 3.15 22.9 1 0 4
2
#> 4 /private/var/fol… 32.4 78.7 66 4.08 2.2 19.5 1 1 4
1
#> 5 /private/var/fol… 30.4 75.7 52 4.93 1.62 18.5 1 1 4
2
#> 6 /private/var/fol… 33.9 71.1 65 4.22 1.84 19.9 1 1 4
1
#> 7 /private/var/fol… 21.5 120. 97 3.7 2.46 20.0 1 0 3
1
#> 8 /private/var/fol… 27.3 79 66 4.08 1.94 18.9 1 1 4
1
#> 9 /private/var/fol… 26 120. 91 4.43 2.14 16.7 0 1 5
2
#> 10 /private/var/fol… 30.4 95.1 113 3.77 1.51 16.9 1 1 5
2
#> # … with 22 more rows, and 1 more variable: cyl <int>
# seems that we still can't use __filename in a filter expr
Scanner$create(
  open_dataset(tf),
  projection = c("__filename", names(ds)),
  filter = Expression$create(
    "match_substring",
    Expression$field_ref("__filename"),
    options = list(pattern = "cyl=8")
  )
)
#> Error: Invalid: No match for FieldRef.Name(__filename) in mpg: double
#> disp: double
#> hp: double
#> drat: double
#> wt: double
#> qsec: double
#> vs: double
#> am: double
#> gear: double
#> carb: double
#> cyl: int32
#> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/type.h:1717  CheckNonEmpty(matches, root)
#> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/dataset/scanner.cc:782  ref.FindOne(*scan_options_->dataset_schema)
{code}
> [R] open_dataset - add file_name as column
> ------------------------------------------
>
> Key: ARROW-15260
> URL: https://issues.apache.org/jira/browse/ARROW-15260
> Project: Apache Arrow
> Issue Type: New Feature
> Components: R
> Reporter: Martin du Toit
> Priority: Minor
>
> Hi. Is it possible to add the file_name as a column to a dataset?
> {code:r}
> ds <- open_dataset(.....)
> list_of_files <- ds$files
> {code}
> This works, but I need the file_name as a column.
> Thanks
>