[
https://issues.apache.org/jira/browse/ARROW-15260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17469854#comment-17469854
]
Martin du Toit commented on ARROW-15260:
----------------------------------------
The raw data that we receive from clients is organised into a hierarchy of
folders, and we partition the data by those folders. The lowest level is a
timestamp folder, but in some scenarios we receive multiple files for a single
timestamp. To process the data in bulk, we need to create a unique row-level id
within each file, i.e. group_by the various partitions and file_name and then
add a row_number. If we pick up any issues with the data, we need to be able
to pinpoint the exact file where the issue occurred so that we can report it
back to the client.
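A minimal sketch of that workflow, as a workaround while the dataset itself has
no file_name column: read each path in ds$files separately, tag the rows with
the path, and number the rows within each file. It assumes Parquet files and
the dplyr package; the example data, the client and ts partition names, and
the row_id column are hypothetical and only there to make the sketch runnable.

```r
library(arrow)
library(dplyr)

# Hypothetical example data, written out as a partitioned dataset.
root <- tempfile("raw_data_")
raw <- data.frame(
  client = c("a", "a", "b"),
  ts     = c("t1", "t1", "t2"),
  value  = c(1, 2, 3)
)
write_dataset(raw, root, partitioning = c("client", "ts"))

ds <- open_dataset(root)

# Read every file on its own so that each row can carry the path it came from.
per_file <- lapply(ds$files, function(path) {
  tbl <- as.data.frame(read_parquet(path))
  tbl$file_name <- path
  tbl
})

# Number the rows within each file; (file_name, row_id) is then unique.
combined <- bind_rows(per_file) %>%
  group_by(file_name) %>%
  mutate(row_id = row_number()) %>%
  ungroup()
```

Reading the files one by one drops the partition columns, because they live in
the folder names rather than in the files, but they can be recovered from
file_name. This loads everything into memory, so it may not suit very large
datasets.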
I hope this makes sense.
> [R] open_dataset - add file_name as column
> ------------------------------------------
>
> Key: ARROW-15260
> URL: https://issues.apache.org/jira/browse/ARROW-15260
> Project: Apache Arrow
> Issue Type: New Feature
> Components: R
> Reporter: Martin du Toit
> Priority: Minor
>
> Hi. Is it possible to add the file_name as a column to a dataset?
> {code:r}
> ds <- open_dataset(.....)
> list_of_files <- ds$files
> {code}
> This works, but I need the file_name as a column.
> Thanks
>