[ 
https://issues.apache.org/jira/browse/ARROW-15260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17469854#comment-17469854
 ] 

Martin du Toit commented on ARROW-15260:
----------------------------------------

The raw data that we receive from clients is organized into a hierarchy of 
folders, and we partition the data by those folders. The lowest level is a 
timestamp folder, but we sometimes receive multiple files for a single 
timestamp. To process the data in bulk, we need to create a unique row-level 
id for each file, i.e. group_by the partition columns and file_name, then add 
a row_number. If we pick up any issues with the data, we need to be able to 
pinpoint the exact file where the issue occurred so that we can report it back 
to the client.
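
As a rough sketch of the step I mean: the data frame below is a made-up 
stand-in for a collected dataset, and it assumes the file_name column already 
exists, which is exactly what this issue asks for. The partition columns 
(client, timestamp) and the row_id name are hypothetical.

```r
library(dplyr)

# Hypothetical stand-in for the collected dataset: two partition columns
# plus the file_name column that this issue requests.
raw <- data.frame(
  client    = c("a", "a", "a", "a", "b"),
  timestamp = c("20220101", "20220101", "20220101", "20220102", "20220101"),
  file_name = c("f1.csv", "f1.csv", "f2.csv", "f1.csv", "f1.csv"),
  value     = c(10, 11, 12, 13, 14),
  stringsAsFactors = FALSE
)

# Number the rows within each (partition columns, file_name) group so that
# every row can be traced back to the file it came from.
with_ids <- raw %>%
  group_by(client, timestamp, file_name) %>%
  mutate(row_id = row_number()) %>%
  ungroup()
```

With row_id in place, a bad record can be identified by its partition values, 
file_name and row_id.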

I hope this makes sense.

> [R] open_dataset - add file_name as column
> ------------------------------------------
>
>                 Key: ARROW-15260
>                 URL: https://issues.apache.org/jira/browse/ARROW-15260
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: R
>            Reporter: Martin du Toit
>            Priority: Minor
>
> Hi. Is it possible to add the file_name as a column to a dataset?
> {code:r}
> ds <- open_dataset(.....)
> list_of_files <- ds$files
> {code}
> This works, but I need the file_name as a column.
> Thanks
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
