[
https://issues.apache.org/jira/browse/ARROW-15260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17470024#comment-17470024
]
Weston Pace commented on ARROW-15260:
-------------------------------------
From a C++ perspective we've got many of the pieces needed already. One
challenge is that the datasets API is written to work with "fragments" and not
"files". For example, a dataset might be an in-memory table in which case we
are working with InMemoryFragment and not FileFragment so there is no concept
of "filename".
That being said, the low-level ScanBatchesAsync method actually returns a
generator of TaggedRecordBatch for this very purpose. A TaggedRecordBatch is a
struct that holds the record batch together with the source fragment for that
record batch.
So if you were to execute a scan, you could inspect the fragment of each
tagged batch and, if it is a FileFragment, extract the filename from it.
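To make the idea of inspecting the fragment concrete, here is a minimal,
self-contained Python sketch. The classes below are simplified stand-ins that
only borrow the names used above (Fragment, InMemoryFragment, FileFragment,
TaggedRecordBatch); they are not the real Arrow classes, and the helper names
source_name and batches_with_source are made up for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional, Tuple


# Stand-ins only: these mimic the shape described in the comment and are
# NOT the actual Arrow C++ (or pyarrow) types.
@dataclass
class Fragment:
    pass


@dataclass
class InMemoryFragment(Fragment):
    pass  # backed by an in-memory table, so it has no filename


@dataclass
class FileFragment(Fragment):
    path: str  # assumed attribute holding the file's path


@dataclass
class TaggedRecordBatch:
    record_batch: list  # stand-in for a record batch
    fragment: Fragment  # the fragment the batch was read from


def source_name(tagged: TaggedRecordBatch) -> Optional[str]:
    """Return the filename for file-backed fragments, otherwise None."""
    fragment = tagged.fragment
    if isinstance(fragment, FileFragment):
        return fragment.path
    return None


def batches_with_source(
    scan: Iterable[TaggedRecordBatch],
) -> Iterator[Tuple[list, Optional[str]]]:
    """Pair every scanned batch with the name of its source file, if any."""
    for tagged in scan:
        yield tagged.record_batch, source_name(tagged)
```

With this shape, a batch coming from an InMemoryFragment simply yields None
for the filename instead of failing.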
Another challenge is that R is moving towards more and more access through an
exec plan rather than using a scanner directly. In order for that to work we
would need to augment the scan results with the filename in C++ before sending
them into the exec plan. Luckily, we already do a bit of this. We currently
augment the scan results with the fragment index, the batch index, and whether
the batch is the last batch in the fragment.
Since ExecBatch can work with constants efficiently, I don't think there will
be much performance cost in always including the filename. So the work
remaining is simply to add a new augmented field __fragment_source_name which
is always attached if the underlying fragment is a FileFragment. Then users
can get this field if they want by including "__fragment_source_name" in the
list of columns they query for.
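A rough Python sketch of that proposal follows. It is only a model of the
idea, not Arrow code: a batch is represented as a plain dict, and the helpers
augment and project are hypothetical. The filename is stored once as a
constant and is only expanded to a full column when the caller asks for
"__fragment_source_name". The existing augmented fields are left out for
brevity.

```python
from typing import Any, Dict, List, Optional

SOURCE_FIELD = "__fragment_source_name"


def augment(columns: Dict[str, List[Any]],
            source_name: Optional[str]) -> Dict[str, Any]:
    """Attach the source name as a single constant, not a per-row column."""
    length = len(next(iter(columns.values()))) if columns else 0
    constants: Dict[str, Any] = {}
    if source_name is not None:  # only file-backed fragments have a name
        constants[SOURCE_FIELD] = source_name
    return {"length": length, "columns": dict(columns), "constants": constants}


def project(batch: Dict[str, Any], names: List[str]) -> Dict[str, List[Any]]:
    """Select the requested columns, expanding constants only on demand."""
    out: Dict[str, List[Any]] = {}
    for name in names:
        if name in batch["columns"]:
            out[name] = batch["columns"][name]
        elif name in batch["constants"]:
            out[name] = [batch["constants"][name]] * batch["length"]
        else:
            raise KeyError(name)
    return out
```

Queries that never mention the field pay only for the one stored constant,
which is the reason always attaching it should be cheap.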
> [R] open_dataset - add file_name as column
> ------------------------------------------
>
> Key: ARROW-15260
> URL: https://issues.apache.org/jira/browse/ARROW-15260
> Project: Apache Arrow
> Issue Type: New Feature
> Components: R
> Reporter: Martin du Toit
> Priority: Minor
>
> Hi. Is it possible to add the file_name as a column to a dataset?
> {code:r}
> ds <- open_dataset(.....)
> list_of_files <- ds$files
> {code}
> This works, but I need the file_name as a column.
> Thanks
>
--
This message was sent by Atlassian Jira
(v8.20.1#820001)