[
https://issues.apache.org/jira/browse/ARROW-15281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated ARROW-15281:
-----------------------------------
Labels: dataset pull-request-available query-engine (was: dataset
query-engine)
> [C++] Implement ability to retrieve fragment filename
> -----------------------------------------------------
>
> Key: ARROW-15281
> URL: https://issues.apache.org/jira/browse/ARROW-15281
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Nicola Crane
> Assignee: Sanjiban Sengupta
> Priority: Major
> Labels: dataset, pull-request-available, query-engine
> Time Spent: 10m
> Remaining Estimate: 0h
>
> A user has requested the ability to include the filename of the CSV in the
> dataset output - see discussion on ARROW-15260 for more context.
> Relevant info from that ticket:
>
> "From a C++ perspective we've got many of the pieces needed already. One
> challenge is that the datasets API is written to work with "fragments" and
> not "files". For example, a dataset might be an in-memory table in which case
> we are working with InMemoryFragment and not FileFragment so there is no
> concept of "filename".
> That being said, the low level ScanBatchesAsync method actually returns a
> generator of TaggedRecordBatch for this very purpose. A TaggedRecordBatch is
> a struct with the record batch as well as the source fragment for that record
> batch.
> So if you were to execute scan, you could inspect the fragment and, if it is
> a FileFragment, you could extract the filename.
> Another challenge is that R is moving towards more and more access through an
> exec plan and not directly using a scanner. In order for that to work we
> would need to augment the scan results with the filename in C++ before
> sending into the exec plan. Luckily, we already do this a bit as well. We
> currently augment the scan results with fragment index, batch index, and
> whether the batch is the last batch in the fragment.
> Since ExecBatch can work with constants efficiently I don't think there will
> be much performance cost in always including the filename. So the work
> remaining is simply to add a new augmented field __fragment_source_name
> which is always attached if the underlying fragment is a filename. Then users
> can get this field if they want by including "__fragment_source_name" in
> the list of columns they query for."
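
To make the proposal above concrete, here is a minimal, self-contained C++17
sketch of the idea. It does not use the real Arrow headers: Fragment,
FileFragment, InMemoryFragment, RecordBatch and TaggedRecordBatch below are
simplified stand-ins, and SourceName, AugmentedBatch and Augment are
hypothetical names chosen for illustration. It also assumes that batches from
the same fragment arrive contiguously, which is a simplification of an
asynchronous scan. The actual implementation may look quite different.

```cpp
#include <cstddef>
#include <memory>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Simplified stand-ins for illustration only; these are NOT Arrow's classes.
struct Fragment {
  virtual ~Fragment() = default;
};

// A fragment backed by an in-memory table: it has no filename.
struct InMemoryFragment : Fragment {};

// A fragment backed by a file on some filesystem.
struct FileFragment : Fragment {
  explicit FileFragment(std::string p) : path(std::move(p)) {}
  std::string path;
};

struct RecordBatch {
  int num_rows = 0;
};

// A record batch together with the fragment it was read from.
struct TaggedRecordBatch {
  RecordBatch batch;
  std::shared_ptr<Fragment> fragment;
};

// Hypothetical helper: returns the path if the fragment is file-backed,
// and std::nullopt for any other kind of fragment (or a null pointer).
std::optional<std::string> SourceName(const Fragment* fragment) {
  if (const auto* file = dynamic_cast<const FileFragment*>(fragment)) {
    return file->path;
  }
  return std::nullopt;
}

// Hypothetical augmented batch: the existing augmented values (fragment
// index, batch index, last-in-fragment flag) plus the proposed source name.
struct AugmentedBatch {
  RecordBatch batch;
  int fragment_index;
  int batch_index;
  bool last_in_fragment;
  std::optional<std::string> fragment_source_name;
};

// Assumption: batches belonging to one fragment are adjacent in `scanned`.
std::vector<AugmentedBatch> Augment(
    const std::vector<TaggedRecordBatch>& scanned) {
  std::vector<AugmentedBatch> out;
  out.reserve(scanned.size());
  int fragment_index = -1;
  int batch_index = 0;
  const Fragment* current = nullptr;
  for (std::size_t i = 0; i < scanned.size(); ++i) {
    const Fragment* frag = scanned[i].fragment.get();
    if (i == 0 || frag != current) {
      // First batch of a new fragment.
      ++fragment_index;
      batch_index = 0;
      current = frag;
    } else {
      ++batch_index;
    }
    // The batch is the last of its fragment if the scan ends here or the
    // next batch comes from a different fragment.
    const bool last = (i + 1 == scanned.size()) ||
                      (scanned[i + 1].fragment.get() != frag);
    out.push_back(AugmentedBatch{scanned[i].batch, fragment_index,
                                 batch_index, last, SourceName(frag)});
  }
  return out;
}
```

In this sketch the source name is the same for every row of a batch, which is
why it can be carried as a single constant value per batch rather than as a
full column. Batches from an InMemoryFragment simply have no source name.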
--
This message was sent by Atlassian Jira
(v8.20.1#820001)