[ https://issues.apache.org/jira/browse/ARROW-15317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17486673#comment-17486673 ]
Dewey Dunnington commented on ARROW-15317:
------------------------------------------
The use case I had in mind is read-only: a user wants to query a dataset that
somebody has provided as a few thousand shapefiles. If we're in another R
package (which we should be for something like this), we'd need C linkage, but
there's no need for writing or for filter expressions (Will has a good point
that Substrait would let you provide one). The existing C ABI would let you do
`int get_arrow_array_stream(const char* key, struct ArrowArrayStream* result, struct ErrorInfo* error_info)`,
so I would be using Arrow for its own awesome filtering instead of trying to
provide it myself. Perhaps that's far too simplistic, and perhaps this is
straying too far from exposing the {{Fragment}} in the R package. I'm new to
all of this!
> [R] Expose API to create Dataset from Fragments
> -----------------------------------------------
>
> Key: ARROW-15317
> URL: https://issues.apache.org/jira/browse/ARROW-15317
> Project: Apache Arrow
> Issue Type: Improvement
> Components: R
> Affects Versions: 6.0.1
> Reporter: Will Jones
> Priority: Minor
>
> Third-party packages may define dataset factories for table formats like
> Delta Lake and Apache Iceberg. These formats store metadata such as the
> schema, file lists, and file-level statistics on the side, so they can
> construct a dataset without needing a discovery process. Python exposes
> enough API to do this successfully for [a Delta Lake dataset reader
> here|https://github.com/delta-io/delta-rs/blob/6a8195d6e3cbdcb0c58a14a3ffccc472dd094de0/python/deltalake/table.py#L267-L280].
> I propose adding the following to the R API:
> * Expose {{Fragment}} as an R6 object
> * Add the {{MakeFragment}} method to various file format objects. It's key
> that {{partition_expression}} is included as an argument. ([See Python
> equivalent
> here|https://github.com/apache/arrow/blob/ab86daf3f7c8a67bee6a175a749575fd40417d27/python/pyarrow/_dataset_parquet.pyx#L209-L210])
> * Add a dataset constructor that takes a list of {{Fragments}}