[jira] [Commented] (ARROW-15317) [R] Expose API to create Dataset from Fragments

Will Jones (Jira) Mon, 31 Jan 2022 15:43:06 -0800


    [ 
https://issues.apache.org/jira/browse/ARROW-15317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17485000#comment-17485000
 ]


Will Jones commented on ARROW-15317:
------------------------------------

To be clear, table formats are not the same as file formats; they are just a 
standard for metadata on a Parquet (or other format) table. Basically, 
they hold a cache of all the useful dataset discovery information (in 
particular, the list of files currently part of the table and their partition 
values), and to use them with datasets we need a way to build a dataset 
directly from pre-created fragments.

As for new file formats, it's not clear to me that those are pluggable. But I 
could see a world where a third-party package implements some FileFormat 
object that holds the implementation for reading a single fragment of the 
data into Arrow.

> [R] Expose API to create Dataset from Fragments
> -----------------------------------------------
>
>                 Key: ARROW-15317
>                 URL: https://issues.apache.org/jira/browse/ARROW-15317
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: R
>    Affects Versions: 6.0.1
>            Reporter: Will Jones
>            Priority: Minor
>
> Third-party packages may define dataset factories for table formats like 
> Delta Lake and Apache Iceberg. These formats store metadata like schema, file 
> lists, and file-level statistics on the side, and can construct a dataset 
> without needing a discovery process. Python exposes enough API to do this 
> successfully for [a Delta Lake dataset reader 
> here|https://github.com/delta-io/delta-rs/blob/6a8195d6e3cbdcb0c58a14a3ffccc472dd094de0/python/deltalake/table.py#L267-L280].
> I propose adding the following to the R API:
>  * Expose {{Fragment}} as an R6 object
>  * Add the {{MakeFragment}} method to various file format objects. It's key 
> that {{partition_expression}} is included as an argument. ([See Python 
> equivalent 
> here|https://github.com/apache/arrow/blob/ab86daf3f7c8a67bee6a175a749575fd40417d27/python/pyarrow/_dataset_parquet.pyx#L209-L210])
>  * Add a dataset constructor that takes a list of {{Fragments}}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
