[jira] [Commented] (ARROW-15317) [R] Expose API to create Dataset from Fragments

Will Jones (Jira) Mon, 31 Jan 2022 18:44:06 -0800


    [ 
https://issues.apache.org/jira/browse/ARROW-15317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17485028#comment-17485028
 ]


Will Jones commented on ARROW-15317:
------------------------------------

{quote}If we go this route are we effectively defining yet another table 
format? Albeit a rather limited one.
{quote}
I think of datasets as lower-level than a table format, and in my experience 
the file reader / writer is decoupled from the table format reader / writer. 
Table formats implement:
 * A serialization of ACID transaction information, and a protocol for how to 
handle writes
 * Metadata (e.g. table name, description) storage, including possible 
integration with Data Catalogs (e.g. AWS Glue)
 * Protocols for table maintenance (cleaning up old files, compacting files)
 * Table versioning / time travel

That's all very different from the scope of datasets, right?

The path I'm experimenting with right now is implementing a reader and writer 
for Delta Lake on top of datasets within delta-rs/python. That's been pretty 
doable with the reader, and seems like it wouldn't require [that many changes 
for a writer|https://github.com/delta-io/delta-rs/issues/542#issue-1099890585].

{quote}Also wandering along this path you also might brainstorm/encounter "A 
stable C ABI for datasets".
{quote}

I think that would be awesome, particularly for delta-rs or any other Rust 
project. Though the tough part would be expressions, which might be solved by 
Substrait?

> [R] Expose API to create Dataset from Fragments
> -----------------------------------------------
>
>                 Key: ARROW-15317
>                 URL: https://issues.apache.org/jira/browse/ARROW-15317
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: R
>    Affects Versions: 6.0.1
>            Reporter: Will Jones
>            Priority: Minor
>
> Third-party packages may define dataset factories for table formats like 
> Delta Lake and Apache Iceberg. These formats store metadata like schema, file 
> lists, and file-level statistics on the side, so a dataset can be constructed 
> without a discovery process. Python exposes enough API to do this 
> successfully for [a Delta Lake dataset reader 
> here|https://github.com/delta-io/delta-rs/blob/6a8195d6e3cbdcb0c58a14a3ffccc472dd094de0/python/deltalake/table.py#L267-L280].
> I propose adding the following to the R API:
>  * Expose {{Fragment}} as an R6 object
>  * Add the {{MakeFragment}} method to various file format objects. It's key 
> that {{partition_expression}} is included as an argument. ([See Python 
> equivalent 
> here|https://github.com/apache/arrow/blob/ab86daf3f7c8a67bee6a175a749575fd40417d27/python/pyarrow/_dataset_parquet.pyx#L209-L210])
>  * Add a dataset constructor that takes a list of {{Fragments}}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
