[
https://issues.apache.org/jira/browse/ARROW-14730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17469492#comment-17469492
]
Will Jones commented on ARROW-14730:
------------------------------------
I spent some time with the delta-rs Python library and made [a PR to improve
the dataset factory there|https://github.com/delta-io/delta-rs/pull/525]. The
API we expose actually made this quite easy to build. I think it should be easy
to create C++ and R bindings in delta-rs.
The one catch is that, as currently implemented, the delta log is read using
the Rust Arrow parquet and json readers, while the table files are read through
the C++ ones. I don't foresee that being a problem for Python or R users,
though I could see C++ users objecting to that complexity. If we wanted to, we
could do what [~houqp] suggests above and separate out IO so that the C++
implementation uses its own IO code.
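
To make that split concrete, here is a minimal sketch (not delta-rs code) of what reading the delta log amounts to: replaying the newline-delimited JSON commits under `_delta_log` to get the set of live data files. It assumes only JSON commits with `add` and `remove` actions and ignores Parquet checkpoints, protocol and metadata actions, and everything else a real reader handles. The names `active_files`, `write_commit`, and `demo` are hypothetical and exist only for this illustration.

```python
import json
import tempfile
from pathlib import Path


def active_files(table_root):
    """Replay JSON commits in _delta_log and return {path: partitionValues}."""
    log_dir = Path(table_root) / "_delta_log"
    files = {}
    # Commit files are named with zero-padded versions, so name order is commit order.
    for commit in sorted(log_dir.glob("*.json")):
        for line in commit.read_text().splitlines():
            if not line.strip():
                continue
            action = json.loads(line)
            if "add" in action:
                add = action["add"]
                files[add["path"]] = add.get("partitionValues", {})
            elif "remove" in action:
                # A remove action drops a file added by an earlier commit.
                files.pop(action["remove"]["path"], None)
    return files


def write_commit(table_root, version, actions):
    """Hypothetical helper: write one commit file for the demo below."""
    log_dir = Path(table_root) / "_delta_log"
    log_dir.mkdir(parents=True, exist_ok=True)
    body = "\n".join(json.dumps(a) for a in actions)
    (log_dir / f"{version:020d}.json").write_text(body)


def demo():
    """Build a two-commit log in a temp directory and replay it."""
    with tempfile.TemporaryDirectory() as root:
        write_commit(root, 0, [
            {"add": {"path": "year=2020/a.parquet",
                     "partitionValues": {"year": "2020"}}},
            {"add": {"path": "year=2021/b.parquet",
                     "partitionValues": {"year": "2021"}}},
        ])
        write_commit(root, 1, [
            {"remove": {"path": "year=2020/a.parquet"}},
            {"add": {"path": "year=2020/c.parquet",
                     "partitionValues": {"year": "2020"}}},
        ])
        return active_files(root)
```

The paths that come back are what would then be handed to the C++ Parquet reader, so only this log-replay step depends on the Rust readers in the current design.
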
I think we have a decent path forward within the delta-rs repo. If there
are no objections, I can create follow-up issues in that repo and close this
issue.
> [C++][R][Python] Support reading from Delta Lake tables
> -------------------------------------------------------
>
> Key: ARROW-14730
> URL: https://issues.apache.org/jira/browse/ARROW-14730
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Will Jones
> Priority: Major
>
> [Delta Lake|https://delta.io/] is a Parquet-based table format that supports ACID
> transactions. It was popularized by Databricks, which uses it as the default
> table format in their platform. Previously, it could only be read from
> Spark, but now there is an effort in
> [delta-rs|https://github.com/delta-io/delta-rs] to make it accessible from
> elsewhere. There is already some integration with DataFusion (see:
> https://github.com/apache/arrow-datafusion/issues/525).
> There is already [a method to read Delta Lake tables into Arrow
> tables in
> Python|https://delta-io.github.io/delta-rs/python/api_reference.html#deltalake.table.DeltaTable.to_pyarrow_table]
> in the delta-rs Python bindings. This includes filtering by partitions.
> Is there a good way we could integrate this functionality with Arrow C++
> Dataset and expose that in Python and R? Would that be something that should
> be implemented in Arrow libraries or in delta-rs?