[GitHub] [arrow-site] thisisnic commented on a diff in pull request #300: [Website] Version 11.0.0 blog post
thisisnic commented on code in PR #300: URL: https://github.com/apache/arrow-site/pull/300#discussion_r1082814781 ## _posts/2023-01-18-11.0.0-release.md: ## @@ -0,0 +1,119 @@ +--- +layout: post +title: "Apache Arrow 11.0.0 Release" +date: "2023-01-18 00:00:00" +author: pmc +categories: [release] +--- + + + +The Apache Arrow team is pleased to announce the 11.0.0 release. This covers +over 3 months of development work and includes [**423 resolved issues**][1] +from [**95 distinct contributors**][2]. See the [Install Page](https://arrow.apache.org/install/) +to learn how to get the libraries for your platform. + +The release notes below are not exhaustive and only expose selected highlights +of the release. Many other bugfixes and improvements have been made: we refer +you to the [complete changelog][3]. + +## Community + +Since the 10.0.0 release, Ben Baumgold, Will Jones, Eric Patrick Hanson, +Curtis Vogt, Yang Jiang, Jarrett Revels, Raúl Cumplido, Jacob Wujciak, +Jie Wen and Brent Gardner have been invited to be committers. +Kun Liu has joined the Project Management Committee (PMC). + +As per our newly started tradition of rotating the PMC chair once a year, +Andrew Lamb was elected as the new PMC chair and VP. + +Thanks for your contributions and participation in the project! + +## Columnar Format Notes + +## Arrow Flight RPC notes + +In the C++/Python Flight clients, DoAction now properly streams the results, instead of blocking until the call finishes. Applications that did not consume the iterator before should fully consume the result. ([#15069](https://github.com/apache/arrow/issues/15069)) + +## C++ notes + +## C# notes + +No major changes to C#. 
 + +## Go notes +* Go's benchmarks will now get added to [Conbench](https://conbench.ursa.dev) alongside the benchmarks for other implementations [GH-32983](https://github.com/apache/arrow/issues/32983) +* Exposed FlightService_ServiceDesc and RegisterFlightServiceServer to allow easily incorporating a flight service into an existing gRPC server [GH-15174](https://github.com/apache/arrow/issues/15174) + +### Arrow +* Function `ApproxEquals` was implemented for scalar values [GH-29581](https://github.com/apache/arrow/issues/29581) +* `UnmarshalJSON` for the `RecordBuilder` now properly handles extra unknown fields with complex/nested values [GH-31840](https://github.com/apache/arrow/issues/31840) +* Decimal128 and Decimal256 type support has been added to the CSV reader [GH-33111](https://github.com/apache/arrow/issues/33111) +* Fixed bug in `array.UnionBuilder` where the `Len` method always returned 0 [GH-14775](https://github.com/apache/arrow/issues/14775) +* Fixed bug in handling slices of Map arrays when marshalling to JSON and for IPC [GH-14780](https://github.com/apache/arrow/issues/14780) +* Fixed memory leak when compressing IPC message body buffers [GH-14883](https://github.com/apache/arrow/issues/14883) +* Added the ability to easily append scalar values to array builders [GH-15005](https://github.com/apache/arrow/issues/15005) + +### Compute +* Scalar binary (add/subtract/multiply/divide/etc.) and unary arithmetic (abs/neg/sqrt/sign/etc.) has been implemented for the compute package [GH-33086](https://github.com/apache/arrow/issues/33086); this includes convenience functions such as `compute.Add` and `compute.Divide` +* Scalar boolean functions such as AND/OR/XOR 
have been implemented for compute [GH-33279](https://github.com/apache/arrow/issues/33279) +* Scalar comparison function kernels have been implemented for compute (equal/greater/greater_equal/less/less_equal) [GH-33308](https://github.com/apache/arrow/issues/33308) +* Scalar compute functions are compatible with dictionary-encoded arrays by casting them to their value types [GH-33502](https://github.com/apache/arrow/issues/33502) + +### Parquet +* Fixed a panic when decoding a delta_bit_packed encoded column [GH-33483](https://github.com/apache/arrow/issues/33483) +* Fixed memory leak from Allocator in `pqarrow.WriteArrowToColumn` [GH-14865](https://github.com/apache/arrow/issues/14865) +* Fixed `writer.WriteBatch` to properly handle writing encrypted Parquet columns and no longer silently fail, but instead propagate an error [GH-14940](https://github.com/apache/arrow/issues/14940) + +## Java notes + +## JavaScript notes + +* Bugfixes and dependency updates. +* Arrow now requires BigInt support. [GH-33681](https://github.com/apache/arrow/pull/33682) + +## Python notes + +New features and improvements: + +* NumPy conversion for ListArray is improved, taking the sliced offset into account [(GH-20512)](https://github.com/apache/arrow/issues/20512) +* DataFrame Interchange Protocol is implemented for ``pyarrow.Table`` ([GH-33346](https://github.com/apache/arrow/issues/33346)). + +## R notes + +For more on what’s in the 11.0.0 R package, see the [R changelog][4]. Review Comment: ```suggestion * map_batches() is lazy by default; it now
[GitHub] [arrow-site] raulcd commented on a diff in pull request #300: [Website] Version 11.0.0 blog post
raulcd commented on code in PR #300: URL: https://github.com/apache/arrow-site/pull/300#discussion_r1082800466 ## _posts/2023-01-18-11.0.0-release.md: ## @@ -0,0 +1,82 @@ +--- +layout: post +title: "Apache Arrow 11.0.0 Release" +date: "2023-01-18 00:00:00" +author: pmc +categories: [release] +--- + + + +The Apache Arrow team is pleased to announce the 11.0.0 release. This covers +over 3 months of development work and includes [**423 resolved issues**][1] +from [**95 distinct contributors**][2]. See the [Install Page](https://arrow.apache.org/install/) +to learn how to get the libraries for your platform. + +The release notes below are not exhaustive and only expose selected highlights +of the release. Many other bugfixes and improvements have been made: we refer +you to the [complete changelog][3]. + +## Community + +Since the 10.0.0 release, Ben Baumgold, Will Jones, Eric Patrick Hanson, +Curtis Vogt, Yang Jiang, Jarrett Revels, Raúl Cumplido, Jacob Wujciak, +Jie Wen and Brent Gardner have been invited to be committers. +Kun Liu have joined the Project Management Committee (PMC). Review Comment: Thanks, I've added a note about the new PMC chair. I've mainly taken it from the announcement email. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow-site] alamb merged pull request #294: [WEBSITE] DataFusion 16.0.0 blog post
alamb merged PR #294: URL: https://github.com/apache/arrow-site/pull/294
[GitHub] [arrow-site] kou commented on a diff in pull request #300: [Website] Version 11.0.0 blog post
kou commented on code in PR #300: URL: https://github.com/apache/arrow-site/pull/300#discussion_r1080717581 ## _posts/2023-01-18-11.0.0-release.md: ## @@ -0,0 +1,82 @@ +--- +layout: post +title: "Apache Arrow 11.0.0 Release" +date: "2023-01-18 00:00:00" +author: pmc +categories: [release] +--- + + + +The Apache Arrow team is pleased to announce the 11.0.0 release. This covers +over 3 months of development work and includes [**423 resolved issues**][1] +from [**95 distinct contributors**][2]. See the [Install Page](https://arrow.apache.org/install/) +to learn how to get the libraries for your platform. + +The release notes below are not exhaustive and only expose selected highlights +of the release. Many other bugfixes and improvements have been made: we refer +you to the [complete changelog][3]. + +## Community + +Since the 10.0.0 release, Ben Baumgold, Will Jones, Eric Patrick Hanson, +Curtis Vogt, Yang Jiang, Jarrett Revels, Raúl Cumplido, Jacob Wujciak, +Jie Wen and Brent Gardner have been invited to be committers. +Kun Liu have joined the Project Management Committee (PMC). Review Comment: > Could you help me validate that? Valid! > Also, let me know if you want me to add a note about the PMC rotation here. Yes, please.
[GitHub] [arrow-site] alamb commented on a diff in pull request #299: MINOR: [Website] Reword ADBC announcement
alamb commented on code in PR #299: URL: https://github.com/apache/arrow-site/pull/299#discussion_r1073941041 ## _posts/2023-01-05-introducing-arrow-adbc.md: ## @@ -144,7 +144,7 @@ ADBC fills a specific niche that related projects do not address. It is both: Vendor-neutral (database APIs) - Vendor-specific (database protocols) + Database protocols Review Comment: Maybe a better phrase would be "Database specific protocols"
[GitHub] [arrow-site] domoritz commented on a diff in pull request #300: [Website] Version 11.0.0 blog post
domoritz commented on code in PR #300: URL: https://github.com/apache/arrow-site/pull/300#discussion_r1073916810 ## _posts/2023-01-18-11.0.0-release.md: ## @@ -0,0 +1,82 @@ +--- +layout: post +title: "Apache Arrow 11.0.0 Release" +date: "2023-01-18 00:00:00" +author: pmc +categories: [release] +--- + + + +The Apache Arrow team is pleased to announce the 11.0.0 release. This covers +over 3 months of development work and includes [**423 resolved issues**][1] +from [**95 distinct contributors**][2]. See the [Install Page](https://arrow.apache.org/install/) +to learn how to get the libraries for your platform. + +The release notes below are not exhaustive and only expose selected highlights +of the release. Many other bugfixes and improvements have been made: we refer +you to the [complete changelog][3]. + +## Community + +Since the 10.0.0 release, Ben Baumgold, Will Jones, Eric Patrick Hanson, +Curtis Vogt, Yang Jiang, Jarrett Revels, Raúl Cumplido, Jacob Wujciak, +Jie Wen and Brent Gardner have been invited to be committers. +Kun Liu have joined the Project Management Committee (PMC). + +Thanks for your contributions and participation in the project! + +## Columnar Format Notes + +## Arrow Flight RPC notes + +## C++ notes + +## C# notes + +## Go notes + +## Java notes + +## JavaScript notes Review Comment: ```suggestion ## JavaScript notes * Bugfixes and dependency updates. * Arrow now requires BigInt support. [GH-33681](https://github.com/apache/arrow/pull/33682) ```
[GitHub] [arrow-site] eerhardt commented on a diff in pull request #300: [Website] Version 11.0.0 blog post
eerhardt commented on code in PR #300: URL: https://github.com/apache/arrow-site/pull/300#discussion_r1073859493 ## _posts/2023-01-18-11.0.0-release.md: ## @@ -0,0 +1,82 @@ +--- +layout: post +title: "Apache Arrow 11.0.0 Release" +date: "2023-01-18 00:00:00" +author: pmc +categories: [release] +--- + + + +The Apache Arrow team is pleased to announce the 11.0.0 release. This covers +over 3 months of development work and includes [**423 resolved issues**][1] +from [**95 distinct contributors**][2]. See the [Install Page](https://arrow.apache.org/install/) +to learn how to get the libraries for your platform. + +The release notes below are not exhaustive and only expose selected highlights +of the release. Many other bugfixes and improvements have been made: we refer +you to the [complete changelog][3]. + +## Community + +Since the 10.0.0 release, Ben Baumgold, Will Jones, Eric Patrick Hanson, +Curtis Vogt, Yang Jiang, Jarrett Revels, Raúl Cumplido, Jacob Wujciak, +Jie Wen and Brent Gardner have been invited to be committers. +Kun Liu have joined the Project Management Committee (PMC). + +Thanks for your contributions and participation in the project! + +## Columnar Format Notes + +## Arrow Flight RPC notes + +## C++ notes + +## C# notes Review Comment: No, there haven't been any C# changes of note in 11.0.
[GitHub] [arrow-site] zeroshade commented on a diff in pull request #300: [Website] Version 11.0.0 blog post
zeroshade commented on code in PR #300: URL: https://github.com/apache/arrow-site/pull/300#discussion_r1073769521 ## _posts/2023-01-18-11.0.0-release.md: ## @@ -0,0 +1,82 @@ +--- +layout: post +title: "Apache Arrow 11.0.0 Release" +date: "2023-01-18 00:00:00" +author: pmc +categories: [release] +--- + + + +The Apache Arrow team is pleased to announce the 11.0.0 release. This covers +over 3 months of development work and includes [**423 resolved issues**][1] +from [**95 distinct contributors**][2]. See the [Install Page](https://arrow.apache.org/install/) +to learn how to get the libraries for your platform. + +The release notes below are not exhaustive and only expose selected highlights +of the release. Many other bugfixes and improvements have been made: we refer +you to the [complete changelog][3]. + +## Community + +Since the 10.0.0 release, Ben Baumgold, Will Jones, Eric Patrick Hanson, +Curtis Vogt, Yang Jiang, Jarrett Revels, Raúl Cumplido, Jacob Wujciak, +Jie Wen and Brent Gardner have been invited to be committers. +Kun Liu have joined the Project Management Committee (PMC). + +Thanks for your contributions and participation in the project! 
+ +## Columnar Format Notes + +## Arrow Flight RPC notes + +## C++ notes + +## C# notes + +## Go notes Review Comment: ```suggestion ## Go notes * Go's benchmarks will now get added to [Conbench](https://conbench.ursa.dev) alongside the benchmarks for other implementations (GH-32983)[https://github.com/apache/arrow/issues/32983] * Exposed FlightService_ServiceDesc and RegisterFlightServiceServer to allow easily incorporating a flight service into an existing gRPC server (GH-15174)[https://github.com/apache/arrow/issues/15174] ### Arrow * Function `ApproxEquals` was implemented for scalar values (GH-29581)[https://github.com/apache/arrow/issues/29581] * `UnmarshalJSON` for the `RecordBuilder` now properly handles extra unknown fields with complex/nested values (GH-31840)[https://github.com/apache/arrow/issues/31840] * Decimal128 and Decimal256 type support has been added to the CSV reader (GH-33111)[https://github.com/apache/arrow/issues/33111] * Fixed bug in `array.UnionBuilder` where `Len` method always returned 0 (GH-14775)[https://github.com/apache/arrow/issues/14775] * Fixed bug for handling slices of Map arrays when marshalling to JSON and for IPC (GH-14780)[https://github.com/apache/arrow/issues/14780] * Fixed memory leak when compressing IPC message body buffers (GH-14883)[https://github.com/apache/arrow/issues/14883] * Added the ability to easily append scalar values to array builders (GH-15005)[https://github.com/apache/arrow/issues/15005] Compute * Scalar binary (add/subtract/multiply/divide/etc.) and unary arithmetic (abs/neg/sqrt/sign/etc.) has been implemented for the compute package (GH-33086)[https://github.com/apache/arrow/issues/33086] this includes easy functions like `compute.Add` and `compute.Divide` etc. * Scalar boolean functions like AND/OR/XOR/etc. 
have been implemented for compute (GH-33279)[https://github.com/apache/arrow/issues/33279] * Scalar comparison function kernels have been implemented for compute (equal/greater/greater_equal/less/less_equal) (GH-33308)[https://github.com/apache/arrow/issues/33308] * Scalar compute functions are compatible with dictionary encoded arrays by casting them to their value types (GH-33502)[https://github.com/apache/arrow/issues/33502] ### Parquet * Panic when decoding a delta_bit_packed encoded column has been fixed (GH-33483)[https://github.com/apache/arrow/issues/33483] * Fixed memory leak from Allocator in `pqarrow.WriteArrowToColumn` (GH-14865)[https://github.com/apache/arrow/issues/14865] * Fixed `writer.WriteBatch` to properly handle writing encrypted parquet columns and no longer silently fail, but instead propagate an error (GH-14940)[https://github.com/apache/arrow/issues/14940] ```
[GitHub] [arrow-site] lidavidm commented on a diff in pull request #300: [Website] Version 11.0.0 blog post
lidavidm commented on code in PR #300: URL: https://github.com/apache/arrow-site/pull/300#discussion_r1073512952 ## _posts/2023-01-18-11.0.0-release.md: ## @@ -0,0 +1,82 @@ +--- +layout: post +title: "Apache Arrow 11.0.0 Release" +date: "2023-01-18 00:00:00" +author: pmc +categories: [release] +--- + + + +The Apache Arrow team is pleased to announce the 11.0.0 release. This covers +over 3 months of development work and includes [**423 resolved issues**][1] +from [**95 distinct contributors**][2]. See the [Install Page](https://arrow.apache.org/install/) +to learn how to get the libraries for your platform. + +The release notes below are not exhaustive and only expose selected highlights +of the release. Many other bugfixes and improvements have been made: we refer +you to the [complete changelog][3]. + +## Community + +Since the 10.0.0 release, Ben Baumgold, Will Jones, Eric Patrick Hanson, +Curtis Vogt, Yang Jiang, Jarrett Revels, Raúl Cumplido, Jacob Wujciak, +Jie Wen and Brent Gardner have been invited to be committers. +Kun Liu have joined the Project Management Committee (PMC). + +Thanks for your contributions and participation in the project! + +## Columnar Format Notes + +## Arrow Flight RPC notes Review Comment: ```suggestion ## Arrow Flight RPC notes In the C++/Python Flight clients, DoAction now properly streams the results, instead of blocking until the call finishes. Applications that did not consume the iterator before should fully consume the result. ([#15069](https://github.com/apache/arrow/issues/15069)) ```
[GitHub] [arrow-site] AlenkaF commented on a diff in pull request #300: [Website] Version 11.0.0 blog post
AlenkaF commented on code in PR #300: URL: https://github.com/apache/arrow-site/pull/300#discussion_r1073474010 ## _posts/2023-01-18-11.0.0-release.md: ## @@ -0,0 +1,82 @@ +--- +layout: post +title: "Apache Arrow 11.0.0 Release" +date: "2023-01-18 00:00:00" +author: pmc +categories: [release] +--- + + + +The Apache Arrow team is pleased to announce the 11.0.0 release. This covers +over 3 months of development work and includes [**423 resolved issues**][1] +from [**95 distinct contributors**][2]. See the [Install Page](https://arrow.apache.org/install/) +to learn how to get the libraries for your platform. + +The release notes below are not exhaustive and only expose selected highlights +of the release. Many other bugfixes and improvements have been made: we refer +you to the [complete changelog][3]. + +## Community + +Since the 10.0.0 release, Ben Baumgold, Will Jones, Eric Patrick Hanson, +Curtis Vogt, Yang Jiang, Jarrett Revels, Raúl Cumplido, Jacob Wujciak, +Jie Wen and Brent Gardner have been invited to be committers. +Kun Liu have joined the Project Management Committee (PMC). + +Thanks for your contributions and participation in the project! + +## Columnar Format Notes + +## Arrow Flight RPC notes + +## C++ notes + +## C# notes + +## Go notes + +## Java notes + +## JavaScript notes + +## Python notes Review Comment: ```suggestion ## Python notes New features and improvements: * Numpy conversion for ListArray is improved taking into account sliced offset [(GH-20512)](https://github.com/apache/arrow/issues/20512) * DataFrame Interchange Protocol is implemented for ``pyarrow.Table`` ([GH-33346](https://github.com/apache/arrow/issues/33346)). ```
[GitHub] [arrow-site] raulcd commented on a diff in pull request #300: [Website] Version 11.0.0 blog post
raulcd commented on code in PR #300: URL: https://github.com/apache/arrow-site/pull/300#discussion_r1073442447 ## _posts/2023-01-18-11.0.0-release.md: ## @@ -0,0 +1,89 @@ +--- +layout: post +title: "Apache Arrow 11.0.0 Release" +date: "2023-01-18 00:00:00" +author: pmc +categories: [release] +--- + + + +The Apache Arrow team is pleased to announce the 11.0.0 release. This covers +over 3 months of development work and includes [**423 resolved issues**][1] +from [**95 distinct contributors**][2]. See the [Install Page](https://arrow.apache.org/install/) Review Comment: these numbers might vary with final release. @raulcd to validate before publishing. ## _posts/2023-01-18-11.0.0-release.md: ## @@ -0,0 +1,82 @@ +--- +layout: post +title: "Apache Arrow 11.0.0 Release" +date: "2023-01-18 00:00:00" +author: pmc +categories: [release] +--- + + + +The Apache Arrow team is pleased to announce the 11.0.0 release. This covers +over 3 months of development work and includes [**423 resolved issues**][1] +from [**95 distinct contributors**][2]. See the [Install Page](https://arrow.apache.org/install/) +to learn how to get the libraries for your platform. + +The release notes below are not exhaustive and only expose selected highlights +of the release. Many other bugfixes and improvements have been made: we refer +you to the [complete changelog][3]. + +## Community + +Since the 10.0.0 release, Ben Baumgold, Will Jones, Eric Patrick Hanson, +Curtis Vogt, Yang Jiang, Jarrett Revels, Raúl Cumplido, Jacob Wujciak, +Jie Wen and Brent Gardner have been invited to be committers. +Kun Liu have joined the Project Management Committee (PMC). + +Thanks for your contributions and participation in the project! + +## Columnar Format Notes + +## Arrow Flight RPC notes + +## C++ notes Review Comment: @pitrou can you help with the notes? 
## _posts/2023-01-18-11.0.0-release.md: ## @@ -0,0 +1,82 @@ +--- +layout: post +title: "Apache Arrow 11.0.0 Release" +date: "2023-01-18 00:00:00" +author: pmc +categories: [release] +--- + + + +The Apache Arrow team is pleased to announce the 11.0.0 release. This covers +over 3 months of development work and includes [**423 resolved issues**][1] +from [**95 distinct contributors**][2]. See the [Install Page](https://arrow.apache.org/install/) +to learn how to get the libraries for your platform. + +The release notes below are not exhaustive and only expose selected highlights +of the release. Many other bugfixes and improvements have been made: we refer +you to the [complete changelog][3]. + +## Community + +Since the 10.0.0 release, Ben Baumgold, Will Jones, Eric Patrick Hanson, +Curtis Vogt, Yang Jiang, Jarrett Revels, Raúl Cumplido, Jacob Wujciak, +Jie Wen and Brent Gardner have been invited to be committers. +Kun Liu have joined the Project Management Committee (PMC). + +Thanks for your contributions and participation in the project! + +## Columnar Format Notes + +## Arrow Flight RPC notes + +## C++ notes + +## C# notes + +## Go notes + +## Java notes + +## JavaScript notes Review Comment: @domoritz any notes for the 11.0.0 release? ## _posts/2023-01-18-11.0.0-release.md: ## @@ -0,0 +1,82 @@ +--- +layout: post +title: "Apache Arrow 11.0.0 Release" +date: "2023-01-18 00:00:00" +author: pmc +categories: [release] +--- + + + +The Apache Arrow team is pleased to announce the 11.0.0 release. This covers +over 3 months of development work and includes [**423 resolved issues**][1] +from [**95 distinct contributors**][2]. See the [Install Page](https://arrow.apache.org/install/) +to learn how to get the libraries for your platform. + +The release notes below are not exhaustive and only expose selected highlights +of the release. Many other bugfixes and improvements have been made: we refer +you to the [complete changelog][3]. 
+ +## Community + +Since the 10.0.0 release, Ben Baumgold, Will Jones, Eric Patrick Hanson, +Curtis Vogt, Yang Jiang, Jarrett Revels, Raúl Cumplido, Jacob Wujciak, +Jie Wen and Brent Gardner have been invited to be committers. +Kun Liu have joined the Project Management Committee (PMC). + +Thanks for your contributions and participation in the project! + +## Columnar Format Notes + +## Arrow Flight RPC notes Review Comment: @lidavidm can you help with the release notes? ## _posts/2023-01-18-11.0.0-release.md: ## @@ -0,0 +1,82 @@ +--- +layout: post +title: "Apache Arrow 11.0.0 Release" +date: "2023-01-18 00:00:00" +author: pmc +categories: [release] +--- + + + +The Apache Arrow team is pleased to announce the 11.0.0 release. This covers +over 3 months of development work and includes [**423 resolved issues**][1] +from [**95 distinct contributors**][2]. See the [Install Page](https://arrow.apache.org/install/) +to learn how to get the libraries for your platform. + +The release notes below are not exhaustive and only expose selected highlights +of the release. Many other
[GitHub] [arrow-site] github-actions[bot] commented on pull request #300: [Website] Version 11.0.0 blog post
github-actions[bot] commented on PR #300: URL: https://github.com/apache/arrow-site/pull/300#issuecomment-1386935967 Thanks for opening a pull request! Could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW Then could you also rename pull request title in the following format? ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY} See also: * [Other pull requests](https://github.com/apache/arrow-site/pulls/) * [Contribution Guidelines - How to contribute patches](https://arrow.apache.org/docs/developers/contributing.html#how-to-contribute-patches)
[GitHub] [arrow-site] raulcd opened a new pull request, #300: [Website] Version 11.0.0 blog post
raulcd opened a new pull request, #300: URL: https://github.com/apache/arrow-site/pull/300 PR to start adding the blog post information for the Release 11.0.0
[GitHub] [arrow-site] lidavidm commented on a diff in pull request #299: MINOR: [Website] Reword ADBC announcement
lidavidm commented on code in PR #299: URL: https://github.com/apache/arrow-site/pull/299#discussion_r1072916029 ## _posts/2023-01-05-introducing-arrow-adbc.md: ## @@ -66,10 +66,10 @@ Developers have a few options: Libraries like [Turbodbc][turbodbc] and [arrow-jdbc][arrow-jdbc] handle row-to-columnar conversions for clients. But this doesn't fundamentally solve the problem. Unnecessary data conversions are still required. -- *Use vendor-specific protocols*. - For some databases, applications can use a database-specific protocol or SDK to directly get Arrow data. - For example, applications could use Dremio via [Arrow Flight SQL][flight-sql]. - But client applications that want to support multiple database vendors would need to integrate with each of them. +- *Directly use database protocols*. + For some databases, applications can use a database protocol or SDK to directly get Arrow data. + For example, applications could use be written with [Arrow Flight SQL][flight-sql] to connect to Dremio and other databases that support the Flight SQL protocol. Review Comment: ```suggestion For example, applications could use [Arrow Flight SQL][flight-sql] to connect to Dremio and other databases that support the Flight SQL protocol. ``` (If you want, I think it's fair to link "Dremio" to the website as well.) ## _posts/2023-01-05-introducing-arrow-adbc.md: ## @@ -144,7 +144,7 @@ ADBC fills a specific niche that related projects do not address. It is both: Vendor-neutral (database APIs) - Vendor-specific (database protocols) + Database protocols Review Comment: I think it's still fair to call them vendor-specific; after all, multiple databases also use the PostgreSQL protocol (it just doesn't have a generic name). Maybe "varies by vendor (database protocols)"? ## _posts/2023-01-05-introducing-arrow-adbc.md: ## @@ -66,10 +66,10 @@ Developers have a few options: Libraries like [Turbodbc][turbodbc] and [arrow-jdbc][arrow-jdbc] handle row-to-columnar conversions for clients. 
But this doesn't fundamentally solve the problem. Unnecessary data conversions are still required. -- *Use vendor-specific protocols*. - For some databases, applications can use a database-specific protocol or SDK to directly get Arrow data. - For example, applications could use Dremio via [Arrow Flight SQL][flight-sql]. - But client applications that want to support multiple database vendors would need to integrate with each of them. +- *Directly use database protocols*. + For some databases, applications can use a database protocol or SDK to directly get Arrow data. + For example, applications could use be written with [Arrow Flight SQL][flight-sql] to connect to Dremio and other databases that support the Flight SQL protocol. + But not all databases support the Flight SQL protocol. An example is Google BigQuery, which has a separate SDK that returns Arrow data. In this case, client applications that want to support additional protocols would need to integrate with each of them. Review Comment: ```suggestion But not all databases support Flight SQL, even if they support Arrow data. An example is Google BigQuery, which has a separate SDK that returns Arrow data. In this case, client applications that want to support additional databases would need to integrate with each of their protocols. ```
[GitHub] [arrow-site] github-actions[bot] commented on pull request #299: MINOR: [Website] Reword ADBC announcement
github-actions[bot] commented on PR #299: URL: https://github.com/apache/arrow-site/pull/299#issuecomment-1386176466 Thanks for opening a pull request! Could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW Then could you also rename pull request title in the following format? ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY} See also: * [Other pull requests](https://github.com/apache/arrow-site/pulls/) * [Contribution Guidelines - How to contribute patches](https://arrow.apache.org/docs/developers/contributing.html#how-to-contribute-patches)
[GitHub] [arrow-site] jduo opened a new pull request, #299: MINOR: [Website] Reword ADBC announcement
jduo opened a new pull request, #299: URL: https://github.com/apache/arrow-site/pull/299 Reword the ADBC announcement such that Flight SQL is more clearly specified as being database-agnostic rather than vendor-specific.
[GitHub] [arrow-site] alamb commented on pull request #294: [WEBSITE] DataFusion 16.0.0 blog post
alamb commented on PR #294: URL: https://github.com/apache/arrow-site/pull/294#issuecomment-1386065374 I plan to merge this tomorrow unless there are any other comments.
[GitHub] [arrow-site] alamb merged pull request #298: [WEBSITE]: Add Jie Wen / jackwener to commiters list
alamb merged PR #298: URL: https://github.com/apache/arrow-site/pull/298
[GitHub] [arrow-site] alamb merged pull request #297: [WEBSITE]: Add Brent Gardner / avantgardnerio to committers list
alamb merged PR #297: URL: https://github.com/apache/arrow-site/pull/297
[GitHub] [arrow-site] alamb commented on a diff in pull request #298: [WEBSITE]: Add Jie Wen / jackwener to commiters list
alamb commented on code in PR #298: URL: https://github.com/apache/arrow-site/pull/298#discussion_r1070352916 ## _data/committers.yml: ## @@ -288,6 +288,10 @@ role: Committer alias: jiayuliu affiliation: Airbnb Inc. +- name: Jie Wen + role: Committer + alias: jackwener + affiliation: TBD Review Comment: ```suggestion affiliation: SelectDB ```
[GitHub] [arrow-site] alamb commented on a diff in pull request #297: [WEBSITE]: Add Brent Gardner / avantgardnerio to committers list
alamb commented on code in PR #297: URL: https://github.com/apache/arrow-site/pull/297#discussion_r1070348733 ## _data/committers.yml: ## @@ -220,6 +220,10 @@ role: Committer alias: bkamins affiliation: SGH Warsaw School of Economics +- name: Brent Gardner + role: Committer + alias: avantgardnerio + affiliation: TDB Review Comment: ```suggestion affiliation: Space and Time ```
[GitHub] [arrow-site] avantgardnerio commented on pull request #297: [WEBSITE]: Add Brent Gardner / avantgardnerio to committers list
avantgardnerio commented on PR #297: URL: https://github.com/apache/arrow-site/pull/297#issuecomment-1382846564 > would you like your affiliation to be? Space and Time is sponsoring me, so it seems appropriate they get listed.
[GitHub] [arrow-site] github-actions[bot] commented on pull request #298: [WEBSITE]: Add Jie Wen / jackwener to commiters list
github-actions[bot] commented on PR #298: URL: https://github.com/apache/arrow-site/pull/298#issuecomment-1382699238
[GitHub] [arrow-site] alamb opened a new pull request, #298: [WEBSITE]: Add Jie Wen / jackwener to commiters list
alamb opened a new pull request, #298: URL: https://github.com/apache/arrow-site/pull/298 Update https://arrow.apache.org/committers/ Per https://lists.apache.org/thread/o2jtvwz6v027x7k3pgdrsly2pznbrd3k @jackwener what, if anything, would you like your affiliation to be?
[GitHub] [arrow-site] github-actions[bot] commented on pull request #297: [WEBSITE]: Add Brent Gardner / avantgardnerio to committers list
github-actions[bot] commented on PR #297: URL: https://github.com/apache/arrow-site/pull/297#issuecomment-1382699098
[GitHub] [arrow-site] alamb opened a new pull request, #297: [WEBSITE]: Add Brent Gardner / avantgardnerio to committers list
alamb opened a new pull request, #297: URL: https://github.com/apache/arrow-site/pull/297 Update https://arrow.apache.org/committers/ Per https://lists.apache.org/thread/0cqwzhnftbnbbf3x1o209dnkoz5gbqd3 @avantgardnerio what, if anything, would you like your affiliation to be?
[GitHub] [arrow-site] kou merged pull request #295: [Website] Add links to UKV
kou merged PR #295: URL: https://github.com/apache/arrow-site/pull/295
[GitHub] [arrow-site] kou commented on a diff in pull request #295: [Website] Add links to UKV
kou commented on code in PR #295: URL: https://github.com/apache/arrow-site/pull/295#discussion_r1068676015 ## use_cases.md: ## @@ -81,7 +83,8 @@ and [others]({{ site.baseurl }}/powered_by/) also use Arrow similarly. The Arrow project also defines [Flight]({% post_url 2019-09-30-introducing-arrow-flight %}), a client-server RPC framework to build rich services exchanging data according -to application-defined semantics. +to application-defined semantics. Flight RPC is used by [UKV](https://unum.cloud/ukv) +to exchange tables, documents, and graphs, between server application and client SDKs. Review Comment: The Apache Arrow project provides https://arrow.apache.org/powered_by/ as a page for introducing third-party projects, including their use cases. It's mentioned explicitly: > To add yourself to the list, please open a [pull request](https://github.com/apache/arrow-site/edit/master/powered_by.md) adding your organization name, URL, a list of which Arrow components you are using, and a short description of your use case. But using other pages such as https://arrow.apache.org/use_cases/ for this purpose hasn't been discussed explicitly. If you think the Apache Arrow project should use https://arrow.apache.org/use_cases/ for this purpose too, could you start a discussion on the `d...@arrow.apache.org` mailing list? https://arrow.apache.org/community/ > dev@ is for discussions about contributing to the project development ([subscribe](mailto:dev-subscr...@arrow.apache.org?subject=Subscribe), [unsubscribe](mailto:dev-unsubscr...@arrow.apache.org?subject=Unubscribe), [archives](https://lists.apache.org/list.html?d...@arrow.apache.org)) FYI: The Apache Software Foundation provides suggested practices related to this topic: https://www.apache.org/foundation/marks/linking
[GitHub] [arrow-site] lidavidm merged pull request #296: [Website] Add ADBC release post
lidavidm merged PR #296: URL: https://github.com/apache/arrow-site/pull/296
[GitHub] [arrow-site] lidavidm commented on pull request #296: [Website] Add ADBC release post
lidavidm commented on PR #296: URL: https://github.com/apache/arrow-site/pull/296#issuecomment-1380499228 I'll post this later today if there are no objections.
[GitHub] [arrow-site] ashvardanian commented on a diff in pull request #295: [Website] Add links to UKV
ashvardanian commented on code in PR #295: URL: https://github.com/apache/arrow-site/pull/295#discussion_r1068146656 ## use_cases.md: ## @@ -81,7 +83,8 @@ and [others]({{ site.baseurl }}/powered_by/) also use Arrow similarly. The Arrow project also defines [Flight]({% post_url 2019-09-30-introducing-arrow-flight %}), a client-server RPC framework to build rich services exchanging data according -to application-defined semantics. +to application-defined semantics. Flight RPC is used by [UKV](https://unum.cloud/ukv) +to exchange tables, documents, and graphs, between server application and client SDKs. Review Comment: Reverted all the changes in `use_cases.md`. Apache Spark, Google BigQuery, TensorFlow, and AWS Athena were all mentioned in that paragraph, so I thought it might be the right place to mention UKV as well. We rely on Arrow representations for the same purpose but with a much broader scope than any of the mentioned projects. Maybe we can add the reference another time. Thank you!
[GitHub] [arrow-site] kou commented on a diff in pull request #295: [Website] Add links to UKV
kou commented on code in PR #295: URL: https://github.com/apache/arrow-site/pull/295#discussion_r106832 ## use_cases.md: ## @@ -81,7 +83,8 @@ and [others]({{ site.baseurl }}/powered_by/) also use Arrow similarly. The Arrow project also defines [Flight]({% post_url 2019-09-30-introducing-arrow-flight %}), a client-server RPC framework to build rich services exchanging data according -to application-defined semantics. +to application-defined semantics. Flight RPC is used by [UKV](https://unum.cloud/ukv) +to exchange tables, documents, and graphs, between server application and client SDKs. Review Comment: I meant reverting all the changes in `use_cases.md`.
[GitHub] [arrow-site] andygrove commented on a diff in pull request #294: [WEBSITE] DataFusion 16.0.0 blog post
andygrove commented on code in PR #294: URL: https://github.com/apache/arrow-site/pull/294#discussion_r1067412821 ## _posts/2023-01-07-datafusion-16.0.0.md: ## @@ -0,0 +1,289 @@ +--- +layout: post +title: "Apache Arrow DataFusion 16.0.0 Project Update" +date: "2023-01-07 00:00:00" +author: pmc +categories: [release] +--- + + +# Introduction + +[DataFusion](https://arrow.apache.org/datafusion/) is an extensible +query execution framework, written in [Rust](https://www.rust-lang.org/), +that uses [Apache Arrow](https://arrow.apache.org) as its +in-memory format. It is targeted primarily at developers creating data +intensive analytics, and offers mature +[SQL support](https://arrow.apache.org/datafusion/user-guide/sql/index.html), +a DataFrame API, and many extension points. + +DataFusion based systems perform very well on performance +benchmarks, especially considering they operate on data in parquet +files directly rather than first loading into a specialized format. +Some recent highlights include [clickbench](https://benchmark.clickhouse.com/) +and the [Cloudfuse.io standalone query engines](https://www.cloudfuse.io/dashboards/standalone-engines) page. + +DataFusion is part of a longer term trend, articulated clearly by [Andy Pavlo](http://www.cs.cmu.edu/~pavlo/) in his +[2022 Databases Retrospective](https://ottertune.com/blog/2022-databases-retrospective/). +Database frameworks are proliferating and it is likely that all OLAP DBMSs and other many data heavy applications such as machine learning, will require a vectorized, highly performant query +engine in the next 5 years to remain relevant. +The only practical way to make such technology so widely available +without many millions of dollars of investment is +though open source engine such as DataFusion or [Velox](https://github.com/facebookincubator/velox). + +The rest of this post describes the improvements made to DataFusion +over the last three months and some hints if where we are heading. 
+ +## Community Growth + +The three months since [our last update](https://arrow.apache.org/blog/2022/10/25/datafusion-13.0.0/) again saw significant growth in the DataFusion. +TODO quantify the growth -- e.g. XXX new contributors to the project and regularly merge YYY PRs a day. + +Growth of new systems based on as the engine in [many open source and commercial projects](https://github.com/apache/arrow-datafusion#known-uses) and was one of the early open source projects to provide this capability. + +Several new databases built on datafusion (synnada.ai, greptimedb, probably others) +GA of InfluxDB IOx + + +The DataFusion 16.0.0 release consists of 520 PRs from 70 distinct contributors. This does not count all the work that goes into our dependencies such as [arrow](https://crates.io/crates/arrow), [parquet](https://crates.io/crates/parquet), and [object_store](https://crates.io/crates/object_store), that much of the same community helps nurture. + + + +## Performance + +Performance and efficiency are core value propositions for +DataFusion. While there is still a performance gap between DataFusion best of +breed tightly, integrated systems such as [DuckDB](https://duckdb.org) +and [Polars](https://www.pola.rs/)https://www.pola.rs/), DataFusion is +closing the gap quickly. 
Performance highlights from the last three +months: + +* XX% Faster Sorting and Merging using the new [Row Format](https://arrow.apache.org/blog/2022/11/07/multi-column-sorts-in-arrow-rust-part-1/) +* [Advanced predicate pushdown](https://arrow.apache.org/blog/2022/12/26/querying-parquet-with-millisecond-latency/), directly on parquet, optionally directly from object storage, enabling sub millisecond filtering, directly from object storage +* Improved `IN` expressions significantly faster Simplify InListExpr ~20-70% Faster ([#4057]) +* Sort and partition aware optimizations such as #3969 and #4691, skipping potentially expensive operations +* Basic filter selectivity analysis (#3868) + + +In the coming few months, we plan work on: +* Improved grouping performance (TODO link) +* bloom filtering +* investigate RLE (Run End Encoding support) (todo Arrow link) +* Enable predicate pushdown by default for all cases +* OTHERS? + +## Runtime Resource Limits + +Initially, DataFusion could potentially use unbounded amounts of memory for certain queries that included Sorts, Grouping or Joins. + +In version 16.0.0, it is possible to limit DataFusion's memory usage for Sorting and Grouping. We are looking for help adding similar limiting for Joins as well as expanding our algorithms to spill to secondary storage, if available. See #3941 fore more detail. + + +## SQL Window Function +[SQL window functions](https://en.wikipedia.org/wiki/Window_function_(SQL)) are useful for a variety of analysis and DataFusion's implementation is close to complete now. + +- Custom window frames such as `... OVER (ORDER BY ... RANGE BETWEEN 0.2 PRECEDING AND 0.2
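The window-frame syntax quoted in the draft (`... OVER (ORDER BY ... RANGE BETWEEN ... )`) is standard SQL, so it can be tried with any engine that implements it. A small sketch using Python's built-in sqlite3 (not DataFusion) of a running sum over a `ROWS BETWEEN` frame, analogous to the frames discussed here:

```python
import sqlite3

# Illustrative only: sqlite3 stands in for a SQL engine with window-function
# support; DataFusion accepts the same OVER (... ROWS/RANGE BETWEEN ...) syntax.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (grp TEXT, val REAL)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [("a", 1.0), ("a", 2.0), ("a", 3.0), ("b", 10.0)],
)

# Running sum over a two-row sliding frame within each partition.
rows = conn.execute("""
    SELECT grp, val,
           SUM(val) OVER (PARTITION BY grp ORDER BY val
                          ROWS BETWEEN 1 PRECEDING AND CURRENT ROW) AS running
    FROM t
    ORDER BY grp, val
""").fetchall()
# rows -> [('a', 1.0, 1.0), ('a', 2.0, 3.0), ('a', 3.0, 5.0), ('b', 10.0, 10.0)]
```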
[GitHub] [arrow-site] andygrove commented on a diff in pull request #294: [WEBSITE] DataFusion 16.0.0 blog post
andygrove commented on code in PR #294: URL: https://github.com/apache/arrow-site/pull/294#discussion_r1067412293 ## _posts/2023-01-07-datafusion-16.0.0.md
[GitHub] [arrow-site] andygrove commented on a diff in pull request #294: [WEBSITE] DataFusion 16.0.0 blog post
andygrove commented on code in PR #294: URL: https://github.com/apache/arrow-site/pull/294#discussion_r1067410437 ## _posts/2023-01-07-datafusion-16.0.0.md
[GitHub] [arrow-site] ashvardanian commented on a diff in pull request #295: [Website] Add links to UKV
ashvardanian commented on code in PR #295: URL: https://github.com/apache/arrow-site/pull/295#discussion_r1066841543 ## use_cases.md: ## @@ -81,7 +83,8 @@ and [others]({{ site.baseurl }}/powered_by/) also use Arrow similarly. The Arrow project also defines [Flight]({% post_url 2019-09-30-introducing-arrow-flight %}), a client-server RPC framework to build rich services exchanging data according -to application-defined semantics. +to application-defined semantics. Flight RPC is used by [UKV](https://unum.cloud/ukv) +to exchange tables, documents, and graphs, between server application and client SDKs. Review Comment: Do you mean removing the duplicate link, or the whole contents of the line?
[GitHub] [arrow-site] alamb commented on a diff in pull request #294: [WEBSITE] DataFusion 16.0.0 blog post
alamb commented on code in PR #294: URL: https://github.com/apache/arrow-site/pull/294#discussion_r1066423131 ## _posts/2023-01-07-datafusion-16.0.0.md: ## @@ -0,0 +1,308 @@ +--- +layout: post +title: "Apache Arrow DataFusion 16.0.0 Project Update" +date: "2023-01-07 00:00:00" +author: pmc +categories: [release] +--- + + +# Introduction + +[DataFusion](https://arrow.apache.org/datafusion/) is an extensible +query execution framework, written in [Rust](https://www.rust-lang.org/), +that uses [Apache Arrow](https://arrow.apache.org) as its +in-memory format. It is targeted primarily at developers creating data +intensive analytics, and offers mature +[SQL support](https://arrow.apache.org/datafusion/user-guide/sql/index.html), +a DataFrame API, and many extension points. + +Systems based on DataFusion perform very well in benchmarks, +especially considering they operate directly on parquet files rather +than first loading into a specialized format. Some recent highlights +include [clickbench](https://benchmark.clickhouse.com/) and the +[Cloudfuse.io standalone query +engines](https://www.cloudfuse.io/dashboards/standalone-engines) page. + +DataFusion is also part of a longer term trend, articulated clearly by +[Andy Pavlo](http://www.cs.cmu.edu/~pavlo/) in his [2022 Databases +Retrospective](https://ottertune.com/blog/2022-databases-retrospective/). +Database frameworks are proliferating and it is likely that all OLAP +DBMSs and other data heavy applications, such as machine learning, +will **require** a vectorized, highly performant query engine in the next +5 years to remain relevant. The only practical way to make such +technology so widely available without many millions of dollars of +investment is though open source engine such as DataFusion or +[Velox](https://github.com/facebookincubator/velox). + +The rest of this post describes the improvements made to DataFusion +over the last three months and some hints of where we are heading. 
+ + +## Community Growth + +We again saw significant growth in the DataFusion community since [our last update](https://arrow.apache.org/blog/2022/10/25/datafusion-13.0.0/). There are some interesting metrics on [OSSRank](https://ossrank.com/p/1573-apache-arrow-datafusion). + +The DataFusion 16.0.0 release consists of 524 PRs from 70 distinct contributors, not including all the work that goes into dependencies such as [arrow](https://crates.io/crates/arrow), [parquet](https://crates.io/crates/parquet), and [object_store](https://crates.io/crates/object_store), which much of the same community helps support. Thank you all for your help! + + +Several [new systems based on DataFusion](https://github.com/apache/arrow-datafusion#known-uses) were recently added: + +* [Greptime DB](https://github.com/GreptimeTeam/greptimedb) +* [Synnada](https://synnada.ai/) +* [PRQL](https://github.com/PRQL/prql-query) +* [Parseable](https://github.com/parseablehq/parseable) +* [SeaFowl](https://github.com/splitgraph/seafowl) + + +## Performance + +Performance and efficiency are core values for +DataFusion. While there is still a gap between DataFusion and best-of-breed, +tightly integrated systems such as [DuckDB](https://duckdb.org) +and [Polars](https://www.pola.rs/), DataFusion is +closing the gap quickly. Performance highlights from the last three +months: + +* Up to 30% faster sorting and merging using the new [Row Format](https://arrow.apache.org/blog/2022/11/07/multi-column-sorts-in-arrow-rust-part-1/) +* [Advanced predicate pushdown](https://arrow.apache.org/blog/2022/12/26/querying-parquet-with-millisecond-latency/), directly on parquet, directly from object storage, enabling sub-millisecond filtering. 
+* Up to `70%` faster `IN` expression evaluation ([#4057]) +* Sort and partition aware optimizations ([#3969] and [#4691]) +* Filter selectivity analysis ([#3868]) + +## Runtime Resource Limits + +Previously, DataFusion could potentially use unbounded amounts of memory for certain queries that included Sorts, Grouping, or Joins. + +In version 16.0.0, it is possible to limit DataFusion's memory usage for Sorting and Grouping. We are looking for help adding similar limiting for Joins as well as expanding our algorithms to optionally spill to secondary storage. See [#3941] for more detail. + + +## SQL Window Functions + +[SQL Window Functions](https://en.wikipedia.org/wiki/Window_function_(SQL)) are useful for a variety of analyses, and DataFusion's support for them expanded significantly: + +- Custom window frames such as `... OVER (ORDER BY ... RANGE BETWEEN 0.2 PRECEDING AND 0.2 FOLLOWING)` +- Unbounded window frames such as `... OVER (ORDER BY ... RANGE UNBOUNDED PRECEDING)` +- Support for the `NTILE` window function ([#4676]) +- Support for `GROUPS` mode ([#4155]) + + +# Improved Joins + +Joins are often the most complicated operations to handle well in +analytics systems, and DataFusion 16.0.0 offers significant improvements +such
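The window-frame features quoted in the draft above (custom `ROWS` frames, `NTILE`) follow standard SQL semantics, so they can be illustrated with any window-capable engine. The sketch below is a stand-in using Python's built-in `sqlite3` module (SQLite ≥ 3.25 also implements window functions), not DataFusion itself; the table, column names, and values are invented for the example.

```python
import sqlite3

# Illustrative stand-in: standard SQL window frames demonstrated via the
# stdlib sqlite3 module (requires SQLite >= 3.25). Table and data are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (grp TEXT, val INTEGER)")
conn.executemany("INSERT INTO t VALUES (?, ?)",
                 [("a", 1), ("a", 2), ("a", 4), ("b", 8), ("b", 16)])

rows = conn.execute("""
    SELECT grp, val,
           -- custom frame: sum over the previous row and the current row
           SUM(val) OVER (PARTITION BY grp ORDER BY val
                          ROWS BETWEEN 1 PRECEDING AND CURRENT ROW) AS running,
           -- NTILE splits the ordered rows into two roughly equal buckets
           NTILE(2) OVER (ORDER BY val) AS bucket
    FROM t ORDER BY grp, val
""").fetchall()
for row in rows:
    print(row)
```

With five rows, `NTILE(2)` places the first three values in bucket 1 and the last two in bucket 2, while the `ROWS` frame produces a two-row running sum within each partition.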
[GitHub] [arrow-site] lidavidm commented on pull request #296: [Website] Add ADBC release post
lidavidm commented on PR #296: URL: https://github.com/apache/arrow-site/pull/296#issuecomment-1377979146 Updated, thanks Ian! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow-site] ozankabak commented on a diff in pull request #294: [WEBSITE] DataFusion 16.0.0 blog post
ozankabak commented on code in PR #294: URL: https://github.com/apache/arrow-site/pull/294#discussion_r1066390458 ## _posts/2023-01-07-datafusion-16.0.0.md:
[GitHub] [arrow-site] alamb commented on pull request #294: [WEBSITE] DataFusion 16.0.0 blog post
alamb commented on PR #294: URL: https://github.com/apache/arrow-site/pull/294#issuecomment-1377924175 Ok I think this one is now ready for some more review -- it is plausibly ready to publish
[GitHub] [arrow-site] alamb commented on a diff in pull request #294: [WEBSITE] DataFusion 16.0.0 blog post (WIP)
alamb commented on code in PR #294: URL: https://github.com/apache/arrow-site/pull/294#discussion_r1066356813 ## _posts/2023-01-07-datafusion-16.0.0.md: ## @@ -0,0 +1,289 @@ +--- +layout: post +title: "Apache Arrow DataFusion 16.0.0 Project Update" +date: "2023-01-07 00:00:00" +author: pmc +categories: [release] +--- + + +# Introduction + +[DataFusion](https://arrow.apache.org/datafusion/) is an extensible +query execution framework, written in [Rust](https://www.rust-lang.org/), +that uses [Apache Arrow](https://arrow.apache.org) as its +in-memory format. It is targeted primarily at developers creating data +intensive analytics, and offers mature +[SQL support](https://arrow.apache.org/datafusion/user-guide/sql/index.html), +a DataFrame API, and many extension points. + +DataFusion-based systems perform very well on performance +benchmarks, especially considering they operate on data in parquet +files directly rather than first loading into a specialized format. +Some recent highlights include [clickbench](https://benchmark.clickhouse.com/) +and the [Cloudfuse.io standalone query engines](https://www.cloudfuse.io/dashboards/standalone-engines) page. + +DataFusion is part of a longer term trend, articulated clearly by [Andy Pavlo](http://www.cs.cmu.edu/~pavlo/) in his +[2022 Databases Retrospective](https://ottertune.com/blog/2022-databases-retrospective/). +Database frameworks are proliferating and it is likely that all OLAP DBMSs and many other data-heavy applications, such as machine learning, will require a vectorized, highly performant query +engine in the next 5 years to remain relevant. +The only practical way to make such technology so widely available +without many millions of dollars of investment is +through open source engines such as DataFusion or [Velox](https://github.com/facebookincubator/velox). + +The rest of this post describes the improvements made to DataFusion +over the last three months and some hints of where we are heading. 
+ +## Community Growth + +The three months since [our last update](https://arrow.apache.org/blog/2022/10/25/datafusion-13.0.0/) again saw significant growth in the DataFusion community. +TODO quantify the growth -- e.g. XXX new contributors to the project and regularly merge YYY PRs a day. + +DataFusion is used as the engine in [many open source and commercial projects](https://github.com/apache/arrow-datafusion#known-uses), and was one of the early open source projects to provide this capability. + +Several new databases built on datafusion (synnada.ai, greptimedb, probably others) +GA of InfluxDB IOx + + +The DataFusion 16.0.0 release consists of 520 PRs from 70 distinct contributors. This does not count all the work that goes into our dependencies such as [arrow](https://crates.io/crates/arrow), [parquet](https://crates.io/crates/parquet), and [object_store](https://crates.io/crates/object_store), which much of the same community helps nurture. + + + +## Performance + +Performance and efficiency are core value propositions for +DataFusion. While there is still a performance gap between DataFusion and best-of-breed, +tightly integrated systems such as [DuckDB](https://duckdb.org) +and [Polars](https://www.pola.rs/), DataFusion is +closing the gap quickly. 
Performance highlights from the last three +months: + +* XX% faster sorting and merging using the new [Row Format](https://arrow.apache.org/blog/2022/11/07/multi-column-sorts-in-arrow-rust-part-1/) +* [Advanced predicate pushdown](https://arrow.apache.org/blog/2022/12/26/querying-parquet-with-millisecond-latency/), directly on parquet, optionally directly from object storage, enabling sub-millisecond filtering +* `IN` expression evaluation made ~20-70% faster by simplifying `InListExpr` ([#4057]) +* Sort and partition aware optimizations such as #3969 and #4691, skipping potentially expensive operations +* Basic filter selectivity analysis (#3868) + + +In the coming few months, we plan to work on: +* Improved grouping performance (TODO link) +* Bloom filtering +* investigate RLE (Run End Encoding support) (todo Arrow link) +* Enable predicate pushdown by default for all cases +* OTHERS? + +## Runtime Resource Limits + +Initially, DataFusion could potentially use unbounded amounts of memory for certain queries that included Sorts, Grouping, or Joins. + +In version 16.0.0, it is possible to limit DataFusion's memory usage for Sorting and Grouping. We are looking for help adding similar limiting for Joins as well as expanding our algorithms to spill to secondary storage, if available. See #3941 for more detail. + + +## SQL Window Functions +[SQL window functions](https://en.wikipedia.org/wiki/Window_function_(SQL)) are useful for a variety of analyses and DataFusion's implementation is close to complete now. + +- Custom window frames such as `... OVER (ORDER BY ... RANGE BETWEEN 0.2 PRECEDING AND 0.2
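The "spill to secondary storage" idea discussed in the Runtime Resource Limits section above can be sketched concretely as an external merge sort. This is a minimal, hedged illustration of the concept, not DataFusion's actual implementation; `external_sort` and `budget` are invented names for the example.

```python
import heapq
import tempfile

def external_sort(values, budget=1000):
    """Sort integers while holding at most `budget` of them in memory:
    when the budget is hit, spill a sorted run to a temp file, then
    stream-merge all runs (one value per run resident at a time)."""
    runs, buf = [], []

    def spill(chunk):
        f = tempfile.TemporaryFile(mode="w+")        # the "secondary storage"
        f.writelines(f"{v}\n" for v in sorted(chunk))
        f.seek(0)
        runs.append(f)

    for v in values:
        buf.append(v)
        if len(buf) >= budget:       # memory limit reached: spill a sorted run
            spill(buf)
            buf = []
    if buf:
        spill(buf)
    # Lazy k-way merge of the sorted runs; memory use is O(number of runs).
    return [int(line) for line in heapq.merge(*runs, key=int)]

print(external_sort([9, 3, 7, 1, 8, 2, 6, 5, 4], budget=3))
# → [1, 2, 3, 4, 5, 6, 7, 8, 9]
```

The trade-off is the classic one: peak memory stays bounded by the budget plus one buffered value per run, at the cost of extra I/O for each spilled run.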
[GitHub] [arrow-site] ianmcook commented on a diff in pull request #296: [Website] Add ADBC release post
ianmcook commented on code in PR #296: URL: https://github.com/apache/arrow-site/pull/296#discussion_r1066309697 ## _posts/2023-01-13-adbc-0.1.0-release.md: ## @@ -0,0 +1,79 @@ +--- +layout: post +title: "Apache Arrow ADBC 0.1.0 (Libraries) Release" +date: "2023-01-13 00:00:00" +author: pmc +categories: [release] +--- + + +The Apache Arrow team is pleased to announce the 0.1.0 release of the +Apache Arrow ADBC libraries. This includes [**63 resolved +issues**][1] from [**8 distinct contributors**][2]. + +This is a release of the **libraries**, which are at version 0.1.0. +The **API specification** is versioned separately and is at version +1.0.0. Review Comment: This might be a good place to add a link to the Introducing ADBC blog post for readers interested in learning more about the specification
[GitHub] [arrow-site] ianmcook commented on a diff in pull request #296: [Website] Add ADBC release post
ianmcook commented on code in PR #296: URL: https://github.com/apache/arrow-site/pull/296#discussion_r1066308344 ## _posts/2023-01-13-adbc-0.1.0-release.md: ## @@ -0,0 +1,79 @@ +--- +layout: post +title: "Apache Arrow ADBC 0.1.0 (Libraries) Release" +date: "2023-01-13 00:00:00" +author: pmc +categories: [release] +--- + + +The Apache Arrow team is pleased to announce the 0.1.0 release of the +Apache Arrow ADBC libraries. This includes [**63 resolved +issues**][1] from [**8 distinct contributors**][2]. + +This is a release of the **libraries**, which are at version 0.1.0. +The **API specification** is versioned separately and is at version +1.0.0. + +The release notes below are not exhaustive and only expose selected highlights +of the release. Many other bugfixes and improvements have been made: we refer +you to the [complete changelog][3]. + +## Release Highlights + +This initial release includes the following: + +- Driver manager libraries for C/C++, Go, Java, Python, and Ruby. +- ADBC drivers for SQLite and PostgreSQL, available in C/C++, Go, Python, and Ruby. +- ADBC drivers for Arrow Flight SQL and JDBC, available in Java. + +## Contributors + +``` +$ git shortlog -sn apache-arrow-adbc-0.1.0 Review Comment: Maybe use one of [these tricks](https://stackoverflow.com/questions/6889830/equivalence-of-git-log-exclude-author) to exclude dependabot from the output
[GitHub] [arrow-site] github-actions[bot] commented on pull request #296: [Website] Add ADBC release post
github-actions[bot] commented on PR #296: URL: https://github.com/apache/arrow-site/pull/296#issuecomment-1377705589 Thanks for opening a pull request! Could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW Then could you also rename pull request title in the following format? ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY} See also: * [Other pull requests](https://github.com/apache/arrow-site/pulls/) * [Contribution Guidelines - How to contribute patches](https://arrow.apache.org/docs/developers/contributing.html#how-to-contribute-patches)
[GitHub] [arrow-site] andygrove commented on pull request #294: [WEBSITE] DataFusion 16.0.0 blog post (WIP)
andygrove commented on PR #294: URL: https://github.com/apache/arrow-site/pull/294#issuecomment-1377545928 I will start contributing to this tomorrow
[GitHub] [arrow-site] alamb commented on a diff in pull request #294: [WEBSITE] DataFusion 16.0.0 blog post (WIP)
alamb commented on code in PR #294: URL: https://github.com/apache/arrow-site/pull/294#discussion_r1065816100 ## _posts/2023-01-07-datafusion-16.0.0.md: Review Comment: Thanks -- added in ffe2e0af210. Still needs polish
[GitHub] [arrow-site] liukun4515 commented on a diff in pull request #294: [WEBSITE] DataFusion 16.0.0 blog post (WIP)
liukun4515 commented on code in PR #294: URL: https://github.com/apache/arrow-site/pull/294#discussion_r1065740157 ## _posts/2023-01-07-datafusion-16.0.0.md:
[GitHub] [arrow-site] kou commented on a diff in pull request #295: [Website] Add links to UKV
kou commented on code in PR #295: URL: https://github.com/apache/arrow-site/pull/295#discussion_r1064587271 ## powered_by.md: ## @@ -184,6 +184,14 @@ short description of your use case. Database Connectivity (ODBC) interface. It provides the ability to return Arrow Tables and RecordBatches in addition to the Python Database API Specification 2.0. +* **[UKV][45]:** Open NoSQL binary database interface, with support for + LevelDB, RocksDB, UDisk, and in-memory Key-Value Stores. It extends + their functionality to support Document Collections, Graphs, and Vector + Search, similar to RedisJSON, RedisGraph, and RediSearch, and brings + familiar structured bindings on top, mimicking tools like Pandas and NetworkX. Review Comment: ```suggestion familiar structured bindings on top, mimicking tools like pandas and NetworkX. ``` ## use_cases.md: ## @@ -64,7 +64,9 @@ The Arrow format also defines a [C data interface]({% post_url 2020-05-04-introd which allows zero-copy data sharing inside a single process without any build-time or link-time dependency requirements. This allows, for example, [R users to access `pyarrow`-based projects]({{ site.baseurl }}/docs/r/articles/python.html) -using the `reticulate` package. +using the `reticulate` package. Similarly, it empowers [UKV](https://unum.cloud/ukv) +to forward persisted data from RocksDB, LevelDB, and UDisk, into Python +runtime and `pyarrow` without copies. Review Comment: Could you revert this? It seems that we only cover use cases of the Apache Arrow project itself here. ## use_cases.md: ## @@ -81,7 +83,8 @@ and [others]({{ site.baseurl }}/powered_by/) also use Arrow similarly. The Arrow project also defines [Flight]({% post_url 2019-09-30-introducing-arrow-flight %}), a client-server RPC framework to build rich services exchanging data according -to application-defined semantics. 
+Flight RPC is used by [UKV](https://unum.cloud/ukv) +to exchange tables, documents, and graphs between the server application and client SDKs. Review Comment: Could you revert this? We refer to the `powered_by/` page in the above paragraph. UKV is introduced on that page.
[GitHub] [arrow-site] github-actions[bot] commented on pull request #295: [Website] Add links to UKV
github-actions[bot] commented on PR #295: URL: https://github.com/apache/arrow-site/pull/295#issuecomment-1375332242 Thanks for opening a pull request! Could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW Then could you also rename pull request title in the following format? ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY} See also: * [Other pull requests](https://github.com/apache/arrow-site/pulls/) * [Contribution Guidelines - How to contribute patches](https://arrow.apache.org/docs/developers/contributing.html#how-to-contribute-patches)
[GitHub] [arrow-site] ashvardanian opened a new pull request, #295: [Website] Add links to UKV
ashvardanian opened a new pull request, #295: URL: https://github.com/apache/arrow-site/pull/295 We have been integrating Apache Arrow across all of our projects during 2022 and are hoping to share them with the broader community.
[GitHub] [arrow-site] ozankabak commented on a diff in pull request #294: [WEBSITE] DataFusion 16.0.0 blog post (WIP)
ozankabak commented on code in PR #294: URL: https://github.com/apache/arrow-site/pull/294#discussion_r1064195213 ## _posts/2023-01-07-datafusion-16.0.0.md: ## @@ -0,0 +1,289 @@ +--- +layout: post +title: "Apache Arrow DataFusion 16.0.0 Project Update" +date: "2023-01-07 00:00:00" +author: pmc +categories: [release] +--- + + +# Introduction + +[DataFusion](https://arrow.apache.org/datafusion/) is an extensible +query execution framework, written in [Rust](https://www.rust-lang.org/), +that uses [Apache Arrow](https://arrow.apache.org) as its +in-memory format. It is targeted primarily at developers creating data +intensive analytics, and offers mature +[SQL support](https://arrow.apache.org/datafusion/user-guide/sql/index.html), +a DataFrame API, and many extension points. + +DataFusion based systems perform very well on performance +benchmarks, especially considering they operate on data in parquet +files directly rather than first loading into a specialized format. +Some recent highlights include [clickbench](https://benchmark.clickhouse.com/) +and the [Cloudfuse.io standalone query engines](https://www.cloudfuse.io/dashboards/standalone-engines) page. + +DataFusion is part of a longer term trend, articulated clearly by [Andy Pavlo](http://www.cs.cmu.edu/~pavlo/) in his +[2022 Databases Retrospective](https://ottertune.com/blog/2022-databases-retrospective/). +Database frameworks are proliferating and it is likely that all OLAP DBMSs and other many data heavy applications such as machine learning, will require a vectorized, highly performant query +engine in the next 5 years to remain relevant. +The only practical way to make such technology so widely available +without many millions of dollars of investment is +though open source engine such as DataFusion or [Velox](https://github.com/facebookincubator/velox). + +The rest of this post describes the improvements made to DataFusion +over the last three months and some hints of where we are heading. 
+ +## Community Growth + +The three months since [our last update](https://arrow.apache.org/blog/2022/10/25/datafusion-13.0.0/) again saw significant growth in the DataFusion community. +TODO quantify the growth -- e.g. XXX new contributors to the project and regularly merge YYY PRs a day. + +DataFusion is used as the engine in [many open source and commercial projects](https://github.com/apache/arrow-datafusion#known-uses) and was one of the early open source projects to provide this capability. + +Several new databases built on datafusion (synnada.ai, greptimedb, probably others) Review Comment: Here is what I am aware of: Databases: greptimedb (new), IOx (GA) Data platform: Synnada (new) Use case: Backend for PRQL (relatively new?) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow-site] ozankabak commented on a diff in pull request #294: [WEBSITE] DataFusion 16.0.0 blog post (WIP)
ozankabak commented on code in PR #294: URL: https://github.com/apache/arrow-site/pull/294#discussion_r1064194546 ## _posts/2023-01-07-datafusion-16.0.0.md: ## @@ -0,0 +1,289 @@ +--- +layout: post +title: "Apache Arrow DataFusion 16.0.0 Project Update" +date: "2023-01-07 00:00:00" +author: pmc +categories: [release] +--- + + +# Introduction + +[DataFusion](https://arrow.apache.org/datafusion/) is an extensible +query execution framework, written in [Rust](https://www.rust-lang.org/), +that uses [Apache Arrow](https://arrow.apache.org) as its +in-memory format. It is targeted primarily at developers creating data +intensive analytics, and offers mature +[SQL support](https://arrow.apache.org/datafusion/user-guide/sql/index.html), +a DataFrame API, and many extension points. + +DataFusion based systems perform very well on performance +benchmarks, especially considering they operate on data in parquet +files directly rather than first loading into a specialized format. +Some recent highlights include [clickbench](https://benchmark.clickhouse.com/) +and the [Cloudfuse.io standalone query engines](https://www.cloudfuse.io/dashboards/standalone-engines) page. + +DataFusion is part of a longer term trend, articulated clearly by [Andy Pavlo](http://www.cs.cmu.edu/~pavlo/) in his +[2022 Databases Retrospective](https://ottertune.com/blog/2022-databases-retrospective/). +Database frameworks are proliferating and it is likely that all OLAP DBMSs and other many data heavy applications such as machine learning, will require a vectorized, highly performant query +engine in the next 5 years to remain relevant. +The only practical way to make such technology so widely available +without many millions of dollars of investment is +though open source engine such as DataFusion or [Velox](https://github.com/facebookincubator/velox). + +The rest of this post describes the improvements made to DataFusion +over the last three months and some hints if where we are heading. 
+ +## Community Growth + +The three months since [our last update](https://arrow.apache.org/blog/2022/10/25/datafusion-13.0.0/) again saw significant growth in the DataFusion community. +TODO quantify the growth -- e.g. XXX new contributors to the project and regularly merge YYY PRs a day. + +DataFusion is used as the engine in [many open source and commercial projects](https://github.com/apache/arrow-datafusion#known-uses) and was one of the early open source projects to provide this capability. + +Several new databases built on datafusion (synnada.ai, greptimedb, probably others) +GA of InfluxDB IOx + + +The DataFusion 16.0.0 release consists of 520 PRs from 70 distinct contributors. This does not count all the work that goes into our dependencies such as [arrow](https://crates.io/crates/arrow), [parquet](https://crates.io/crates/parquet), and [object_store](https://crates.io/crates/object_store), that much of the same community helps nurture. + + + +## Performance + +Performance and efficiency are core value propositions for +DataFusion. While there is still a performance gap between DataFusion and best-of-breed, tightly integrated systems such as [DuckDB](https://duckdb.org) +and [Polars](https://www.pola.rs/), DataFusion is +closing the gap quickly.
Performance highlights from the last three +months: + +* XX% Faster Sorting and Merging using the new [Row Format](https://arrow.apache.org/blog/2022/11/07/multi-column-sorts-in-arrow-rust-part-1/) +* [Advanced predicate pushdown](https://arrow.apache.org/blog/2022/12/26/querying-parquet-with-millisecond-latency/), applied directly on parquet, optionally straight from object storage, enabling sub-millisecond filtering +* Simplified `InListExpr`, making `IN` expressions ~20-70% faster ([#4057]) +* Sort and partition aware optimizations such as #3969 and #4691, skipping potentially expensive operations +* Basic filter selectivity analysis (#3868) + + +In the coming few months, we plan to work on: +* Improved grouping performance (TODO link) +* bloom filtering +* investigate RLE (Run End Encoding support) (todo Arrow link) +* Enable predicate pushdown by default for all cases +* OTHERS? + +## Runtime Resource Limits + +Initially, DataFusion could potentially use unbounded amounts of memory for certain queries that included Sorts, Grouping or Joins. + +In version 16.0.0, it is possible to limit DataFusion's memory usage for Sorting and Grouping. We are looking for help adding similar limiting for Joins as well as expanding our algorithms to spill to secondary storage, if available. See #3941 for more detail. + + +## SQL Window Functions +[SQL window functions](https://en.wikipedia.org/wiki/Window_function_(SQL)) are useful for a variety of analyses and DataFusion's implementation is close to complete now. + +- Custom window frames such as `... OVER (ORDER BY ... RANGE BETWEEN 0.2 PRECEDING AND 0.2
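The custom RANGE window frame quoted above can be tried outside DataFusion in any engine that supports offset frames; here is a minimal, hedged sketch using Python's stdlib sqlite3 as a stand-in for DataFusion's SQL dialect (the `readings` table and its values are made up for illustration):

```python
import sqlite3

# Toy data: timestamped readings. For each row, sum the values of all
# rows whose ts falls within +/- 0.2 of this row's ts.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE readings (ts REAL, val REAL);
    INSERT INTO readings VALUES (0.0, 1.0), (0.1, 2.0), (0.3, 4.0), (0.5, 8.0);
""")

rows = conn.execute("""
    SELECT ts,
           SUM(val) OVER (
               ORDER BY ts
               RANGE BETWEEN 0.2 PRECEDING AND 0.2 FOLLOWING
           ) AS windowed_sum
    FROM readings
    ORDER BY ts
""").fetchall()

for ts, windowed_sum in rows:
    print(ts, windowed_sum)
```

Unlike the default ROWS frame, a RANGE frame with offsets selects peers by the *value* of the ORDER BY expression, which is what makes frames like `0.2 PRECEDING` meaningful for non-integer keys.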
[GitHub] [arrow-site] andygrove commented on a diff in pull request #294: [WEBSITE] DataFusion 16.0.0 blog post (WIP)
andygrove commented on code in PR #294: URL: https://github.com/apache/arrow-site/pull/294#discussion_r1064050185 ## _posts/2023-01-07-datafusion-16.0.0.md: ## @@ -0,0 +1,289 @@ +--- +layout: post +title: "Apache Arrow DataFusion 16.0.0 Project Update" +date: "2023-01-07 00:00:00" +author: pmc +categories: [release] +--- + + +# Introduction + +[DataFusion](https://arrow.apache.org/datafusion/) is an extensible +query execution framework, written in [Rust](https://www.rust-lang.org/), +that uses [Apache Arrow](https://arrow.apache.org) as its +in-memory format. It is targeted primarily at developers creating data +intensive analytics, and offers mature +[SQL support](https://arrow.apache.org/datafusion/user-guide/sql/index.html), +a DataFrame API, and many extension points. + +DataFusion based systems perform very well on performance +benchmarks, especially considering they operate on data in parquet +files directly rather than first loading into a specialized format. +Some recent highlights include [clickbench](https://benchmark.clickhouse.com/) +and the [Cloudfuse.io standalone query engines](https://www.cloudfuse.io/dashboards/standalone-engines) page. + +DataFusion is part of a longer term trend, articulated clearly by [Andy Pavlo](http://www.cs.cmu.edu/~pavlo/) in his +[2022 Databases Retrospective](https://ottertune.com/blog/2022-databases-retrospective/). +Database frameworks are proliferating and it is likely that all OLAP DBMSs and other many data heavy applications such as machine learning, will require a vectorized, highly performant query +engine in the next 5 years to remain relevant. +The only practical way to make such technology so widely available +without many millions of dollars of investment is +though open source engine such as DataFusion or [Velox](https://github.com/facebookincubator/velox). + +The rest of this post describes the improvements made to DataFusion +over the last three months and some hints if where we are heading. 
Review Comment: ```suggestion over the last three months and some hints of where we are heading. ```
[GitHub] [arrow-site] ozankabak commented on a diff in pull request #294: [WEBSITE] DataFusion 16.0.0 blog post (WIP)
ozankabak commented on code in PR #294: URL: https://github.com/apache/arrow-site/pull/294#discussion_r1064040183
[GitHub] [arrow-site] alamb commented on a diff in pull request #294: [WEBSITE] DataFusion 16.0.0 blog post (WIP)
alamb commented on code in PR #294: URL: https://github.com/apache/arrow-site/pull/294#discussion_r1064019922 ## _posts/2023-01-07-datafusion-16.0.0.md: ## @@ -0,0 +1,289 @@ +--- +layout: post +title: "Apache Arrow DataFusion 16.0.0 Project Update" +date: "2023-01-07 00:00:00" +author: pmc +categories: [release] +--- + + +# Introduction + +[DataFusion](https://arrow.apache.org/datafusion/) is an extensible +query execution framework, written in [Rust](https://www.rust-lang.org/), +that uses [Apache Arrow](https://arrow.apache.org) as its +in-memory format. It is targeted primarily at developers creating data +intensive analytics, and offers mature +[SQL support](https://arrow.apache.org/datafusion/user-guide/sql/index.html), +a DataFrame API, and many extension points. + +DataFusion based systems perform very well on performance +benchmarks, especially considering they operate on data in parquet +files directly rather than first loading into a specialized format. +Some recent highlights include [clickbench](https://benchmark.clickhouse.com/) +and the [Cloudfuse.io standalone query engines](https://www.cloudfuse.io/dashboards/standalone-engines) page. + +DataFusion is part of a longer term trend, articulated clearly by [Andy Pavlo](http://www.cs.cmu.edu/~pavlo/) in his +[2022 Databases Retrospective](https://ottertune.com/blog/2022-databases-retrospective/). +Database frameworks are proliferating and it is likely that all OLAP DBMSs and other many data heavy applications such as machine learning, will require a vectorized, highly performant query +engine in the next 5 years to remain relevant. +The only practical way to make such technology so widely available +without many millions of dollars of investment is +though open source engine such as DataFusion or [Velox](https://github.com/facebookincubator/velox). + +The rest of this post describes the improvements made to DataFusion +over the last three months and some hints if where we are heading. 
+ +## Community Growth + +The three months since [our last update](https://arrow.apache.org/blog/2022/10/25/datafusion-13.0.0/) again saw significant growth in the DataFusion. +TODO quantify the growth -- e.g. XXX new contributors to the project and regularly merge YYY PRs a day. + +Growth of new systems based on as the engine in [many open source and commercial projects](https://github.com/apache/arrow-datafusion#known-uses) and was one of the early open source projects to provide this capability. + +Several new databases built on datafusion (synnada.ai, greptimedb, probably others) +GA of InfluxDB IOx + + +The DataFusion 16.0.0 release consists of 520 PRs from 70 distinct contributors. This does not count all the work that goes into our dependencies such as [arrow](https://crates.io/crates/arrow), [parquet](https://crates.io/crates/parquet), and [object_store](https://crates.io/crates/object_store), that much of the same community helps nurture. + + + +## Performance + +Performance and efficiency are core value propositions for +DataFusion. While there is still a performance gap between DataFusion best of +breed tightly, integrated systems such as [DuckDB](https://duckdb.org) +and [Polars](https://www.pola.rs/)https://www.pola.rs/), DataFusion is +closing the gap quickly. Performance highlights from the last three +months: + +* XX% Faster Sorting and Merging using the new [Row Format](https://arrow.apache.org/blog/2022/11/07/multi-column-sorts-in-arrow-rust-part-1/) Review Comment: @tustvold do you have any suggestions about what numbers to use here?
[GitHub] [arrow-site] alamb commented on pull request #294: [WEBSITE] DataFusion 16.0.0 blog post (WIP)
alamb commented on PR #294: URL: https://github.com/apache/arrow-site/pull/294#issuecomment-1374526999 It is a work in progress, but I think it is now coherent enough to gather some more input
[GitHub] [arrow-site] github-actions[bot] commented on pull request #294: [WEBSITE] DataFusion 16.0.0 blog post (WIP)
github-actions[bot] commented on PR #294: URL: https://github.com/apache/arrow-site/pull/294#issuecomment-1374526901 Thanks for opening a pull request! Could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW Then could you also rename pull request title in the following format? ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY} See also: * [Other pull requests](https://github.com/apache/arrow-site/pulls/) * [Contribution Guidelines - How to contribute patches](https://arrow.apache.org/docs/developers/contributing.html#how-to-contribute-patches)
[GitHub] [arrow-site] alamb opened a new pull request, #294: [WEBSITE] DataFusion 16.0.0 blog post
alamb opened a new pull request, #294: URL: https://github.com/apache/arrow-site/pull/294 Closes https://github.com/apache/arrow-datafusion/issues/4804 This blog post highlights some improvements and features in DataFusion over the last 3 releases
[GitHub] [arrow-site] lidavidm merged pull request #293: MINOR: Fix typo in ADBC post
lidavidm merged PR #293: URL: https://github.com/apache/arrow-site/pull/293
[GitHub] [arrow-site] github-actions[bot] commented on pull request #293: MINOR: Fix typo in ADBC post
github-actions[bot] commented on PR #293: URL: https://github.com/apache/arrow-site/pull/293#issuecomment-1372769751 Thanks for opening a pull request! Could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW Then could you also rename pull request title in the following format? ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY} See also: * [Other pull requests](https://github.com/apache/arrow-site/pulls/) * [Contribution Guidelines - How to contribute patches](https://arrow.apache.org/docs/developers/contributing.html#how-to-contribute-patches)
[GitHub] [arrow-site] lidavidm merged pull request #248: [Website] Add ADBC blog post
lidavidm merged PR #248: URL: https://github.com/apache/arrow-site/pull/248
[GitHub] [arrow-site] lidavidm commented on pull request #248: [Website] Add ADBC blog post
lidavidm commented on PR #248: URL: https://github.com/apache/arrow-site/pull/248#issuecomment-1372723648 I'll be publishing this in a bit. Thanks to all who reviewed!
[GitHub] [arrow-site] lidavidm commented on pull request #248: [Website] Add ADBC blog post
lidavidm commented on PR #248: URL: https://github.com/apache/arrow-site/pull/248#issuecomment-1369084955 Updated, thanks Ian! I tweaked the diagram too. Updated preview: https://dynamic-jalebi-94dd65.netlify.app/blog/2023/01/04/introducing-arrow-adbc/ (since I had to bump the date forward)
[GitHub] [arrow-site] lidavidm commented on a diff in pull request #248: [Website] Add ADBC blog post
lidavidm commented on code in PR #248: URL: https://github.com/apache/arrow-site/pull/248#discussion_r1060123304 ## img/ADBC.svg: ## @@ -0,0 +1 @@ +&lt;svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:lucid="lucid" width="800" height="600"&gt; Review Comment: I meant something like that, yeah. The database only has to implement one (columnar, Arrow-based) endpoint/protocol but can support Arrow-native and 'traditional' clients.
[GitHub] [arrow-site] ianmcook commented on a diff in pull request #248: [Website] Add ADBC blog post
ianmcook commented on code in PR #248: URL: https://github.com/apache/arrow-site/pull/248#discussion_r1060116272 ## _posts/2022-12-31-arrow-adbc.md: ## @@ -0,0 +1,217 @@ +--- +layout: post +title: "Introducing ADBC: Database Access for Apache Arrow" +date: "2022-12-31 00:00:00" +author: pmc +categories: [application] +--- + + +The Arrow community would like to introduce version 1.0.0 of the [Arrow Database Connectivity (ADBC)][adbc] specification. +ADBC is a columnar, minimal-overhead alternative to JDBC/ODBC for analytical applications. +Or in other words: **ADBC is a single API for getting Arrow data in and out of different databases**. + +## Motivation + +Applications often use API standards like [JDBC][jdbc] and [ODBC][odbc] to work with databases. +That way, they can code to the same API regardless of the underlying database, saving on development time. +Roughly speaking, when an application executes a query with these APIs: + + + + The query execution flow. + + +1. The application submits a SQL query via the JDBC/ODBC API. +2. The query is passed on to the driver. +3. The driver translates the query to a database-specific protocol and sends it to the database. +4. The database executes the query and returns the result set in a database-specific format. +5. The driver translates the result format into the JDBC/ODBC API. +6. The application iterates over the result rows using the JDBC/ODBC API. + +When columnar data comes into play, however, problems arise. +JDBC is a row-oriented API, and while ODBC can support columnar data, the type system and data representation is not a perfect match with Arrow. +So generally, columnar data must be converted to rows between steps 5 and 6, spending resources without performing "useful" work. + +This mismatch is problematic for columnar database systems, such as ClickHouse, Dremio, DuckDB, and Google BigQuery. 
+On the client side, tools such as Apache Spark and pandas would be better off getting columnar data directly, skipping that conversion. +Otherwise, they're leaving performance on the table. +At the same time, that conversion isn't always avoidable. +Row-oriented database systems like PostgreSQL aren't going away, and these clients will still want to consume data from them. + +Developers have a few options: + +- *Just use JDBC/ODBC*. + These standards are here to stay, and it makes sense for databases to support them for applications that want them. + But when both the database and the application are columnar, that means converting data into rows for JDBC/ODBC, only for the client to convert them right back into columns! + Performance suffers, and developers have to spend time implementing the conversions. +- *Use JDBC/ODBC-to-Arrow conversion libraries*. + Libraries like [Turbodbc][turbodbc] and [arrow-jdbc][arrow-jdbc] handle row-to-columnar conversions for clients. + But this doesn't fundamentally solve the problem. + Unnecessary data conversions are still required. +- *Use vendor-specific protocols*. + For some databases, applications can use a database-specific protocol or SDK to directly get Arrow data. + For example, applications could use Dremio via [Arrow Flight SQL][flight-sql]. + But client applications that want to use multiple database vendors would need to integrate with each of them. + (Look at all the [connectors](https://trino.io/docs/current/connector.html) that Trino implements.) + And databases like PostgreSQL don't offer an option supporting Arrow in the first place. + +As is, clients must choose between tedious integration work and leaving performance on the table. We can make this better. + +## Introducing ADBC + +ADBC is an Arrow-based, vendor-neutral API for interacting with databases. +Applications that use ADBC just get Arrow data.
+They don't have to do any conversions themselves, and they don't have to integrate each database's specific SDK. + +Just like JDBC/ODBC, underneath the ADBC API are drivers that translate the API for specific databases. + +* A driver for an Arrow-native database just passes Arrow data through without conversion. +* A driver for a non-Arrow-native database must convert the data to Arrow. + This saves the application from doing that, and the driver can optimize the conversion for its database. + + + + The query execution flow with two different ADBC drivers. + + +1. The application submits a SQL query via the ADBC API. +2. The query is passed on to the ADBC driver. +3. The driver translates the query to a database-specific protocol and sends the query to the database. +4. The database executes the query and returns the result set in a database-specific format, which is ideally Arrow data. +5. If needed: the driver translates the result into Arrow data. +6. The application iterates over batches of Arrow data. + +The application only deals with one API, and only works with Arrow data. + +For example, in Python, the ADBC packages offer
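The wasted round trip the post describes (a columnar result pivoted into rows by a JDBC/ODBC-style driver, then pivoted straight back into columns by a columnar client) can be sketched in a few lines of plain Python. This is a stdlib-only illustration with made-up data and helper names, not ADBC's or JDBC's actual API:

```python
# A columnar result set as a columnar database produces it: one list per column.
columnar = {"id": [1, 2, 3], "score": [0.5, 0.7, 0.9]}

def to_rows(cols):
    # What a row-oriented (JDBC/ODBC-style) driver must do: pivot columns into row tuples.
    return [tuple(vals) for vals in zip(*cols.values())]

def to_columns(names, rows):
    # What a columnar client then does: pivot the rows right back into columns.
    return {name: list(vals) for name, vals in zip(names, zip(*rows))}

rows = to_rows(columnar)                      # driver: columns -> rows
roundtrip = to_columns(list(columnar), rows)  # client: rows -> columns

# Two full conversions later, the client holds exactly the data it started with.
assert roundtrip == columnar
```

An Arrow-native driver under ADBC skips both pivots: the batches produced by the database are the batches the application iterates over.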
[GitHub] [arrow-site] ianmcook commented on a diff in pull request #248: [Website] Add ADBC blog post
ianmcook commented on code in PR #248: URL: https://github.com/apache/arrow-site/pull/248#discussion_r1060114790
## _posts/2022-12-31-arrow-adbc.md
@@ -0,0 +1,217 @@
+---
+layout: post
+title: "Introducing ADBC: Database Access for Apache Arrow"
+date: "2022-12-31 00:00:00"
+author: pmc
+categories: [application]
+---
+
+The Arrow community would like to introduce version 1.0.0 of the [Arrow Database Connectivity (ADBC)][adbc] specification.
+ADBC is a columnar, minimal-overhead alternative to JDBC/ODBC for analytical applications.
+Or in other words: **ADBC is a single API for getting Arrow data in and out of different databases**.
+
+## Motivation
+
+Applications often use API standards like [JDBC][jdbc] and [ODBC][odbc] to work with databases.
+That way, they can code to the same API regardless of the underlying database, saving on development time.
+Roughly speaking, when an application executes a query with these APIs:
+
+  The query execution flow.
+
+1. The application submits a SQL query via the JDBC/ODBC API.
+2. The query is passed on to the driver.
+3. The driver translates the query to a database-specific protocol and sends it to the database.
+4. The database executes the query and returns the result set in a database-specific format.
+5. The driver translates the result format into the JDBC/ODBC API.
+6. The application iterates over the result rows using the JDBC/ODBC API.
+
+When columnar data comes into play, however, problems arise.
+JDBC is a row-oriented API, and while ODBC can support columnar data, the type system and data representation are not a perfect match with Arrow.
+So generally, columnar data must be converted to rows between steps 5 and 6, spending resources without performing "useful" work.
+
+This mismatch is problematic for columnar database systems, such as ClickHouse, Dremio, DuckDB, and Google BigQuery.
+On the client side, tools such as Apache Spark and pandas would be better off getting columnar data directly, skipping that conversion.
+Otherwise, they're leaving performance on the table.
+At the same time, that conversion isn't always avoidable.
+Row-oriented database systems like PostgreSQL aren't going away, and these clients will still want to consume data from them.
+
+Developers have a few options:
+
+- *Just use JDBC/ODBC*.
+  These standards are here to stay, and it makes sense for databases to support them for applications that want them.
+  But when both the database and the application are columnar, that means converting data into rows for JDBC/ODBC, only for the client to convert them right back into columns!
+  Performance suffers, and developers have to spend time implementing the conversions.
+- *Use JDBC/ODBC-to-Arrow conversion libraries*.
+  Libraries like [Turbodbc][turbodbc] and [arrow-jdbc][arrow-jdbc] handle row to columnar conversions for clients.
+  But this doesn't fundamentally solve the problem.
+  Unnecessary data conversions are still required.
+- *Use vendor-specific protocols*.
+  For some databases, applications can use a database-specific protocol or SDK to directly get Arrow data.
+  For example, applications could use Dremio via [Arrow Flight SQL][flight-sql].
+  But client applications that want to use multiple database vendors would need to integrate with each of them.
+  (Look at all the [connectors](https://trino.io/docs/current/connector.html) that Trino implements.)
+  And databases like PostgreSQL don't offer an option supporting Arrow in the first place.
+
+As is, clients must choose between tedious integration work and leaving performance on the table. We can make this better.
+
+## Introducing ADBC
+
+ADBC is an Arrow-based, vendor-neutral API for interacting with databases.
+Applications that use ADBC just get Arrow data.
+They don't have to do any conversions themselves, and they don't have to integrate each database's specific SDK.
+
+Just like JDBC/ODBC, underneath the ADBC API are drivers that translate the API for specific databases.
+
+* A driver for an Arrow-native database just passes Arrow data through without conversion.
+* A driver for a non-Arrow-native database must convert the data to Arrow.
+  This saves the application from doing that, and the driver can optimize the conversion for its database.
+
+  The query execution flow with two different ADBC drivers.
+
+1. The application submits a SQL query via the ADBC API.
+2. The query is passed on to the ADBC driver.
+3. The driver translates the query to a database-specific protocol and sends the query to the database.
+4. The database executes the query and returns the result set in a database-specific format, which is ideally Arrow data.
+5. If needed: the driver translates the result into Arrow data.
+6. The application iterates over batches of Arrow data.
+
+The application only deals with one API, and only works with Arrow data.
+
+For example, in Python, the ADBC packages offer
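The row-versus-columnar handoff described in the quoted post can be sketched with a small stdlib-only toy model (these functions are invented for illustration; they are not the ADBC or JDBC APIs): a row-oriented driver hands back tuples that a columnar client must transpose in steps 5-6, while an Arrow-native path hands over columns directly.

```python
# Toy model of the row-vs-columnar handoff described in the quoted post.
# All names below are invented for illustration; this is not ADBC/JDBC code.

def row_oriented_fetch():
    """Simulate a JDBC/ODBC-style driver: the result set arrives row by row."""
    return [(1, "alice"), (2, "bob"), (3, "carol")]

def columnar_fetch():
    """Simulate an Arrow-native driver: the result set arrives column by column."""
    return {"id": [1, 2, 3], "name": ["alice", "bob", "carol"]}

# A columnar client behind a row-oriented API must transpose the rows
# back into columns -- resource use that performs no "useful" work.
rows = row_oriented_fetch()
transposed = {
    "id": [row[0] for row in rows],
    "name": [row[1] for row in rows],
}

# An Arrow-native path delivers the same columns with no transpose step.
assert transposed == columnar_fetch()
```

The transpose is pure overhead: the data is identical either way, which is the conversion cost ADBC drivers are designed to avoid (or at least push down into the driver, where it can be optimized per database).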
[GitHub] [arrow-site] ianmcook commented on a diff in pull request #248: [Website] Add ADBC blog post
ianmcook commented on code in PR #248: URL: https://github.com/apache/arrow-site/pull/248#discussion_r1060114466
## _posts/2022-12-31-arrow-adbc.md
[GitHub] [arrow-site] ianmcook commented on a diff in pull request #248: [Website] Add ADBC blog post
ianmcook commented on code in PR #248: URL: https://github.com/apache/arrow-site/pull/248#discussion_r1060113919
## img/ADBC.svg
Review Comment:
This text in the diagram seems confusing:
> The database only works with Arrow data, regardless of the actual client.
The database does not necessarily "work with" Arrow data. It (ultimately) emits Arrow data, but internally might be working with the data in some other database-specific format. Is there a better way to express what you mean by this?
-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow-site] ianmcook commented on a diff in pull request #248: [Website] Add ADBC blog post
ianmcook commented on code in PR #248: URL: https://github.com/apache/arrow-site/pull/248#discussion_r1060110516
## _posts/2022-12-31-arrow-adbc.md
[GitHub] [arrow-site] ianmcook commented on a diff in pull request #248: [Website] Add ADBC blog post
ianmcook commented on code in PR #248: URL: https://github.com/apache/arrow-site/pull/248#discussion_r1060105359
## _posts/2022-12-31-arrow-adbc.md
> +Applications that use ADBC just get Arrow data.
Review Comment:
Or "simply receive" if you want to emphasize that no conversion is necessary.
[GitHub] [arrow-site] ianmcook commented on a diff in pull request #248: [Website] Add ADBC blog post
ianmcook commented on code in PR #248: URL: https://github.com/apache/arrow-site/pull/248#discussion_r1060104986
## _posts/2022-12-31-arrow-adbc.md
> +Applications that use ADBC just get Arrow data.
Review Comment:
```suggestion
Applications that use ADBC receive Arrow data.
```
[GitHub] [arrow-site] ianmcook commented on a diff in pull request #248: [Website] Add ADBC blog post
ianmcook commented on code in PR #248: URL: https://github.com/apache/arrow-site/pull/248#discussion_r1060103981
## _posts/2022-12-31-arrow-adbc.md
> +  But client applications that want to use multiple database vendors would need to integrate with each of them.
Review Comment:
```suggestion
  But client applications that want to support multiple database vendors would need to integrate with each of them.
```
[GitHub] [arrow-site] ianmcook commented on a diff in pull request #248: [Website] Add ADBC blog post
ianmcook commented on code in PR #248: URL: https://github.com/apache/arrow-site/pull/248#discussion_r1060103099
## _posts/2022-12-31-arrow-adbc.md
> +  Libraries like [Turbodbc][turbodbc] and [arrow-jdbc][arrow-jdbc] handle row to columnar conversions for clients.
Review Comment:
```suggestion
  Libraries like [Turbodbc][turbodbc] and [arrow-jdbc][arrow-jdbc] handle row-to-columnar conversions for clients.
```
[GitHub] [arrow-site] ianmcook commented on a diff in pull request #248: [Website] Add ADBC blog post
ianmcook commented on code in PR #248: URL: https://github.com/apache/arrow-site/pull/248#discussion_r1060100843 ## _posts/2022-12-31-arrow-adbc.md: ## @@ -0,0 +1,217 @@ +--- +layout: post +title: "Introducing ADBC: Database Access for Apache Arrow" +date: "2022-12-31 00:00:00" +author: pmc +categories: [application] +--- + + +The Arrow community would like to introduce version 1.0.0 of the [Arrow Database Connectivity (ADBC)][adbc] specification. +ADBC is a columnar, minimal-overhead alternative to JDBC/ODBC for analytical applications. +Or in other words: **ADBC is a single API for getting Arrow data in and out of different databases**. + +## Motivation + +Applications often use API standards like [JDBC][jdbc] and [ODBC][odbc] to work with databases. +That way, they can code to the same API regardless of the underlying database, saving on development time. +Roughly speaking, when an application executes a query with these APIs: + + + + The query execution flow. + + +1. The application submits a SQL query via the JDBC/ODBC API. +2. The query is passed on to the driver. +3. The driver translates the query to a database-specific protocol and sends it to the database. +4. The database executes the query and returns the result set in a database-specific format. +5. The driver translates the result format into the JDBC/ODBC API. +6. The application iterates over the result rows using the JDBC/ODBC API. + +When columnar data comes into play, however, problems arise. +JDBC is a row-oriented API, and while ODBC can support columnar data, the type system and data representation is not a perfect match with Arrow. +So generally, columnar data must be converted to rows between steps 5 and 6, spending resources without performing "useful" work. Review Comment: ```suggestion So generally, columnar data must be converted to rows in step 5, spending resources without performing "useful" work. ``` -- This is an automated message from the Apache Git Service. 
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
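The review comment above concerns the six-step JDBC/ODBC flow quoted from the draft post, and the cost it calls out: a row-oriented driver hands back rows, so a columnar client must transpose them straight back into columns. A minimal stdlib-only sketch of that transpose (not Arrow or ADBC code; the function name is invented for illustration) makes the "work without useful work" concrete:

```python
# Illustrative sketch of the row-to-column shuffle the quoted post describes.
# A JDBC/ODBC-style driver returns rows (step 5), so a columnar client must
# transpose them back into columns (step 6), touching every value once more.

def rows_to_columns(rows, names):
    """Transpose row tuples into a dict mapping column name -> list of values."""
    columns = {name: [] for name in names}
    for row in rows:                          # one extra pass over every value...
        for name, value in zip(names, row):
            columns[name].append(value)       # ...just to undo the driver's row layout
    return columns

result_set = [(1, "a"), (2, "b"), (3, "c")]   # rows from a JDBC/ODBC-style API
print(rows_to_columns(result_set, ["id", "label"]))
# {'id': [1, 2, 3], 'label': ['a', 'b', 'c']}
```

If the database was columnar to begin with, this transpose (and the driver's earlier column-to-row conversion) is pure overhead, which is the mismatch the post's Motivation section is describing.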
[GitHub] [arrow-site] ianmcook commented on a diff in pull request #248: [Website] Add ADBC blog post
ianmcook commented on code in PR #248: URL: https://github.com/apache/arrow-site/pull/248#discussion_r1060099812 ## _posts/2022-12-31-arrow-adbc.md: ## @@ -0,0 +1,217 @@ +--- +layout: post +title: "Introducing ADBC: Database Access for Apache Arrow" +date: "2022-12-31 00:00:00" +author: pmc +categories: [application] +--- + + +The Arrow community would like to introduce version 1.0.0 of the [Arrow Database Connectivity (ADBC)][adbc] specification. +ADBC is a columnar, minimal-overhead alternative to JDBC/ODBC for analytical applications. +Or in other words: **ADBC is a single API for getting Arrow data in and out of different databases**. + +## Motivation + +Applications often use API standards like [JDBC][jdbc] and [ODBC][odbc] to work with databases. +That way, they can code to the same API regardless of the underlying database, saving on development time. +Roughly speaking, when an application executes a query with these APIs: + + + + The query execution flow. + + +1. The application submits a SQL query via the JDBC/ODBC API. +2. The query is passed on to the driver. +3. The driver translates the query to a database-specific protocol and sends it to the database. +4. The database executes the query and returns the result set in a database-specific format. +5. The driver translates the result format into the JDBC/ODBC API. Review Comment: Is this what this means? ```suggestion 5. The driver translates the result into the format required by the JDBC/ODBC API. ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow-site] paleolimbot commented on pull request #288: [Website] WIP: Add nanoarrow blog post
paleolimbot commented on PR #288: URL: https://github.com/apache/arrow-site/pull/288#issuecomment-1369021346 Posting to the mailing list about this shortly...just adding that a rendered version of this that's easier to read can be found at https://github.com/paleolimbot/arrow-site/blob/nanoarrow-intro-post/_posts/2022-12-14-nanoarrow.md
[GitHub] [arrow-site] lidavidm commented on pull request #248: [Website] Add ADBC blog post
lidavidm commented on PR #248: URL: https://github.com/apache/arrow-site/pull/248#issuecomment-1368096230 @ksuarez1423, @ianmcook any final comments? (Especially since I rewrote the post quite heavily.) Thanks for all your help & happy New Year's!
[GitHub] [arrow-site] eitsupi commented on issue #291: [R] version selector is broken
eitsupi commented on issue #291: URL: https://github.com/apache/arrow-site/issues/291#issuecomment-1367675878 Thanks!
[GitHub] [arrow-site] kou commented on issue #291: [R] version selector is broken
kou commented on issue #291: URL: https://github.com/apache/arrow-site/issues/291#issuecomment-1367571443 Deployed.
[GitHub] [arrow-site] kou closed issue #291: [R] version selector is broken
kou closed issue #291: [R] version selector is broken URL: https://github.com/apache/arrow-site/issues/291
[GitHub] [arrow-site] kou merged pull request #292: GH-291: [R] Update versions.json for 10.0.1
kou merged PR #292: URL: https://github.com/apache/arrow-site/pull/292
[GitHub] [arrow-site] kou commented on issue #291: [R] version selector is broken
kou commented on issue #291: URL: https://github.com/apache/arrow-site/issues/291#issuecomment-1366966516 This was fixed by https://github.com/apache/arrow/pull/14887 but it's not deployed yet. #292 will fix this.
[GitHub] [arrow-site] kou opened a new pull request, #292: GH-291: [R] Update versions.json for 10.0.1
kou opened a new pull request, #292: URL: https://github.com/apache/arrow-site/pull/292 Closes #291.
[GitHub] [arrow-site] eitsupi opened a new issue, #291: [R] version selector is broken
eitsupi opened a new issue, #291: URL: https://github.com/apache/arrow-site/issues/291 The development version is displayed on the release version site and I cannot go from the release version to the development version site. (It is possible to go to the development version site from the past version.) ![image](https://user-images.githubusercontent.com/50911393/209799756-bae159f2-194b-44d3-a39a-5c412be9f88c.png) This is due to the fact that the following lines were not updated by #273? https://github.com/apache/arrow-site/blob/6090a411a8ad51f7ec90d3b366cee19904bc03c1/docs/r/versions.json#L2-L9
[GitHub] [arrow-site] alamb merged pull request #280: [WEBSITE]: Querying Parquet with Millisecond Latency
alamb merged PR #280: URL: https://github.com/apache/arrow-site/pull/280
[GitHub] [arrow-site] alamb commented on pull request #280: [WEBSITE]: Querying Parquet with Millisecond Latency
alamb commented on PR #280: URL: https://github.com/apache/arrow-site/pull/280#issuecomment-1365316876 Per the mailing list discussion https://lists.apache.org/thread/l377q5f20kyltb37m345p287kpo22qb6 I plan to publish this later today or tomorrow
[GitHub] [arrow-site] lidavidm commented on a diff in pull request #248: [Website] Add ADBC blog post
lidavidm commented on code in PR #248: URL: https://github.com/apache/arrow-site/pull/248#discussion_r1055873826 ## _posts/2022-12-31-arrow-adbc.md: ## @@ -0,0 +1,217 @@ +--- +layout: post +title: "Introducing ADBC: Database Access for Apache Arrow" +date: "2022-12-31 00:00:00" +author: pmc +categories: [application] +--- + + +The Arrow community would like to introduce version 1.0.0 of the [Arrow Database Connectivity (ADBC)][adbc] specification. +ADBC is a columnar, minimal-overhead alternative to JDBC/ODBC for analytical applications. +Or in other words: **ADBC is a single API for getting Arrow data in and out of different databases**. + +## Motivation + +Applications often use API standards like [JDBC][jdbc] and [ODBC][odbc] to work with databases. +That way, they can code to the same API regardless of the underlying database, saving on development time. +Roughly speaking, when an application executes a query with these APIs: + + + + The query execution flow. + + +1. The application submits a SQL query via the JDBC/ODBC API. +2. The query is passed on to the driver. +3. The driver translates the query to a database-specific protocol and sends it to the database. +4. The database executes the query and returns the result set in a database-specific format. +5. The driver translates the result format into the JDBC/ODBC API. +6. The application iterates over the result rows using the JDBC/ODBC API. + +When columnar data comes into play, however, problems arise. +JDBC is a row-oriented API, and while ODBC can support columnar data, the type system and data representation is not a perfect match with Arrow. +So generally, columnar data must be converted to rows between steps 5 and 6, spending resources without performing "useful" work. + +This mismatch is problematic for columnar database systems, such as ClickHouse, Dremio, DuckDB, and Google BigQuery. 
+On the client side, tools such as Apache Spark and pandas would be better off getting columnar data directly, skipping that conversion. +Otherwise, they're leaving performance on the table. +At the same time, that conversion isn't always avoidable. +Row-oriented database systems like PostgreSQL aren't going away, and these clients will still want to consume data from them. + +Developers have a few options: + +- *Just use JDBC/ODBC*. + These standards are here to stay, and it makes sense for databases to support them for applications that want them. + But when both the database and the application are columnar, that means converting data into rows for JDBC/ODBC, only for the client to convert them right back into columns! + Performance suffers, and developers have to spend time implementing the conversions. +- *Use JDBC/ODBC-to-Arrow conversion libraries*. + Libraries like [Turbodbc][turbodbc] and [arrow-jdbc][arrow-jdbc] handle row to columnar conversions for clients. + But this doesn't fundamentally solve the problem. + Unnecessary data conversions are still required. +- *Use vendor-specific protocols*. + For some databases, applications can use a database-specific protocol or SDK to directly get Arrow data. + For example, applications could use Dremio via [Arrow Flight SQL][flight-sql]. + But client applications that want to use multiple database vendors would need to integrate with each of them. + (Look at all the [connectors](https://trino.io/docs/current/connector.html) that Trino implements.) + And databases like PostgreSQL don't offer an option supporting Arrow in the first place. + +So in the status quo, clients must choose between either tedious integration work or leaving performance on the table. Review Comment: Thanks! Updated the preview. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
[GitHub] [arrow-site] ksuarez1423 commented on a diff in pull request #248: [Website] Add ADBC blog post
ksuarez1423 commented on code in PR #248: URL: https://github.com/apache/arrow-site/pull/248#discussion_r1055763555 ## _posts/2022-12-31-arrow-adbc.md: ## @@ -0,0 +1,217 @@ +--- +layout: post +title: "Introducing ADBC: Database Access for Apache Arrow" +date: "2022-12-31 00:00:00" +author: pmc +categories: [application] +--- + + +The Arrow community would like to introduce version 1.0.0 of the [Arrow Database Connectivity (ADBC)][adbc] specification. +ADBC is a columnar, minimal-overhead alternative to JDBC/ODBC for analytical applications. +Or in other words: **ADBC is a single API for getting Arrow data in and out of different databases**. + +## Motivation + +Applications often use API standards like [JDBC][jdbc] and [ODBC][odbc] to work with databases. +That way, they can code to the same API regardless of the underlying database, saving on development time. +Roughly speaking, when an application executes a query with these APIs: + + + + The query execution flow. + + +1. The application submits a SQL query via the JDBC/ODBC API. +2. The query is passed on to the driver. +3. The driver translates the query to a database-specific protocol and sends it to the database. +4. The database executes the query and returns the result set in a database-specific format. +5. The driver translates the result format into the JDBC/ODBC API. +6. The application iterates over the result rows using the JDBC/ODBC API. + +When columnar data comes into play, however, problems arise. +JDBC is a row-oriented API, and while ODBC can support columnar data, the type system and data representation is not a perfect match with Arrow. +So generally, columnar data must be converted to rows between steps 5 and 6, spending resources without performing "useful" work. + +This mismatch is problematic for columnar database systems, such as ClickHouse, Dremio, DuckDB, and Google BigQuery. 
+On the client side, tools such as Apache Spark and pandas would be better off getting columnar data directly, skipping that conversion. +Otherwise, they're leaving performance on the table. +At the same time, that conversion isn't always avoidable. +Row-oriented database systems like PostgreSQL aren't going away, and these clients will still want to consume data from them. + +Developers have a few options: + +- *Just use JDBC/ODBC*. + These standards are here to stay, and it makes sense for databases to support them for applications that want them. + But when both the database and the application are columnar, that means converting data into rows for JDBC/ODBC, only for the client to convert them right back into columns! + Performance suffers, and developers have to spend time implementing the conversions. +- *Use JDBC/ODBC-to-Arrow conversion libraries*. + Libraries like [Turbodbc][turbodbc] and [arrow-jdbc][arrow-jdbc] handle row to columnar conversions for clients. + But this doesn't fundamentally solve the problem. + Unnecessary data conversions are still required. +- *Use vendor-specific protocols*. + For some databases, applications can use a database-specific protocol or SDK to directly get Arrow data. + For example, applications could use Dremio via [Arrow Flight SQL][flight-sql]. + But client applications that want to use multiple database vendors would need to integrate with each of them. + (Look at all the [connectors](https://trino.io/docs/current/connector.html) that Trino implements.) + And databases like PostgreSQL don't offer an option supporting Arrow in the first place. + +So in the status quo, clients must choose between either tedious integration work or leaving performance on the table. Review Comment: ```suggestion As is, clients must choose between either tedious integration work or leaving performance on the table. We can make this better. ``` I think this could be punchier as a section ender. 
[GitHub] [arrow-site] lidavidm commented on a diff in pull request #248: [Website] Add ADBC blog post
lidavidm commented on code in PR #248: URL: https://github.com/apache/arrow-site/pull/248#discussion_r1055655147 ## _posts/2022-12-31-arrow-adbc.md: ## @@ -0,0 +1,217 @@ +--- +layout: post +title: "Introducing ADBC: Database Access for Apache Arrow" +date: "2022-12-31 00:00:00" +author: pmc +categories: [application] +--- + + +The Arrow community would like to introduce version 1.0.0 of the [Arrow Database Connectivity (ADBC)][adbc] specification. +ADBC is a columnar, minimal-overhead alternative to JDBC/ODBC for analytical applications. +Or in other words: **ADBC is a single API for getting Arrow data in and out of different databases**. + +## Motivation + +Applications often use API standards like [JDBC][jdbc] and [ODBC][odbc] to work with databases. +That way, they can code to the same API regardless of the underlying database, saving on development time. +Roughly speaking, when an application executes a query with these APIs: + +1. The application submits a SQL query via the JDBC/ODBC API. +2. The query is passed on to the driver. +3. The driver translates the query to a database-specific protocol and sends it to the database. +4. The database executes the query and returns the result set in a database-specific format. +5. The driver translates the result format into the JDBC/ODBC API. +6. The application iterates over the result rows using the JDBC/ODBC API. + + + + The query execution flow. + + +When columnar data comes into play, however, problems arise. +JDBC is a row-oriented API, and while ODBC can support columnar data, the type system and data representation is not a perfect match with Arrow. +So generally, columnar data must be converted to rows between steps 5 and 6, spending resources without performing "useful" work. + +This mismatch is problematic for columnar database systems, such as ClickHouse, Dremio, DuckDB, and Google BigQuery. 
+On the client side, tools such as Apache Spark and pandas would be better off getting columnar data directly, skipping that conversion. +Otherwise, they're leaving performance on the table. +At the same time, that conversion isn't always avoidable. +Row-oriented database systems like PostgreSQL aren't going away, and these clients will still want to consume data from them. + +Developers have a few options: + +- *Just use JDBC/ODBC*. + These standards are here to stay, and it makes sense for databases to support them for applications that want them. + But when both the database and the application are columnar, that means converting data into rows for JDBC/ODBC, only for the client to convert them right back into columns! + Performance suffers, and developers have to spend time implementing the conversions. +- *Use JDBC/ODBC to Arrow conversion libraries*. + Libraries like [Turbodbc][turbodbc] and [arrow-jdbc][arrow-jdbc] handle row to columnar conversions for clients. + But this doesn't fundamentally solve the problem. + Unnecessary data conversions are still required. +- *Use vendor-specific protocols*. + For some databases, applications can use a database-specific protocol or SDK to directly get Arrow data. + For example, applications could use Dremio via [Arrow Flight SQL][flight-sql]. + But client applications that want to use multiple database vendors would need to integrate with each of them. + (Look at all the [connectors](https://trino.io/docs/current/connector.html) that Trino implements.) + And databases like PostgreSQL don't offer an option supporting Arrow in the first place. + +So in the status quo, clients must choose between either tedious integration work or leaving performance on the table. + +## Introducing ADBC + +ADBC is an Arrow-based, vendor-netural API for interacting with databases. +Applications that use ADBC just get Arrow data. 
+They don't have to do any conversions themselves, and they don't have to integrate each database's specific SDK. + +Just like JDBC/ODBC, underneath the ADBC API are drivers that translate the API for specific databases. + +* A driver for an Arrow-native database just passes Arrow data through without conversion. +* A driver for a non-Arrow-native database must convert the data to Arrow. + This saves the application from doing that, and the driver can optimize the conversion for its database. + + + + The query execution flow with two different ADBC drivers. + + Review Comment: Updated (diagram precedes list in both places) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
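The quoted draft above describes the two kinds of ADBC drivers: one that passes Arrow data through untouched, and one that converts a row-based protocol to columns inside the driver. A hypothetical stdlib-only sketch of that split (class and method names are invented for illustration, not the real ADBC API) shows why the application code stays identical either way:

```python
# Sketch of the driver split described in the quoted post: the application
# always calls the same method and always receives columnar data; only the
# driver knows whether a conversion was needed.

class ArrowNativeDriver:
    """Driver for a database that already speaks a columnar protocol."""
    def __init__(self, batches):
        self._batches = batches            # columnar data straight off the wire

    def fetch_columns(self):
        return self._batches               # pass-through: no conversion at all


class RowProtocolDriver:
    """Driver for a row-based database (think the PostgreSQL wire format)."""
    def __init__(self, rows, names):
        self._rows, self._names = rows, names

    def fetch_columns(self):
        # The one unavoidable rows-to-columns conversion happens here, once,
        # inside the driver, instead of in every application.
        columns = {name: [] for name in self._names}
        for row in self._rows:
            for name, value in zip(self._names, row):
                columns[name].append(value)
        return columns


# Application code is identical against either driver:
for driver in (ArrowNativeDriver({"id": [1, 2]}),
               RowProtocolDriver([(1,), (2,)], ["id"])):
    print(driver.fetch_columns())
```

This is the point the post makes: the driver for a non-Arrow-native database "saves the application from doing that, and the driver can optimize the conversion for its database."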
[GitHub] [arrow-site] ianmcook commented on a diff in pull request #248: [Website] Add ADBC blog post
ianmcook commented on code in PR #248: URL: https://github.com/apache/arrow-site/pull/248#discussion_r1055650626 ## _posts/2022-12-31-arrow-adbc.md: ## @@ -0,0 +1,217 @@ +--- +layout: post +title: "Introducing ADBC: Database Access for Apache Arrow" +date: "2022-12-31 00:00:00" +author: pmc +categories: [application] +--- + + +The Arrow community would like to introduce version 1.0.0 of the [Arrow Database Connectivity (ADBC)][adbc] specification. +ADBC is a columnar, minimal-overhead alternative to JDBC/ODBC for analytical applications. +Or in other words: **ADBC is a single API for getting Arrow data in and out of different databases**. + +## Motivation + +Applications often use API standards like [JDBC][jdbc] and [ODBC][odbc] to work with databases. +That way, they can code to the same API regardless of the underlying database, saving on development time. +Roughly speaking, when an application executes a query with these APIs: + +1. The application submits a SQL query via the JDBC/ODBC API. +2. The query is passed on to the driver. +3. The driver translates the query to a database-specific protocol and sends it to the database. +4. The database executes the query and returns the result set in a database-specific format. +5. The driver translates the result format into the JDBC/ODBC API. +6. The application iterates over the result rows using the JDBC/ODBC API. + + + + The query execution flow. + + +When columnar data comes into play, however, problems arise. +JDBC is a row-oriented API, and while ODBC can support columnar data, the type system and data representation is not a perfect match with Arrow. +So generally, columnar data must be converted to rows between steps 5 and 6, spending resources without performing "useful" work. + +This mismatch is problematic for columnar database systems, such as ClickHouse, Dremio, DuckDB, and Google BigQuery. 
+On the client side, tools such as Apache Spark and pandas would be better off getting columnar data directly, skipping that conversion. +Otherwise, they're leaving performance on the table. +At the same time, that conversion isn't always avoidable. +Row-oriented database systems like PostgreSQL aren't going away, and these clients will still want to consume data from them. + +Developers have a few options: + +- *Just use JDBC/ODBC*. + These standards are here to stay, and it makes sense for databases to support them for applications that want them. + But when both the database and the application are columnar, that means converting data into rows for JDBC/ODBC, only for the client to convert them right back into columns! + Performance suffers, and developers have to spend time implementing the conversions. +- *Use JDBC/ODBC to Arrow conversion libraries*. + Libraries like [Turbodbc][turbodbc] and [arrow-jdbc][arrow-jdbc] handle row to columnar conversions for clients. + But this doesn't fundamentally solve the problem. + Unnecessary data conversions are still required. +- *Use vendor-specific protocols*. + For some databases, applications can use a database-specific protocol or SDK to directly get Arrow data. + For example, applications could use Dremio via [Arrow Flight SQL][flight-sql]. + But client applications that want to use multiple database vendors would need to integrate with each of them. + (Look at all the [connectors](https://trino.io/docs/current/connector.html) that Trino implements.) + And databases like PostgreSQL don't offer an option supporting Arrow in the first place. + +So in the status quo, clients must choose between either tedious integration work or leaving performance on the table. + +## Introducing ADBC + +ADBC is an Arrow-based, vendor-netural API for interacting with databases. +Applications that use ADBC just get Arrow data. 
+They don't have to do any conversions themselves, and they don't have to integrate each database's specific SDK. + +Just like JDBC/ODBC, underneath the ADBC API are drivers that translate the API for specific databases. + +* A driver for an Arrow-native database just passes Arrow data through without conversion. +* A driver for a non-Arrow-native database must convert the data to Arrow. + This saves the application from doing that, and the driver can optimize the conversion for its database. + + + + The query execution flow with two different ADBC drivers. + + Review Comment: Here the numbered list follows the diagram, whereas above, the numbered list precedes the diagram. It'd probably be best to have the order the same in both places. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow-site] lidavidm commented on a diff in pull request #248: [Website] Add ADBC blog post
lidavidm commented on code in PR #248: URL: https://github.com/apache/arrow-site/pull/248#discussion_r1055647370 ## _posts/2022-12-31-arrow-adbc.md: ## @@ -0,0 +1,217 @@ +--- +layout: post +title: "Introducing ADBC: Database Access for Apache Arrow" +date: "2022-12-31 00:00:00" +author: pmc +categories: [application] +--- + + +The Arrow community would like to introduce version 1.0.0 of the [Arrow Database Connectivity (ADBC)][adbc] specification. +ADBC is a columnar, minimal-overhead alternative to JDBC/ODBC for analytical applications. +Or in other words: **ADBC is a single API for getting Arrow data in and out of different databases**. + +## Motivation + +Applications often use API standards like [JDBC][jdbc] and [ODBC][odbc] to work with databases. +That way, they can code to the same API regardless of the underlying database, saving on development time. +Roughly speaking, when an application executes a query with these APIs: + +1. The application submits a SQL query via the JDBC/ODBC API. +2. The query is passed on to the driver. +3. The driver translates the query to a database-specific protocol and sends it to the database. +4. The database executes the query and returns the result set in a database-specific format. +5. The driver translates the result format into the JDBC/ODBC API. +6. The application iterates over the result rows using the JDBC/ODBC API. + + + + The query execution flow. + + +When columnar data comes into play, however, problems arise. +JDBC is a row-oriented API, and while ODBC can support columnar data, the type system and data representation is not a perfect match with Arrow. +So generally, columnar data must be converted to rows between steps 5 and 6, spending resources without performing "useful" work. + +This mismatch is problematic for columnar database systems, such as ClickHouse, Dremio, DuckDB, and Google BigQuery. 
+On the client side, tools such as Apache Spark and pandas would be better off getting columnar data directly, skipping that conversion. +Otherwise, they're leaving performance on the table. +At the same time, that conversion isn't always avoidable. +Row-oriented database systems like PostgreSQL aren't going away, and these clients will still want to consume data from them. + +Developers have a few options: + +- *Just use JDBC/ODBC*. + These standards are here to stay, and it makes sense for databases to support them for applications that want them. + But when both the database and the application are columnar, that means converting data into rows for JDBC/ODBC, only for the client to convert them right back into columns! + Performance suffers, and developers have to spend time implementing the conversions. +- *Use JDBC/ODBC to Arrow conversion libraries*. Review Comment: thanks! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow-site] ianmcook commented on a diff in pull request #248: [Website] Add ADBC blog post
ianmcook commented on code in PR #248: URL: https://github.com/apache/arrow-site/pull/248#discussion_r1055645668 ## _posts/2022-12-31-arrow-adbc.md: ## @@ -0,0 +1,217 @@ +--- +layout: post +title: "Introducing ADBC: Database Access for Apache Arrow" +date: "2022-12-31 00:00:00" +author: pmc +categories: [application] +--- + + +The Arrow community would like to introduce version 1.0.0 of the [Arrow Database Connectivity (ADBC)][adbc] specification. +ADBC is a columnar, minimal-overhead alternative to JDBC/ODBC for analytical applications. +Or in other words: **ADBC is a single API for getting Arrow data in and out of different databases**. + +## Motivation + +Applications often use API standards like [JDBC][jdbc] and [ODBC][odbc] to work with databases. +That way, they can code to the same API regardless of the underlying database, saving on development time. +Roughly speaking, when an application executes a query with these APIs: + +1. The application submits a SQL query via the JDBC/ODBC API. +2. The query is passed on to the driver. +3. The driver translates the query to a database-specific protocol and sends it to the database. +4. The database executes the query and returns the result set in a database-specific format. +5. The driver translates the result format into the JDBC/ODBC API. +6. The application iterates over the result rows using the JDBC/ODBC API. + + + + The query execution flow. + + +When columnar data comes into play, however, problems arise. +JDBC is a row-oriented API, and while ODBC can support columnar data, the type system and data representation is not a perfect match with Arrow. +So generally, columnar data must be converted to rows between steps 5 and 6, spending resources without performing "useful" work. + +This mismatch is problematic for columnar database systems, such as ClickHouse, Dremio, DuckDB, and Google BigQuery. 
+On the client side, tools such as Apache Spark and pandas would be better off getting columnar data directly, skipping that conversion. +Otherwise, they're leaving performance on the table. +At the same time, that conversion isn't always avoidable. +Row-oriented database systems like PostgreSQL aren't going away, and these clients will still want to consume data from them. + +Developers have a few options: + +- *Just use JDBC/ODBC*. + These standards are here to stay, and it makes sense for databases to support them for applications that want them. + But when both the database and the application are columnar, that means converting data into rows for JDBC/ODBC, only for the client to convert them right back into columns! + Performance suffers, and developers have to spend time implementing the conversions. +- *Use JDBC/ODBC to Arrow conversion libraries*. Review Comment: reads more clearly with hyphens ```suggestion - *Use JDBC/ODBC-to-Arrow conversion libraries*. ```
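The rows-to-columns conversion the quoted draft describes (between steps 5 and 6 of the JDBC/ODBC flow) can be sketched with Python's standard-library `sqlite3` module standing in for a row-oriented JDBC/ODBC driver — a minimal illustration of the mismatch, not the post's actual benchmark:

```python
import sqlite3

# A row-oriented API (Python's DB-API here, standing in for JDBC/ODBC):
# the driver hands results back as tuples, one per row.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?)", [(1, "a"), (2, "b"), (3, "c")])

rows = conn.execute("SELECT id, name FROM t ORDER BY id").fetchall()

# A columnar consumer must now transpose those rows back into columns --
# the conversion the post calls resource spend without "useful" work.
ids = [r[0] for r in rows]
names = [r[1] for r in rows]
print(ids)    # [1, 2, 3]
print(names)  # ['a', 'b', 'c']
```

If the database stored the data columnarly to begin with, this transpose (and the driver's earlier columns-to-rows step) is pure overhead on both ends.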
[GitHub] [arrow-site] lidavidm commented on a diff in pull request #248: [Website] Add ADBC blog post
lidavidm commented on code in PR #248: URL: https://github.com/apache/arrow-site/pull/248#discussion_r1053762319 ## _posts/2022-12-31-arrow-adbc.md: ## @@ -0,0 +1,252 @@ +--- +layout: post +title: "Introducing ADBC: Database Access for Apache Arrow" +date: "2022-12-31 00:00:00" +author: pmc +categories: [application] +--- + + +The Arrow community would like to introduce version 1.0.0 of the [Arrow Database Connectivity (ADBC)][adbc] specification. +**ADBC aims to be an columnar, minimal-overhead alternative JDBC/ODBC for analytical applications**. +It defines vendor-agnostic and Arrow-based APIs for common database tasks, like executing queries and getting basic metadata. +These APIs are available, either directly or via bindings, in C/C++, Go, Java, Python, Ruby, and soon R. + +With ADBC, developers get both the benefits of using columnar Arrow data and having generic API abstractions. +Like [JDBC][jdbc]/[ODBC][odbc], ADBC defines database-independent interaction APIs, and relies on drivers to implement those APIs for particular databases. +ADBC aims to bring all of these together under a single API: + +- Vendor-specific Arrow-native protocols, like [Arrow Flight SQL][flight-sql] or those offered by ClickHouse or Google BigQuery; +- Non-columnar protocols, like the PostgreSQL wire format; +- Non-columnar API abstractions, like JDBC/ODBC. + +In other words: **ADBC is a single API for getting Arrow data in and out of databases**. +Underneath, ADBC driver implementations take care of bridging the actual system: + +- Databases with Arrow-native protocols can directly pass data through without conversion. +- Otherwise, drivers can be built for specific row-based protocols, optimizing conversions to and from Arrow data as best as possible for particular databases. +- As a fallback, drivers can be built that convert data from JDBC/ODBC, bridging existing databases into an Arrow-native API. 
+ +In all cases, the application is saved the trouble of wrapping APIs and doing data conversions. + +## Motivation + +Applications often use API standards like JDBC and ODBC to work with databases. +This lets them use the same API regardless of the underlying database, saving on development time. +Roughly speaking, when an application executes a query with these APIs: + +1. The application submits a SQL query via the JDBC/ODBC APIs. +2. The query is passed on to the driver. +3. The driver translates the query to a database-specific protocol and sends it to the database. +4. The database executes the query and returns the result set in a database-specific format. +5. The driver translates the result format into the JDBC/ODBC API. + + + + The query execution flow. + + +When columnar data comes into play, however, problems arise. +JDBC is a row-oriented API, and while ODBC can support columnar data, the type system and data representation is not a perfect match with Arrow. +In both cases, this leads to data conversions around steps 4–5, spending resources without performing "useful" work. + +This mismatch is important for columnar database systems, such as ClickHouse, Dremio, DuckDB, Google BigQuery, and others. +Clients, such as Apache Spark and pandas, would like to get columnar data directly from these systems. +Meanwhile, traditional database systems aren't going away, and these clients still want to consume data from them. + +In response, we've seen a few solutions: + +- *Just provide JDBC/ODBC drivers*. + These standards are here to stay, and it makes sense to provide these interfaces for applications that want them. + But if both sides are columnar, that means converting data into rows for JDBC/ODBC, only for the client to convert them back into columns! +- *Provide converters from JDBC/ODBC to Arrow*. + Some examples include [Turbodbc][turbodbc] and [arrow-jdbc][arrow-jdbc]. 
+ This approach reduces the burden on client applications, but doesn't fundamentally solve the problem. + Unnecessary data conversions are still required. +- *Provide special SDKs*. + All of the columnar systems listed above do offer ways to get Arrow data, such as via [Arrow Flight SQL][flight-sql]. + But client applications need to spend time to integrate with each of them. + (Just look at all the [connectors](https://trino.io/docs/current/connector.html) that Trino implements.) + And not every system offers this option. + +ADBC combines the advantages of the latter two solutions under one API. +In other words, ADBC provides a set of API definitions that client applications code to. +These API definitions are Arrow-based. +The application then links to or loads drivers for the actual database, which implement the API definitions. +If the database is Arrow-native, the driver can just pass the data through without conversion. +Otherwise, the driver converts the data to Arrow format first. + + + + The query execution flow with two different ADBC drivers. + + +1. The application submits
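The driver model the quoted draft describes — applications code to one columnar API, and each driver either passes Arrow-native data through or converts a row-based protocol once, inside the driver — can be sketched conceptually. All names below are invented for illustration; the real ADBC APIs differ:

```python
from abc import ABC, abstractmethod

class Driver(ABC):
    """Hypothetical stand-in for a database driver interface the
    application codes against; results are columnar either way."""

    @abstractmethod
    def execute(self, query: str) -> dict:
        """Return a columnar result: column name -> list of values."""

class ArrowNativeDriver(Driver):
    # Database speaks a columnar protocol: pass data through unchanged.
    def execute(self, query):
        return self._fetch_columnar(query)

    def _fetch_columnar(self, query):
        return {"id": [1, 2], "name": ["a", "b"]}  # stand-in for wire data

class RowProtocolDriver(Driver):
    # Database returns rows (e.g. a PostgreSQL-style wire format):
    # the driver converts to columns once, hidden from the application.
    def execute(self, query):
        rows = self._fetch_rows(query)
        return {"id": [r[0] for r in rows], "name": [r[1] for r in rows]}

    def _fetch_rows(self, query):
        return [(1, "a"), (2, "b")]  # stand-in for wire data

# The application sees the same columnar result from either driver.
for driver in (ArrowNativeDriver(), RowProtocolDriver()):
    print(driver.execute("SELECT id, name FROM t"))
```

The point of the sketch is the placement of the conversion: it happens at most once, inside the driver that needs it, rather than in every client application.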
[GitHub] [arrow-site] lidavidm commented on a diff in pull request #248: [Website] Add ADBC blog post
lidavidm commented on code in PR #248: URL: https://github.com/apache/arrow-site/pull/248#discussion_r1053730160 ## _posts/2022-12-31-arrow-adbc.md: ## @@ -0,0 +1,252 @@ +--- +layout: post +title: "Introducing ADBC: Database Access for Apache Arrow" +date: "2022-12-31 00:00:00" +author: pmc +categories: [application] +--- + + +The Arrow community would like to introduce version 1.0.0 of the [Arrow Database Connectivity (ADBC)][adbc] specification. +**ADBC aims to be an columnar, minimal-overhead alternative JDBC/ODBC for analytical applications**. +It defines vendor-agnostic and Arrow-based APIs for common database tasks, like executing queries and getting basic metadata. +These APIs are available, either directly or via bindings, in C/C++, Go, Java, Python, Ruby, and soon R. + +With ADBC, developers get both the benefits of using columnar Arrow data and having generic API abstractions. +Like [JDBC][jdbc]/[ODBC][odbc], ADBC defines database-independent interaction APIs, and relies on drivers to implement those APIs for particular databases. +ADBC aims to bring all of these together under a single API: + +- Vendor-specific Arrow-native protocols, like [Arrow Flight SQL][flight-sql] or those offered by ClickHouse or Google BigQuery; +- Non-columnar protocols, like the PostgreSQL wire format; +- Non-columnar API abstractions, like JDBC/ODBC. + +In other words: **ADBC is a single API for getting Arrow data in and out of databases**. +Underneath, ADBC driver implementations take care of bridging the actual system: + +- Databases with Arrow-native protocols can directly pass data through without conversion. +- Otherwise, drivers can be built for specific row-based protocols, optimizing conversions to and from Arrow data as best as possible for particular databases. +- As a fallback, drivers can be built that convert data from JDBC/ODBC, bridging existing databases into an Arrow-native API. 
+ +In all cases, the application is saved the trouble of wrapping APIs and doing data conversions. + +## Motivation + +Applications often use API standards like JDBC and ODBC to work with databases. +This lets them use the same API regardless of the underlying database, saving on development time. +Roughly speaking, when an application executes a query with these APIs: + +1. The application submits a SQL query via the JDBC/ODBC APIs. +2. The query is passed on to the driver. +3. The driver translates the query to a database-specific protocol and sends it to the database. +4. The database executes the query and returns the result set in a database-specific format. +5. The driver translates the result format into the JDBC/ODBC API. + + + + The query execution flow. + + +When columnar data comes into play, however, problems arise. +JDBC is a row-oriented API, and while ODBC can support columnar data, the type system and data representation is not a perfect match with Arrow. +In both cases, this leads to data conversions around steps 4–5, spending resources without performing "useful" work. + +This mismatch is important for columnar database systems, such as ClickHouse, Dremio, DuckDB, Google BigQuery, and others. +Clients, such as Apache Spark and pandas, would like to get columnar data directly from these systems. +Meanwhile, traditional database systems aren't going away, and these clients still want to consume data from them. + +In response, we've seen a few solutions: + +- *Just provide JDBC/ODBC drivers*. + These standards are here to stay, and it makes sense to provide these interfaces for applications that want them. + But if both sides are columnar, that means converting data into rows for JDBC/ODBC, only for the client to convert them back into columns! +- *Provide converters from JDBC/ODBC to Arrow*. + Some examples include [Turbodbc][turbodbc] and [arrow-jdbc][arrow-jdbc]. 
+ This approach reduces the burden on client applications, but doesn't fundamentally solve the problem. + Unnecessary data conversions are still required. +- *Provide special SDKs*. + All of the columnar systems listed above do offer ways to get Arrow data, such as via [Arrow Flight SQL][flight-sql]. + But client applications need to spend time to integrate with each of them. + (Just look at all the [connectors](https://trino.io/docs/current/connector.html) that Trino implements.) + And not every system offers this option. + +ADBC combines the advantages of the latter two solutions under one API. +In other words, ADBC provides a set of API definitions that client applications code to. +These API definitions are Arrow-based. +The application then links to or loads drivers for the actual database, which implement the API definitions. +If the database is Arrow-native, the driver can just pass the data through without conversion. +Otherwise, the driver converts the data to Arrow format first. + + + + The query execution flow with two different ADBC drivers. + + +1. The application submits
[GitHub] [arrow-site] ksuarez1423 commented on a diff in pull request #248: [Website] Add ADBC blog post
ksuarez1423 commented on code in PR #248: URL: https://github.com/apache/arrow-site/pull/248#discussion_r1053672404 ## _posts/2022-12-31-arrow-adbc.md: ## @@ -0,0 +1,252 @@ +--- +layout: post +title: "Introducing ADBC: Database Access for Apache Arrow" +date: "2022-12-31 00:00:00" +author: pmc +categories: [application] +--- + + +The Arrow community would like to introduce version 1.0.0 of the [Arrow Database Connectivity (ADBC)][adbc] specification. +**ADBC aims to be an columnar, minimal-overhead alternative JDBC/ODBC for analytical applications**. +It defines vendor-agnostic and Arrow-based APIs for common database tasks, like executing queries and getting basic metadata. +These APIs are available, either directly or via bindings, in C/C++, Go, Java, Python, Ruby, and soon R. + +With ADBC, developers get both the benefits of using columnar Arrow data and having generic API abstractions. +Like [JDBC][jdbc]/[ODBC][odbc], ADBC defines database-independent interaction APIs, and relies on drivers to implement those APIs for particular databases. +ADBC aims to bring all of these together under a single API: + +- Vendor-specific Arrow-native protocols, like [Arrow Flight SQL][flight-sql] or those offered by ClickHouse or Google BigQuery; +- Non-columnar protocols, like the PostgreSQL wire format; +- Non-columnar API abstractions, like JDBC/ODBC. + +In other words: **ADBC is a single API for getting Arrow data in and out of databases**. +Underneath, ADBC driver implementations take care of bridging the actual system: + +- Databases with Arrow-native protocols can directly pass data through without conversion. +- Otherwise, drivers can be built for specific row-based protocols, optimizing conversions to and from Arrow data as best as possible for particular databases. +- As a fallback, drivers can be built that convert data from JDBC/ODBC, bridging existing databases into an Arrow-native API. 
+ +In all cases, the application is saved the trouble of wrapping APIs and doing data conversions. + +## Motivation + +Applications often use API standards like JDBC and ODBC to work with databases. +This lets them use the same API regardless of the underlying database, saving on development time. +Roughly speaking, when an application executes a query with these APIs: + +1. The application submits a SQL query via the JDBC/ODBC APIs. +2. The query is passed on to the driver. +3. The driver translates the query to a database-specific protocol and sends it to the database. +4. The database executes the query and returns the result set in a database-specific format. +5. The driver translates the result format into the JDBC/ODBC API. + + + + The query execution flow. + + +When columnar data comes into play, however, problems arise. +JDBC is a row-oriented API, and while ODBC can support columnar data, the type system and data representation is not a perfect match with Arrow. +In both cases, this leads to data conversions around steps 4–5, spending resources without performing "useful" work. + +This mismatch is important for columnar database systems, such as ClickHouse, Dremio, DuckDB, Google BigQuery, and others. +Clients, such as Apache Spark and pandas, would like to get columnar data directly from these systems. +Meanwhile, traditional database systems aren't going away, and these clients still want to consume data from them. + +In response, we've seen a few solutions: + +- *Just provide JDBC/ODBC drivers*. + These standards are here to stay, and it makes sense to provide these interfaces for applications that want them. + But if both sides are columnar, that means converting data into rows for JDBC/ODBC, only for the client to convert them back into columns! +- *Provide converters from JDBC/ODBC to Arrow*. + Some examples include [Turbodbc][turbodbc] and [arrow-jdbc][arrow-jdbc]. 
+ This approach reduces the burden on client applications, but doesn't fundamentally solve the problem. + Unnecessary data conversions are still required. +- *Provide special SDKs*. + All of the columnar systems listed above do offer ways to get Arrow data, such as via [Arrow Flight SQL][flight-sql]. + But client applications need to spend time to integrate with each of them. + (Just look at all the [connectors](https://trino.io/docs/current/connector.html) that Trino implements.) + And not every system offers this option. + +ADBC combines the advantages of the latter two solutions under one API. +In other words, ADBC provides a set of API definitions that client applications code to. +These API definitions are Arrow-based. +The application then links to or loads drivers for the actual database, which implement the API definitions. +If the database is Arrow-native, the driver can just pass the data through without conversion. +Otherwise, the driver converts the data to Arrow format first. + + + + The query execution flow with two different ADBC drivers. + + +1. The application
[GitHub] [arrow-site] lidavidm commented on a diff in pull request #248: [Website] Add ADBC blog post
lidavidm commented on code in PR #248: URL: https://github.com/apache/arrow-site/pull/248#discussion_r1053653049 ## _posts/2022-12-31-arrow-adbc.md: ## @@ -0,0 +1,252 @@ +--- +layout: post +title: "Introducing ADBC: Database Access for Apache Arrow" +date: "2022-12-31 00:00:00" +author: pmc +categories: [application] +--- + + +The Arrow community would like to introduce version 1.0.0 of the [Arrow Database Connectivity (ADBC)][adbc] specification. +**ADBC aims to be an columnar, minimal-overhead alternative JDBC/ODBC for analytical applications**. +It defines vendor-agnostic and Arrow-based APIs for common database tasks, like executing queries and getting basic metadata. +These APIs are available, either directly or via bindings, in C/C++, Go, Java, Python, Ruby, and soon R. + +With ADBC, developers get both the benefits of using columnar Arrow data and having generic API abstractions. +Like [JDBC][jdbc]/[ODBC][odbc], ADBC defines database-independent interaction APIs, and relies on drivers to implement those APIs for particular databases. +ADBC aims to bring all of these together under a single API: + +- Vendor-specific Arrow-native protocols, like [Arrow Flight SQL][flight-sql] or those offered by ClickHouse or Google BigQuery; +- Non-columnar protocols, like the PostgreSQL wire format; +- Non-columnar API abstractions, like JDBC/ODBC. + +In other words: **ADBC is a single API for getting Arrow data in and out of databases**. +Underneath, ADBC driver implementations take care of bridging the actual system: + +- Databases with Arrow-native protocols can directly pass data through without conversion. +- Otherwise, drivers can be built for specific row-based protocols, optimizing conversions to and from Arrow data as best as possible for particular databases. +- As a fallback, drivers can be built that convert data from JDBC/ODBC, bridging existing databases into an Arrow-native API. 
+ +In all cases, the application is saved the trouble of wrapping APIs and doing data conversions. + +## Motivation + +Applications often use API standards like JDBC and ODBC to work with databases. +This lets them use the same API regardless of the underlying database, saving on development time. +Roughly speaking, when an application executes a query with these APIs: + +1. The application submits a SQL query via the JDBC/ODBC APIs. +2. The query is passed on to the driver. +3. The driver translates the query to a database-specific protocol and sends it to the database. +4. The database executes the query and returns the result set in a database-specific format. +5. The driver translates the result format into the JDBC/ODBC API. + + + + The query execution flow. + + +When columnar data comes into play, however, problems arise. +JDBC is a row-oriented API, and while ODBC can support columnar data, the type system and data representation is not a perfect match with Arrow. +In both cases, this leads to data conversions around steps 4–5, spending resources without performing "useful" work. + +This mismatch is important for columnar database systems, such as ClickHouse, Dremio, DuckDB, Google BigQuery, and others. +Clients, such as Apache Spark and pandas, would like to get columnar data directly from these systems. +Meanwhile, traditional database systems aren't going away, and these clients still want to consume data from them. + +In response, we've seen a few solutions: + +- *Just provide JDBC/ODBC drivers*. + These standards are here to stay, and it makes sense to provide these interfaces for applications that want them. + But if both sides are columnar, that means converting data into rows for JDBC/ODBC, only for the client to convert them back into columns! +- *Provide converters from JDBC/ODBC to Arrow*. + Some examples include [Turbodbc][turbodbc] and [arrow-jdbc][arrow-jdbc]. 
+ This approach reduces the burden on client applications, but doesn't fundamentally solve the problem. + Unnecessary data conversions are still required. +- *Provide special SDKs*. + All of the columnar systems listed above do offer ways to get Arrow data, such as via [Arrow Flight SQL][flight-sql]. + But client applications need to spend time to integrate with each of them. + (Just look at all the [connectors](https://trino.io/docs/current/connector.html) that Trino implements.) + And not every system offers this option. + +ADBC combines the advantages of the latter two solutions under one API. +In other words, ADBC provides a set of API definitions that client applications code to. +These API definitions are Arrow-based. +The application then links to or loads drivers for the actual database, which implement the API definitions. +If the database is Arrow-native, the driver can just pass the data through without conversion. +Otherwise, the driver converts the data to Arrow format first. + + + + The query execution flow with two different ADBC drivers. + + +1. The application submits
[GitHub] [arrow-site] lidavidm commented on a diff in pull request #248: [Website] Add ADBC blog post
lidavidm commented on code in PR #248: URL: https://github.com/apache/arrow-site/pull/248#discussion_r1053650752 ## _posts/2022-12-31-arrow-adbc.md: ## @@ -0,0 +1,252 @@ +--- +layout: post +title: "Introducing ADBC: Database Access for Apache Arrow" +date: "2022-12-31 00:00:00" +author: pmc +categories: [application] +--- + + +The Arrow community would like to introduce version 1.0.0 of the [Arrow Database Connectivity (ADBC)][adbc] specification. +**ADBC aims to be an columnar, minimal-overhead alternative JDBC/ODBC for analytical applications**. +It defines vendor-agnostic and Arrow-based APIs for common database tasks, like executing queries and getting basic metadata. +These APIs are available, either directly or via bindings, in C/C++, Go, Java, Python, Ruby, and soon R. + +With ADBC, developers get both the benefits of using columnar Arrow data and having generic API abstractions. +Like [JDBC][jdbc]/[ODBC][odbc], ADBC defines database-independent interaction APIs, and relies on drivers to implement those APIs for particular databases. +ADBC aims to bring all of these together under a single API: + +- Vendor-specific Arrow-native protocols, like [Arrow Flight SQL][flight-sql] or those offered by ClickHouse or Google BigQuery; +- Non-columnar protocols, like the PostgreSQL wire format; +- Non-columnar API abstractions, like JDBC/ODBC. + +In other words: **ADBC is a single API for getting Arrow data in and out of databases**. Review Comment: Right, three classes within a single API. I'll think about rewording this a bit.
[GitHub] [arrow-site] ianmcook merged pull request #290: [Website] Add ADBC to Subprojects menu
ianmcook merged PR #290: URL: https://github.com/apache/arrow-site/pull/290
[GitHub] [arrow-site] github-actions[bot] commented on pull request #290: [Website] Add ADBC to Subprojects menu
github-actions[bot] commented on PR #290: URL: https://github.com/apache/arrow-site/pull/290#issuecomment-1359609916 Thanks for opening a pull request! Could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW Then could you also rename pull request title in the following format? ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY} See also: * [Other pull requests](https://github.com/apache/arrow-site/pulls/) * [Contribution Guidelines - How to contribute patches](https://arrow.apache.org/docs/developers/contributing.html#how-to-contribute-patches)