[GitHub] [arrow-site] thisisnic commented on a diff in pull request #300: [Website] Version 11.0.0 blog post
thisisnic commented on code in PR #300: URL: https://github.com/apache/arrow-site/pull/300#discussion_r1082814781 ## _posts/2023-01-18-11.0.0-release.md: ## @@ -0,0 +1,119 @@ +--- +layout: post +title: "Apache Arrow 11.0.0 Release" +date: "2023-01-18 00:00:00" +author: pmc +categories: [release] +--- + + + +The Apache Arrow team is pleased to announce the 11.0.0 release. This covers +over 3 months of development work and includes [**423 resolved issues**][1] +from [**95 distinct contributors**][2]. See the [Install Page](https://arrow.apache.org/install/) +to learn how to get the libraries for your platform. + +The release notes below are not exhaustive and only expose selected highlights +of the release. Many other bugfixes and improvements have been made: we refer +you to the [complete changelog][3]. + +## Community + +Since the 10.0.0 release, Ben Baumgold, Will Jones, Eric Patrick Hanson, +Curtis Vogt, Yang Jiang, Jarrett Revels, Raúl Cumplido, Jacob Wujciak, +Jie Wen and Brent Gardner have been invited to be committers. +Kun Liu has joined the Project Management Committee (PMC). + +As per our newly started tradition of rotating the PMC chair once a year, +Andrew Lamb was elected as the new PMC chair and VP. + +Thanks for your contributions and participation in the project! + +## Columnar Format Notes + +## Arrow Flight RPC notes + +In the C++/Python Flight clients, DoAction now properly streams the results, instead of blocking until the call finishes. Applications that did not consume the iterator before should fully consume the result. ([#15069](https://github.com/apache/arrow/issues/15069)) + +## C++ notes + +## C# notes + +No major changes to C#. 
 + +## Go notes +* Go's benchmarks will now get added to [Conbench](https://conbench.ursa.dev) alongside the benchmarks for other implementations [GH-32983](https://github.com/apache/arrow/issues/32983) +* Exposed FlightService_ServiceDesc and RegisterFlightServiceServer to allow easily incorporating a flight service into an existing gRPC server [GH-15174](https://github.com/apache/arrow/issues/15174) + +### Arrow +* Function `ApproxEquals` was implemented for scalar values [GH-29581](https://github.com/apache/arrow/issues/29581) +* `UnmarshalJSON` for the `RecordBuilder` now properly handles extra unknown fields with complex/nested values [GH-31840](https://github.com/apache/arrow/issues/31840) +* Decimal128 and Decimal256 type support has been added to the CSV reader [GH-33111](https://github.com/apache/arrow/issues/33111) +* Fixed bug in `array.UnionBuilder` where the `Len` method always returned 0 [GH-14775](https://github.com/apache/arrow/issues/14775) +* Fixed bug in handling slices of Map arrays when marshalling to JSON and for IPC [GH-14780](https://github.com/apache/arrow/issues/14780) +* Fixed memory leak when compressing IPC message body buffers [GH-14883](https://github.com/apache/arrow/issues/14883) +* Added the ability to easily append scalar values to array builders [GH-15005](https://github.com/apache/arrow/issues/15005) + +### Compute +* Scalar binary (add/subtract/multiply/divide/etc.) and unary arithmetic (abs/neg/sqrt/sign/etc.) has been implemented for the compute package [GH-33086](https://github.com/apache/arrow/issues/33086); this includes convenience functions such as `compute.Add` and `compute.Divide` +* Scalar boolean functions such as AND/OR/XOR 
have been implemented for compute [GH-33279](https://github.com/apache/arrow/issues/33279) +* Scalar comparison function kernels have been implemented for compute (equal/greater/greater_equal/less/less_equal) [GH-33308](https://github.com/apache/arrow/issues/33308) +* Scalar compute functions are compatible with dictionary-encoded arrays by casting them to their value types [GH-33502](https://github.com/apache/arrow/issues/33502) + +### Parquet +* Fixed a panic when decoding a delta_bit_packed encoded column [GH-33483](https://github.com/apache/arrow/issues/33483) +* Fixed memory leak from Allocator in `pqarrow.WriteArrowToColumn` [GH-14865](https://github.com/apache/arrow/issues/14865) +* Fixed `writer.WriteBatch` to properly handle writing encrypted Parquet columns and no longer silently fail, but instead propagate an error [GH-14940](https://github.com/apache/arrow/issues/14940) + +## Java notes + +## JavaScript notes + +* Bugfixes and dependency updates. +* Arrow now requires BigInt support. [GH-33681](https://github.com/apache/arrow/pull/33682) + +## Python notes + +New features and improvements: + +* NumPy conversion for ListArray is improved, taking the sliced offset into account [(GH-20512)](https://github.com/apache/arrow/issues/20512) +* DataFrame Interchange Protocol is implemented for ``pyarrow.Table`` ([GH-33346](https://github.com/apache/arrow/issues/33346)). + +## R notes + +For more on what’s in the 11.0.0 R package, see the [R changelog][4]. Review Comment: ```suggestion * map_batches() is lazy by default; it now
[GitHub] [arrow-site] raulcd commented on a diff in pull request #300: [Website] Version 11.0.0 blog post
raulcd commented on code in PR #300: URL: https://github.com/apache/arrow-site/pull/300#discussion_r1082800466 ## _posts/2023-01-18-11.0.0-release.md: ## @@ -0,0 +1,82 @@ +--- +layout: post +title: "Apache Arrow 11.0.0 Release" +date: "2023-01-18 00:00:00" +author: pmc +categories: [release] +--- + + + +The Apache Arrow team is pleased to announce the 11.0.0 release. This covers +over 3 months of development work and includes [**423 resolved issues**][1] +from [**95 distinct contributors**][2]. See the [Install Page](https://arrow.apache.org/install/) +to learn how to get the libraries for your platform. + +The release notes below are not exhaustive and only expose selected highlights +of the release. Many other bugfixes and improvements have been made: we refer +you to the [complete changelog][3]. + +## Community + +Since the 10.0.0 release, Ben Baumgold, Will Jones, Eric Patrick Hanson, +Curtis Vogt, Yang Jiang, Jarrett Revels, Raúl Cumplido, Jacob Wujciak, +Jie Wen and Brent Gardner have been invited to be committers. +Kun Liu have joined the Project Management Committee (PMC). Review Comment: Thanks, I've added a note about the new PMC chair. I've mainly taken it from the announcement email. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow-site] alamb merged pull request #294: [WEBSITE] DataFusion 16.0.0 blog post
alamb merged PR #294: URL: https://github.com/apache/arrow-site/pull/294
[GitHub] [arrow-site] kou commented on a diff in pull request #300: [Website] Version 11.0.0 blog post
kou commented on code in PR #300: URL: https://github.com/apache/arrow-site/pull/300#discussion_r1080717581 ## _posts/2023-01-18-11.0.0-release.md: ## @@ -0,0 +1,82 @@ +--- +layout: post +title: "Apache Arrow 11.0.0 Release" +date: "2023-01-18 00:00:00" +author: pmc +categories: [release] +--- + + + +The Apache Arrow team is pleased to announce the 11.0.0 release. This covers +over 3 months of development work and includes [**423 resolved issues**][1] +from [**95 distinct contributors**][2]. See the [Install Page](https://arrow.apache.org/install/) +to learn how to get the libraries for your platform. + +The release notes below are not exhaustive and only expose selected highlights +of the release. Many other bugfixes and improvements have been made: we refer +you to the [complete changelog][3]. + +## Community + +Since the 10.0.0 release, Ben Baumgold, Will Jones, Eric Patrick Hanson, +Curtis Vogt, Yang Jiang, Jarrett Revels, Raúl Cumplido, Jacob Wujciak, +Jie Wen and Brent Gardner have been invited to be committers. +Kun Liu have joined the Project Management Committee (PMC). Review Comment: > Could you help me validate that? Valid! > Also, let me know if you want me to add a note about the PMC rotation here. Yes, please.
[GitHub] [arrow-site] alamb commented on a diff in pull request #299: MINOR: [Website] Reword ADBC announcement
alamb commented on code in PR #299: URL: https://github.com/apache/arrow-site/pull/299#discussion_r1073941041 ## _posts/2023-01-05-introducing-arrow-adbc.md: ## @@ -144,7 +144,7 @@ ADBC fills a specific niche that related projects do not address. It is both: Vendor-neutral (database APIs) - Vendor-specific (database protocols) + Database protocols Review Comment: Maybe a better phrase would be "Database specific protocols"
[GitHub] [arrow-site] domoritz commented on a diff in pull request #300: [Website] Version 11.0.0 blog post
domoritz commented on code in PR #300: URL: https://github.com/apache/arrow-site/pull/300#discussion_r1073916810 ## _posts/2023-01-18-11.0.0-release.md: ## @@ -0,0 +1,82 @@ +--- +layout: post +title: "Apache Arrow 11.0.0 Release" +date: "2023-01-18 00:00:00" +author: pmc +categories: [release] +--- + + + +The Apache Arrow team is pleased to announce the 11.0.0 release. This covers +over 3 months of development work and includes [**423 resolved issues**][1] +from [**95 distinct contributors**][2]. See the [Install Page](https://arrow.apache.org/install/) +to learn how to get the libraries for your platform. + +The release notes below are not exhaustive and only expose selected highlights +of the release. Many other bugfixes and improvements have been made: we refer +you to the [complete changelog][3]. + +## Community + +Since the 10.0.0 release, Ben Baumgold, Will Jones, Eric Patrick Hanson, +Curtis Vogt, Yang Jiang, Jarrett Revels, Raúl Cumplido, Jacob Wujciak, +Jie Wen and Brent Gardner have been invited to be committers. +Kun Liu have joined the Project Management Committee (PMC). + +Thanks for your contributions and participation in the project! + +## Columnar Format Notes + +## Arrow Flight RPC notes + +## C++ notes + +## C# notes + +## Go notes + +## Java notes + +## JavaScript notes Review Comment: ```suggestion ## JavaScript notes * Bugfixes and dependency updates. * Arrow now requires BigInt support. [GH-33681](https://github.com/apache/arrow/pull/33682) ```
[GitHub] [arrow-site] eerhardt commented on a diff in pull request #300: [Website] Version 11.0.0 blog post
eerhardt commented on code in PR #300: URL: https://github.com/apache/arrow-site/pull/300#discussion_r1073859493 ## _posts/2023-01-18-11.0.0-release.md: ## @@ -0,0 +1,82 @@ +--- +layout: post +title: "Apache Arrow 11.0.0 Release" +date: "2023-01-18 00:00:00" +author: pmc +categories: [release] +--- + + + +The Apache Arrow team is pleased to announce the 11.0.0 release. This covers +over 3 months of development work and includes [**423 resolved issues**][1] +from [**95 distinct contributors**][2]. See the [Install Page](https://arrow.apache.org/install/) +to learn how to get the libraries for your platform. + +The release notes below are not exhaustive and only expose selected highlights +of the release. Many other bugfixes and improvements have been made: we refer +you to the [complete changelog][3]. + +## Community + +Since the 10.0.0 release, Ben Baumgold, Will Jones, Eric Patrick Hanson, +Curtis Vogt, Yang Jiang, Jarrett Revels, Raúl Cumplido, Jacob Wujciak, +Jie Wen and Brent Gardner have been invited to be committers. +Kun Liu have joined the Project Management Committee (PMC). + +Thanks for your contributions and participation in the project! + +## Columnar Format Notes + +## Arrow Flight RPC notes + +## C++ notes + +## C# notes Review Comment: No, there haven't been any C# changes of note in 11.0.
[GitHub] [arrow-site] zeroshade commented on a diff in pull request #300: [Website] Version 11.0.0 blog post
zeroshade commented on code in PR #300: URL: https://github.com/apache/arrow-site/pull/300#discussion_r1073769521 ## _posts/2023-01-18-11.0.0-release.md: ## @@ -0,0 +1,82 @@ +--- +layout: post +title: "Apache Arrow 11.0.0 Release" +date: "2023-01-18 00:00:00" +author: pmc +categories: [release] +--- + + + +The Apache Arrow team is pleased to announce the 11.0.0 release. This covers +over 3 months of development work and includes [**423 resolved issues**][1] +from [**95 distinct contributors**][2]. See the [Install Page](https://arrow.apache.org/install/) +to learn how to get the libraries for your platform. + +The release notes below are not exhaustive and only expose selected highlights +of the release. Many other bugfixes and improvements have been made: we refer +you to the [complete changelog][3]. + +## Community + +Since the 10.0.0 release, Ben Baumgold, Will Jones, Eric Patrick Hanson, +Curtis Vogt, Yang Jiang, Jarrett Revels, Raúl Cumplido, Jacob Wujciak, +Jie Wen and Brent Gardner have been invited to be committers. +Kun Liu have joined the Project Management Committee (PMC). + +Thanks for your contributions and participation in the project! 
+ +## Columnar Format Notes + +## Arrow Flight RPC notes + +## C++ notes + +## C# notes + +## Go notes Review Comment: ```suggestion ## Go notes * Go's benchmarks will now get added to [Conbench](https://conbench.ursa.dev) alongside the benchmarks for other implementations (GH-32983)[https://github.com/apache/arrow/issues/32983] * Exposed FlightService_ServiceDesc and RegisterFlightServiceServer to allow easily incorporating a flight service into an existing gRPC server (GH-15174)[https://github.com/apache/arrow/issues/15174] ### Arrow * Function `ApproxEquals` was implemented for scalar values (GH-29581)[https://github.com/apache/arrow/issues/29581] * `UnmarshalJSON` for the `RecordBuilder` now properly handles extra unknown fields with complex/nested values (GH-31840)[https://github.com/apache/arrow/issues/31840] * Decimal128 and Decimal256 type support has been added to the CSV reader (GH-33111)[https://github.com/apache/arrow/issues/33111] * Fixed bug in `array.UnionBuilder` where `Len` method always returned 0 (GH-14775)[https://github.com/apache/arrow/issues/14775] * Fixed bug for handling slices of Map arrays when marshalling to JSON and for IPC (GH-14780)[https://github.com/apache/arrow/issues/14780] * Fixed memory leak when compressing IPC message body buffers (GH-14883)[https://github.com/apache/arrow/issues/14883] * Added the ability to easily append scalar values to array builders (GH-15005)[https://github.com/apache/arrow/issues/15005] Compute * Scalar binary (add/subtract/multiply/divide/etc.) and unary arithmetic (abs/neg/sqrt/sign/etc.) has been implemented for the compute package (GH-33086)[https://github.com/apache/arrow/issues/33086] this includes easy functions like `compute.Add` and `compute.Divide` etc. * Scalar boolean functions like AND/OR/XOR/etc. 
have been implemented for compute (GH-33279)[https://github.com/apache/arrow/issues/33279] * Scalar comparison function kernels have been implemented for compute (equal/greater/greater_equal/less/less_equal) (GH-33308)[https://github.com/apache/arrow/issues/33308] * Scalar compute functions are compatible with dictionary encoded arrays by casting them to their value types (GH-33502)[https://github.com/apache/arrow/issues/33502] ### Parquet * Panic when decoding a delta_bit_packed encoded column has been fixed (GH-33483)[https://github.com/apache/arrow/issues/33483] * Fixed memory leak from Allocator in `pqarrow.WriteArrowToColumn` (GH-14865)[https://github.com/apache/arrow/issues/14865] * Fixed `writer.WriteBatch` to properly handle writing encrypted parquet columns and no longer silently fail, but instead propagate an error (GH-14940)[https://github.com/apache/arrow/issues/14940] ```
[GitHub] [arrow-site] lidavidm commented on a diff in pull request #300: [Website] Version 11.0.0 blog post
lidavidm commented on code in PR #300: URL: https://github.com/apache/arrow-site/pull/300#discussion_r1073512952 ## _posts/2023-01-18-11.0.0-release.md: ## @@ -0,0 +1,82 @@ +--- +layout: post +title: "Apache Arrow 11.0.0 Release" +date: "2023-01-18 00:00:00" +author: pmc +categories: [release] +--- + + + +The Apache Arrow team is pleased to announce the 11.0.0 release. This covers +over 3 months of development work and includes [**423 resolved issues**][1] +from [**95 distinct contributors**][2]. See the [Install Page](https://arrow.apache.org/install/) +to learn how to get the libraries for your platform. + +The release notes below are not exhaustive and only expose selected highlights +of the release. Many other bugfixes and improvements have been made: we refer +you to the [complete changelog][3]. + +## Community + +Since the 10.0.0 release, Ben Baumgold, Will Jones, Eric Patrick Hanson, +Curtis Vogt, Yang Jiang, Jarrett Revels, Raúl Cumplido, Jacob Wujciak, +Jie Wen and Brent Gardner have been invited to be committers. +Kun Liu have joined the Project Management Committee (PMC). + +Thanks for your contributions and participation in the project! + +## Columnar Format Notes + +## Arrow Flight RPC notes Review Comment: ```suggestion ## Arrow Flight RPC notes In the C++/Python Flight clients, DoAction now properly streams the results, instead of blocking until the call finishes. Applications that did not consume the iterator before should fully consume the result. ([#15069](https://github.com/apache/arrow/issues/15069)) ```
[GitHub] [arrow-site] AlenkaF commented on a diff in pull request #300: [Website] Version 11.0.0 blog post
AlenkaF commented on code in PR #300: URL: https://github.com/apache/arrow-site/pull/300#discussion_r1073474010 ## _posts/2023-01-18-11.0.0-release.md: ## @@ -0,0 +1,82 @@ +--- +layout: post +title: "Apache Arrow 11.0.0 Release" +date: "2023-01-18 00:00:00" +author: pmc +categories: [release] +--- + + + +The Apache Arrow team is pleased to announce the 11.0.0 release. This covers +over 3 months of development work and includes [**423 resolved issues**][1] +from [**95 distinct contributors**][2]. See the [Install Page](https://arrow.apache.org/install/) +to learn how to get the libraries for your platform. + +The release notes below are not exhaustive and only expose selected highlights +of the release. Many other bugfixes and improvements have been made: we refer +you to the [complete changelog][3]. + +## Community + +Since the 10.0.0 release, Ben Baumgold, Will Jones, Eric Patrick Hanson, +Curtis Vogt, Yang Jiang, Jarrett Revels, Raúl Cumplido, Jacob Wujciak, +Jie Wen and Brent Gardner have been invited to be committers. +Kun Liu have joined the Project Management Committee (PMC). + +Thanks for your contributions and participation in the project! + +## Columnar Format Notes + +## Arrow Flight RPC notes + +## C++ notes + +## C# notes + +## Go notes + +## Java notes + +## JavaScript notes + +## Python notes Review Comment: ```suggestion ## Python notes New features and improvements: * Numpy conversion for ListArray is improved taking into account sliced offset [(GH-20512)](https://github.com/apache/arrow/issues/20512) * DataFrame Interchange Protocol is implemented for ``pyarrow.Table`` ([GH-33346](https://github.com/apache/arrow/issues/33346)). ```
[GitHub] [arrow-site] raulcd commented on a diff in pull request #300: [Website] Version 11.0.0 blog post
raulcd commented on code in PR #300: URL: https://github.com/apache/arrow-site/pull/300#discussion_r1073442447 ## _posts/2023-01-18-11.0.0-release.md: ## @@ -0,0 +1,89 @@ +--- +layout: post +title: "Apache Arrow 11.0.0 Release" +date: "2023-01-18 00:00:00" +author: pmc +categories: [release] +--- + + + +The Apache Arrow team is pleased to announce the 11.0.0 release. This covers +over 3 months of development work and includes [**423 resolved issues**][1] +from [**95 distinct contributors**][2]. See the [Install Page](https://arrow.apache.org/install/) Review Comment: these numbers might vary with final release. @raulcd to validate before publishing. ## _posts/2023-01-18-11.0.0-release.md: ## @@ -0,0 +1,82 @@ +--- +layout: post +title: "Apache Arrow 11.0.0 Release" +date: "2023-01-18 00:00:00" +author: pmc +categories: [release] +--- + + + +The Apache Arrow team is pleased to announce the 11.0.0 release. This covers +over 3 months of development work and includes [**423 resolved issues**][1] +from [**95 distinct contributors**][2]. See the [Install Page](https://arrow.apache.org/install/) +to learn how to get the libraries for your platform. + +The release notes below are not exhaustive and only expose selected highlights +of the release. Many other bugfixes and improvements have been made: we refer +you to the [complete changelog][3]. + +## Community + +Since the 10.0.0 release, Ben Baumgold, Will Jones, Eric Patrick Hanson, +Curtis Vogt, Yang Jiang, Jarrett Revels, Raúl Cumplido, Jacob Wujciak, +Jie Wen and Brent Gardner have been invited to be committers. +Kun Liu have joined the Project Management Committee (PMC). + +Thanks for your contributions and participation in the project! + +## Columnar Format Notes + +## Arrow Flight RPC notes + +## C++ notes Review Comment: @pitrou can you help with the notes? 
## _posts/2023-01-18-11.0.0-release.md: ## @@ -0,0 +1,82 @@ +--- +layout: post +title: "Apache Arrow 11.0.0 Release" +date: "2023-01-18 00:00:00" +author: pmc +categories: [release] +--- + + + +The Apache Arrow team is pleased to announce the 11.0.0 release. This covers +over 3 months of development work and includes [**423 resolved issues**][1] +from [**95 distinct contributors**][2]. See the [Install Page](https://arrow.apache.org/install/) +to learn how to get the libraries for your platform. + +The release notes below are not exhaustive and only expose selected highlights +of the release. Many other bugfixes and improvements have been made: we refer +you to the [complete changelog][3]. + +## Community + +Since the 10.0.0 release, Ben Baumgold, Will Jones, Eric Patrick Hanson, +Curtis Vogt, Yang Jiang, Jarrett Revels, Raúl Cumplido, Jacob Wujciak, +Jie Wen and Brent Gardner have been invited to be committers. +Kun Liu have joined the Project Management Committee (PMC). + +Thanks for your contributions and participation in the project! + +## Columnar Format Notes + +## Arrow Flight RPC notes + +## C++ notes + +## C# notes + +## Go notes + +## Java notes + +## JavaScript notes Review Comment: @domoritz any notes for the 11.0.0 release? ## _posts/2023-01-18-11.0.0-release.md: ## @@ -0,0 +1,82 @@ +--- +layout: post +title: "Apache Arrow 11.0.0 Release" +date: "2023-01-18 00:00:00" +author: pmc +categories: [release] +--- + + + +The Apache Arrow team is pleased to announce the 11.0.0 release. This covers +over 3 months of development work and includes [**423 resolved issues**][1] +from [**95 distinct contributors**][2]. See the [Install Page](https://arrow.apache.org/install/) +to learn how to get the libraries for your platform. + +The release notes below are not exhaustive and only expose selected highlights +of the release. Many other bugfixes and improvements have been made: we refer +you to the [complete changelog][3]. 
+ +## Community + +Since the 10.0.0 release, Ben Baumgold, Will Jones, Eric Patrick Hanson, +Curtis Vogt, Yang Jiang, Jarrett Revels, Raúl Cumplido, Jacob Wujciak, +Jie Wen and Brent Gardner have been invited to be committers. +Kun Liu have joined the Project Management Committee (PMC). + +Thanks for your contributions and participation in the project! + +## Columnar Format Notes + +## Arrow Flight RPC notes Review Comment: @lidavidm can you help with the release notes? ## _posts/2023-01-18-11.0.0-release.md: ## @@ -0,0 +1,82 @@ +--- +layout: post +title: "Apache Arrow 11.0.0 Release" +date: "2023-01-18 00:00:00" +author: pmc +categories: [release] +--- + + + +The Apache Arrow team is pleased to announce the 11.0.0 release. This covers +over 3 months of development work and includes [**423 resolved issues**][1] +from [**95 distinct contributors**][2]. See the [Install Page](https://arrow.apache.org/install/) +to learn how to get the libraries for your platform. + +The release notes below are not exhaustive and only expose selected highlights +of the release. Many other
[GitHub] [arrow-site] github-actions[bot] commented on pull request #300: [Website] Version 11.0.0 blog post
github-actions[bot] commented on PR #300: URL: https://github.com/apache/arrow-site/pull/300#issuecomment-1386935967 Thanks for opening a pull request! Could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW Then could you also rename pull request title in the following format? ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY} See also: * [Other pull requests](https://github.com/apache/arrow-site/pulls/) * [Contribution Guidelines - How to contribute patches](https://arrow.apache.org/docs/developers/contributing.html#how-to-contribute-patches)
[GitHub] [arrow-site] raulcd opened a new pull request, #300: [Website] Version 11.0.0 blog post
raulcd opened a new pull request, #300: URL: https://github.com/apache/arrow-site/pull/300 PR to start adding the blog post information for the Release 11.0.0
[GitHub] [arrow-site] lidavidm commented on a diff in pull request #299: MINOR: [Website] Reword ADBC announcement
lidavidm commented on code in PR #299: URL: https://github.com/apache/arrow-site/pull/299#discussion_r1072916029 ## _posts/2023-01-05-introducing-arrow-adbc.md: ## @@ -66,10 +66,10 @@ Developers have a few options: Libraries like [Turbodbc][turbodbc] and [arrow-jdbc][arrow-jdbc] handle row-to-columnar conversions for clients. But this doesn't fundamentally solve the problem. Unnecessary data conversions are still required. -- *Use vendor-specific protocols*. - For some databases, applications can use a database-specific protocol or SDK to directly get Arrow data. - For example, applications could use Dremio via [Arrow Flight SQL][flight-sql]. - But client applications that want to support multiple database vendors would need to integrate with each of them. +- *Directly use database protocols*. + For some databases, applications can use a database protocol or SDK to directly get Arrow data. + For example, applications could use be written with [Arrow Flight SQL][flight-sql] to connect to Dremio and other databases that support the Flight SQL protocol. Review Comment: ```suggestion For example, applications could use [Arrow Flight SQL][flight-sql] to connect to Dremio and other databases that support the Flight SQL protocol. ``` (If you want, I think it's fair to link "Dremio" to the website as well.) ## _posts/2023-01-05-introducing-arrow-adbc.md: ## @@ -144,7 +144,7 @@ ADBC fills a specific niche that related projects do not address. It is both: Vendor-neutral (database APIs) - Vendor-specific (database protocols) + Database protocols Review Comment: I think it's still fair to call them vendor-specific; after all, multiple databases also use the PostgreSQL protocol (it just doesn't have a generic name). Maybe "varies by vendor (database protocols)"? ## _posts/2023-01-05-introducing-arrow-adbc.md: ## @@ -66,10 +66,10 @@ Developers have a few options: Libraries like [Turbodbc][turbodbc] and [arrow-jdbc][arrow-jdbc] handle row-to-columnar conversions for clients. 
But this doesn't fundamentally solve the problem. Unnecessary data conversions are still required. -- *Use vendor-specific protocols*. - For some databases, applications can use a database-specific protocol or SDK to directly get Arrow data. - For example, applications could use Dremio via [Arrow Flight SQL][flight-sql]. - But client applications that want to support multiple database vendors would need to integrate with each of them. +- *Directly use database protocols*. + For some databases, applications can use a database protocol or SDK to directly get Arrow data. + For example, applications could use be written with [Arrow Flight SQL][flight-sql] to connect to Dremio and other databases that support the Flight SQL protocol. + But not all databases support the Flight SQL protocol. An example is Google BigQuery, which has a separate SDK that returns Arrow data. In this case, client applications that want to support additional protocols would need to integrate with each of them. Review Comment: ```suggestion But not all databases support Flight SQL, even if they support Arrow data. An example is Google BigQuery, which has a separate SDK that returns Arrow data. In this case, client applications that want to support additional databases would need to integrate with each of their protocols. ```
[GitHub] [arrow-site] github-actions[bot] commented on pull request #299: MINOR: [Website] Reword ADBC announcement
github-actions[bot] commented on PR #299: URL: https://github.com/apache/arrow-site/pull/299#issuecomment-1386176466 Thanks for opening a pull request! Could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW Then could you also rename pull request title in the following format? ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY} See also: * [Other pull requests](https://github.com/apache/arrow-site/pulls/) * [Contribution Guidelines - How to contribute patches](https://arrow.apache.org/docs/developers/contributing.html#how-to-contribute-patches)
[GitHub] [arrow-site] jduo opened a new pull request, #299: MINOR: [Website] Reword ADBC announcement
jduo opened a new pull request, #299: URL: https://github.com/apache/arrow-site/pull/299 Reword the ADBC announcement such that Flight SQL is more clearly specified as being database-agnostic rather than vendor-specific.
[GitHub] [arrow-site] alamb commented on pull request #294: [WEBSITE] DataFusion 16.0.0 blog post
alamb commented on PR #294: URL: https://github.com/apache/arrow-site/pull/294#issuecomment-1386065374 I plan to merge this tomorrow unless there are any other comments.
[GitHub] [arrow-site] alamb merged pull request #298: [WEBSITE]: Add Jie Wen / jackwener to commiters list
alamb merged PR #298: URL: https://github.com/apache/arrow-site/pull/298
[GitHub] [arrow-site] alamb merged pull request #297: [WEBSITE]: Add Brent Gardner / avantgardnerio to committers list
alamb merged PR #297: URL: https://github.com/apache/arrow-site/pull/297
[GitHub] [arrow-site] alamb commented on a diff in pull request #298: [WEBSITE]: Add Jie Wen / jackwener to commiters list
alamb commented on code in PR #298: URL: https://github.com/apache/arrow-site/pull/298#discussion_r1070352916 ## _data/committers.yml: ## @@ -288,6 +288,10 @@ role: Committer alias: jiayuliu affiliation: Airbnb Inc. +- name: Jie Wen + role: Committer + alias: jackwener + affiliation: TBD Review Comment: ```suggestion affiliation: SelectDB ```
[GitHub] [arrow-site] alamb commented on a diff in pull request #297: [WEBSITE]: Add Brent Gardner / avantgardnerio to committers list
alamb commented on code in PR #297: URL: https://github.com/apache/arrow-site/pull/297#discussion_r1070348733 ## _data/committers.yml: ## @@ -220,6 +220,10 @@ role: Committer alias: bkamins affiliation: SGH Warsaw School of Economics +- name: Brent Gardner + role: Committer + alias: avantgardnerio + affiliation: TDB Review Comment: ```suggestion affiliation: Space and Time ```
[GitHub] [arrow-site] avantgardnerio commented on pull request #297: [WEBSITE]: Add Brent Gardner / avantgardnerio to committers list
avantgardnerio commented on PR #297: URL: https://github.com/apache/arrow-site/pull/297#issuecomment-1382846564 > would you like your affiliation to be? Space and Time is sponsoring me, so it seems appropriate they get listed.
[GitHub] [arrow-site] github-actions[bot] commented on pull request #298: [WEBSITE]: Add Jie Wen / jackwener to commiters list
github-actions[bot] commented on PR #298: URL: https://github.com/apache/arrow-site/pull/298#issuecomment-1382699238
[GitHub] [arrow-site] alamb opened a new pull request, #298: [WEBSITE]: Add Jie Wen / jackwener to commiters list
alamb opened a new pull request, #298: URL: https://github.com/apache/arrow-site/pull/298 Update https://arrow.apache.org/committers/ Per https://lists.apache.org/thread/o2jtvwz6v027x7k3pgdrsly2pznbrd3k @jackwener what, if anything, would you like your affiliation to be?
[GitHub] [arrow-site] github-actions[bot] commented on pull request #297: [WEBSITE]: Add Brent Gardner / avantgardnerio to committers list
github-actions[bot] commented on PR #297: URL: https://github.com/apache/arrow-site/pull/297#issuecomment-1382699098
[GitHub] [arrow-site] alamb opened a new pull request, #297: [WEBSITE]: Add Brent Gardner / avantgardnerio to committers list
alamb opened a new pull request, #297: URL: https://github.com/apache/arrow-site/pull/297 Update https://arrow.apache.org/committers/ Per https://lists.apache.org/thread/0cqwzhnftbnbbf3x1o209dnkoz5gbqd3 @avantgardnerio what, if anything, would you like your affiliation to be?
[GitHub] [arrow-site] kou merged pull request #295: [Website] Add links to UKV
kou merged PR #295: URL: https://github.com/apache/arrow-site/pull/295
[GitHub] [arrow-site] kou commented on a diff in pull request #295: [Website] Add links to UKV
kou commented on code in PR #295: URL: https://github.com/apache/arrow-site/pull/295#discussion_r1068676015 ## use_cases.md: ## @@ -81,7 +83,8 @@ and [others]({{ site.baseurl }}/powered_by/) also use Arrow similarly. The Arrow project also defines [Flight]({% post_url 2019-09-30-introducing-arrow-flight %}), a client-server RPC framework to build rich services exchanging data according -to application-defined semantics. +to application-defined semantics. Flight RPC is used by [UKV](https://unum.cloud/ukv) +to exchange tables, documents, and graphs, between server application and client SDKs. Review Comment: The Apache Arrow project provides https://arrow.apache.org/powered_by/ as a page for introducing third-party projects, including their use cases. It's mentioned explicitly: > To add yourself to the list, please open a [pull request](https://github.com/apache/arrow-site/edit/master/powered_by.md) adding your organization name, URL, a list of which Arrow components you are using, and a short description of your use case. But using other pages such as https://arrow.apache.org/use_cases/ for this purpose hasn't been discussed explicitly. If you think the Apache Arrow project should use https://arrow.apache.org/use_cases/ for this purpose too, could you start a discussion on the `d...@arrow.apache.org` mailing list? https://arrow.apache.org/community/ > dev@ is for discussions about contributing to the project development ([subscribe](mailto:dev-subscr...@arrow.apache.org?subject=Subscribe), [unsubscribe](mailto:dev-unsubscr...@arrow.apache.org?subject=Unubscribe), [archives](https://lists.apache.org/list.html?d...@arrow.apache.org)) FYI: The Apache Software Foundation provides suggested practices related to this topic: https://www.apache.org/foundation/marks/linking
[GitHub] [arrow-site] lidavidm merged pull request #296: [Website] Add ADBC release post
lidavidm merged PR #296: URL: https://github.com/apache/arrow-site/pull/296
[GitHub] [arrow-site] lidavidm commented on pull request #296: [Website] Add ADBC release post
lidavidm commented on PR #296: URL: https://github.com/apache/arrow-site/pull/296#issuecomment-1380499228 I'll post this later today if there are no objections.
[GitHub] [arrow-site] ashvardanian commented on a diff in pull request #295: [Website] Add links to UKV
ashvardanian commented on code in PR #295: URL: https://github.com/apache/arrow-site/pull/295#discussion_r1068146656 ## use_cases.md: ## @@ -81,7 +83,8 @@ and [others]({{ site.baseurl }}/powered_by/) also use Arrow similarly. The Arrow project also defines [Flight]({% post_url 2019-09-30-introducing-arrow-flight %}), a client-server RPC framework to build rich services exchanging data according -to application-defined semantics. +to application-defined semantics. Flight RPC is used by [UKV](https://unum.cloud/ukv) +to exchange tables, documents, and graphs, between server application and client SDKs. Review Comment: Reverted all the changes in `use_cases.md`. Apache Spark, Google BigQuery, TensorFlow, and AWS Athena were all mentioned in that paragraph, so I thought it might be the right place to mention UKV as well. We rely on Arrow representations for the same purpose but with a much broader scope than any of the mentioned projects. Maybe we can add the reference another time. Thank you!
[GitHub] [arrow-site] kou commented on a diff in pull request #295: [Website] Add links to UKV
kou commented on code in PR #295: URL: https://github.com/apache/arrow-site/pull/295#discussion_r106832 ## use_cases.md: ## @@ -81,7 +83,8 @@ and [others]({{ site.baseurl }}/powered_by/) also use Arrow similarly. The Arrow project also defines [Flight]({% post_url 2019-09-30-introducing-arrow-flight %}), a client-server RPC framework to build rich services exchanging data according -to application-defined semantics. +to application-defined semantics. Flight RPC is used by [UKV](https://unum.cloud/ukv) +to exchange tables, documents, and graphs, between server application and client SDKs. Review Comment: I meant reverting all the changes in `use_cases.md`.
[GitHub] [arrow-site] andygrove commented on a diff in pull request #294: [WEBSITE] DataFusion 16.0.0 blog post
andygrove commented on code in PR #294: URL: https://github.com/apache/arrow-site/pull/294#discussion_r1067412821 ## _posts/2023-01-07-datafusion-16.0.0.md: ## @@ -0,0 +1,289 @@ +--- +layout: post +title: "Apache Arrow DataFusion 16.0.0 Project Update" +date: "2023-01-07 00:00:00" +author: pmc +categories: [release] +--- + + +# Introduction + +[DataFusion](https://arrow.apache.org/datafusion/) is an extensible +query execution framework, written in [Rust](https://www.rust-lang.org/), +that uses [Apache Arrow](https://arrow.apache.org) as its +in-memory format. It is targeted primarily at developers creating data +intensive analytics, and offers mature +[SQL support](https://arrow.apache.org/datafusion/user-guide/sql/index.html), +a DataFrame API, and many extension points. + +DataFusion based systems perform very well on performance +benchmarks, especially considering they operate on data in parquet +files directly rather than first loading into a specialized format. +Some recent highlights include [clickbench](https://benchmark.clickhouse.com/) +and the [Cloudfuse.io standalone query engines](https://www.cloudfuse.io/dashboards/standalone-engines) page. + +DataFusion is part of a longer term trend, articulated clearly by [Andy Pavlo](http://www.cs.cmu.edu/~pavlo/) in his +[2022 Databases Retrospective](https://ottertune.com/blog/2022-databases-retrospective/). +Database frameworks are proliferating and it is likely that all OLAP DBMSs and other many data heavy applications such as machine learning, will require a vectorized, highly performant query +engine in the next 5 years to remain relevant. +The only practical way to make such technology so widely available +without many millions of dollars of investment is +though open source engine such as DataFusion or [Velox](https://github.com/facebookincubator/velox). + +The rest of this post describes the improvements made to DataFusion +over the last three months and some hints if where we are heading. 
+ +## Community Growth + +The three months since [our last update](https://arrow.apache.org/blog/2022/10/25/datafusion-13.0.0/) again saw significant growth in the DataFusion. +TODO quantify the growth -- e.g. XXX new contributors to the project and regularly merge YYY PRs a day. + +Growth of new systems based on as the engine in [many open source and commercial projects](https://github.com/apache/arrow-datafusion#known-uses) and was one of the early open source projects to provide this capability. + +Several new databases built on datafusion (synnada.ai, greptimedb, probably others) +GA of InfluxDB IOx + + +The DataFusion 16.0.0 release consists of 520 PRs from 70 distinct contributors. This does not count all the work that goes into our dependencies such as [arrow](https://crates.io/crates/arrow), [parquet](https://crates.io/crates/parquet), and [object_store](https://crates.io/crates/object_store), that much of the same community helps nurture. + + + +## Performance + +Performance and efficiency are core value propositions for +DataFusion. While there is still a performance gap between DataFusion best of +breed tightly, integrated systems such as [DuckDB](https://duckdb.org) +and [Polars](https://www.pola.rs/)https://www.pola.rs/), DataFusion is +closing the gap quickly. 
Performance highlights from the last three +months: + +* XX% Faster Sorting and Merging using the new [Row Format](https://arrow.apache.org/blog/2022/11/07/multi-column-sorts-in-arrow-rust-part-1/) +* [Advanced predicate pushdown](https://arrow.apache.org/blog/2022/12/26/querying-parquet-with-millisecond-latency/), directly on parquet, optionally directly from object storage, enabling sub millisecond filtering, directly from object storage +* Improved `IN` expressions significantly faster Simplify InListExpr ~20-70% Faster ([#4057]) +* Sort and partition aware optimizations such as #3969 and #4691, skipping potentially expensive operations +* Basic filter selectivity analysis (#3868) + + +In the coming few months, we plan work on: +* Improved grouping performance (TODO link) +* bloom filtering +* investigate RLE (Run End Encoding support) (todo Arrow link) +* Enable predicate pushdown by default for all cases +* OTHERS? + +## Runtime Resource Limits + +Initially, DataFusion could potentially use unbounded amounts of memory for certain queries that included Sorts, Grouping or Joins. + +In version 16.0.0, it is possible to limit DataFusion's memory usage for Sorting and Grouping. We are looking for help adding similar limiting for Joins as well as expanding our algorithms to spill to secondary storage, if available. See #3941 fore more detail. + + +## SQL Window Function +[SQL window functions](https://en.wikipedia.org/wiki/Window_function_(SQL)) are useful for a variety of analysis and DataFusion's implementation is close to complete now. + +- Custom window frames such as `... OVER (ORDER BY ... RANGE BETWEEN 0.2 PRECEDING AND 0.2
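The window-frame syntax quoted in the draft (`... OVER (ORDER BY ... RANGE BETWEEN ... )`) is standard SQL, so it can be tried with any engine that implements it. A small sketch using Python's built-in sqlite3 (not DataFusion) of a running sum over a `ROWS BETWEEN` frame, analogous to the frames discussed here:

```python
import sqlite3

# Illustrative only: sqlite3 stands in for a SQL engine with window-function
# support; DataFusion accepts the same OVER (... ROWS/RANGE BETWEEN ...) syntax.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (grp TEXT, val REAL)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [("a", 1.0), ("a", 2.0), ("a", 3.0), ("b", 10.0)],
)

# Running sum over a two-row sliding frame within each partition.
rows = conn.execute("""
    SELECT grp, val,
           SUM(val) OVER (PARTITION BY grp ORDER BY val
                          ROWS BETWEEN 1 PRECEDING AND CURRENT ROW) AS running
    FROM t
    ORDER BY grp, val
""").fetchall()
# rows -> [('a', 1.0, 1.0), ('a', 2.0, 3.0), ('a', 3.0, 5.0), ('b', 10.0, 10.0)]
```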
[GitHub] [arrow-site] andygrove commented on a diff in pull request #294: [WEBSITE] DataFusion 16.0.0 blog post
andygrove commented on code in PR #294: URL: https://github.com/apache/arrow-site/pull/294#discussion_r1067412293 ## _posts/2023-01-07-datafusion-16.0.0.md
[GitHub] [arrow-site] andygrove commented on a diff in pull request #294: [WEBSITE] DataFusion 16.0.0 blog post
andygrove commented on code in PR #294: URL: https://github.com/apache/arrow-site/pull/294#discussion_r1067410437 ## _posts/2023-01-07-datafusion-16.0.0.md
[GitHub] [arrow-site] ashvardanian commented on a diff in pull request #295: [Website] Add links to UKV
ashvardanian commented on code in PR #295: URL: https://github.com/apache/arrow-site/pull/295#discussion_r1066841543 ## use_cases.md: ## @@ -81,7 +83,8 @@ and [others]({{ site.baseurl }}/powered_by/) also use Arrow similarly. The Arrow project also defines [Flight]({% post_url 2019-09-30-introducing-arrow-flight %}), a client-server RPC framework to build rich services exchanging data according -to application-defined semantics. +to application-defined semantics. Flight RPC is used by [UKV](https://unum.cloud/ukv) +to exchange tables, documents, and graphs, between server application and client SDKs. Review Comment: Do you mean removing the duplicate link, or the whole contents of the line?
[GitHub] [arrow-site] alamb commented on a diff in pull request #294: [WEBSITE] DataFusion 16.0.0 blog post
alamb commented on code in PR #294: URL: https://github.com/apache/arrow-site/pull/294#discussion_r1066423131 ## _posts/2023-01-07-datafusion-16.0.0.md: ## @@ -0,0 +1,308 @@ +--- +layout: post +title: "Apache Arrow DataFusion 16.0.0 Project Update" +date: "2023-01-07 00:00:00" +author: pmc +categories: [release] +--- + + +# Introduction + +[DataFusion](https://arrow.apache.org/datafusion/) is an extensible +query execution framework, written in [Rust](https://www.rust-lang.org/), +that uses [Apache Arrow](https://arrow.apache.org) as its +in-memory format. It is targeted primarily at developers creating data +intensive analytics, and offers mature +[SQL support](https://arrow.apache.org/datafusion/user-guide/sql/index.html), +a DataFrame API, and many extension points. + +Systems based on DataFusion perform very well in benchmarks, +especially considering they operate directly on parquet files rather +than first loading into a specialized format. Some recent highlights +include [clickbench](https://benchmark.clickhouse.com/) and the +[Cloudfuse.io standalone query +engines](https://www.cloudfuse.io/dashboards/standalone-engines) page. + +DataFusion is also part of a longer term trend, articulated clearly by +[Andy Pavlo](http://www.cs.cmu.edu/~pavlo/) in his [2022 Databases +Retrospective](https://ottertune.com/blog/2022-databases-retrospective/). +Database frameworks are proliferating and it is likely that all OLAP +DBMSs and other data heavy applications, such as machine learning, +will **require** a vectorized, highly performant query engine in the next +5 years to remain relevant. The only practical way to make such +technology so widely available without many millions of dollars of +investment is though open source engine such as DataFusion or +[Velox](https://github.com/facebookincubator/velox). + +The rest of this post describes the improvements made to DataFusion +over the last three months and some hints of where we are heading. 
+ + +## Community Growth + +We again saw significant growth in the DataFusion community since [our last update](https://arrow.apache.org/blog/2022/10/25/datafusion-13.0.0/). There are some interesting metrics on [OSSRank](https://ossrank.com/p/1573-apache-arrow-datafusion). + +The DataFusion 16.0.0 release consists of 524 PRs from 70 distinct contributors, not including all the work that goes into dependencies such as [arrow](https://crates.io/crates/arrow), [parquet](https://crates.io/crates/parquet), and [object_store](https://crates.io/crates/object_store), which much of the same community helps support. Thank you all for your help! + + +Several [new systems based on DataFusion](https://github.com/apache/arrow-datafusion#known-uses) were recently added: + +* [Greptime DB](https://github.com/GreptimeTeam/greptimedb) +* [Synnada](https://synnada.ai/) +* [PRQL](https://github.com/PRQL/prql-query) +* [Parseable](https://github.com/parseablehq/parseable) +* [SeaFowl](https://github.com/splitgraph/seafowl) + + +## Performance + +Performance and efficiency are core values for +DataFusion. While there is still a gap between DataFusion and best-of-breed, +tightly integrated systems such as [DuckDB](https://duckdb.org) +and [Polars](https://www.pola.rs/), DataFusion is +closing the gap quickly. Performance highlights from the last three +months: + +* Up to 30% faster sorting and merging using the new [Row Format](https://arrow.apache.org/blog/2022/11/07/multi-column-sorts-in-arrow-rust-part-1/) +* [Advanced predicate pushdown](https://arrow.apache.org/blog/2022/12/26/querying-parquet-with-millisecond-latency/), directly on parquet, directly from object storage, enabling sub-millisecond filtering. 
+* Up to `70%` faster `IN` expression evaluation ([#4057]) +* Sort and partition aware optimizations ([#3969] and [#4691]) +* Filter selectivity analysis ([#3868]) + +## Runtime Resource Limits + +Previously, DataFusion could potentially use unbounded amounts of memory for certain queries that included Sorts, Grouping, or Joins. + +In version 16.0.0, it is possible to limit DataFusion's memory usage for Sorting and Grouping. We are looking for help adding similar limiting for Joins as well as expanding our algorithms to optionally spill to secondary storage. See [#3941] for more detail. + + +## SQL Window Functions + +[SQL Window Functions](https://en.wikipedia.org/wiki/Window_function_(SQL)) are useful for a variety of analyses, and DataFusion's support for them expanded significantly: + +- Custom window frames such as `... OVER (ORDER BY ... RANGE BETWEEN 0.2 PRECEDING AND 0.2 FOLLOWING)` +- Unbounded window frames such as `... OVER (ORDER BY ... RANGE UNBOUNDED PRECEDING)` +- Support for the `NTILE` window function ([#4676]) +- Support for `GROUPS` mode ([#4155]) + + +# Improved Joins + +Joins are often the most complicated operations to handle well in +analytics systems, and DataFusion 16.0.0 offers significant improvements +such
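The window-frame features quoted in the draft above (custom `ROWS` frames, `NTILE`) follow standard SQL semantics, so they can be illustrated with any window-capable engine. The sketch below is a stand-in using Python's built-in `sqlite3` module (SQLite ≥ 3.25 also implements window functions), not DataFusion itself; the table, column names, and values are invented for the example.

```python
import sqlite3

# Illustrative stand-in: standard SQL window frames demonstrated via the
# stdlib sqlite3 module (requires SQLite >= 3.25). Table and data are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (grp TEXT, val INTEGER)")
conn.executemany("INSERT INTO t VALUES (?, ?)",
                 [("a", 1), ("a", 2), ("a", 4), ("b", 8), ("b", 16)])

rows = conn.execute("""
    SELECT grp, val,
           -- custom frame: sum over the previous row and the current row
           SUM(val) OVER (PARTITION BY grp ORDER BY val
                          ROWS BETWEEN 1 PRECEDING AND CURRENT ROW) AS running,
           -- NTILE splits the ordered rows into two roughly equal buckets
           NTILE(2) OVER (ORDER BY val) AS bucket
    FROM t ORDER BY grp, val
""").fetchall()
for row in rows:
    print(row)
```

With five rows, `NTILE(2)` places the first three values in bucket 1 and the last two in bucket 2, while the `ROWS` frame produces a two-row running sum within each partition.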
[GitHub] [arrow-site] lidavidm commented on pull request #296: [Website] Add ADBC release post
lidavidm commented on PR #296: URL: https://github.com/apache/arrow-site/pull/296#issuecomment-1377979146 Updated, thanks Ian! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow-site] ozankabak commented on a diff in pull request #294: [WEBSITE] DataFusion 16.0.0 blog post
ozankabak commented on code in PR #294: URL: https://github.com/apache/arrow-site/pull/294#discussion_r1066390458 ## _posts/2023-01-07-datafusion-16.0.0.md:
[GitHub] [arrow-site] alamb commented on pull request #294: [WEBSITE] DataFusion 16.0.0 blog post
alamb commented on PR #294: URL: https://github.com/apache/arrow-site/pull/294#issuecomment-1377924175 Ok I think this one is now ready for some more review -- it is plausibly ready to publish
[GitHub] [arrow-site] alamb commented on a diff in pull request #294: [WEBSITE] DataFusion 16.0.0 blog post (WIP)
alamb commented on code in PR #294: URL: https://github.com/apache/arrow-site/pull/294#discussion_r1066356813 ## _posts/2023-01-07-datafusion-16.0.0.md: ## @@ -0,0 +1,289 @@ +--- +layout: post +title: "Apache Arrow DataFusion 16.0.0 Project Update" +date: "2023-01-07 00:00:00" +author: pmc +categories: [release] +--- + + +# Introduction + +[DataFusion](https://arrow.apache.org/datafusion/) is an extensible +query execution framework, written in [Rust](https://www.rust-lang.org/), +that uses [Apache Arrow](https://arrow.apache.org) as its +in-memory format. It is targeted primarily at developers creating data +intensive analytics, and offers mature +[SQL support](https://arrow.apache.org/datafusion/user-guide/sql/index.html), +a DataFrame API, and many extension points. + +DataFusion-based systems perform very well on performance +benchmarks, especially considering they operate on data in parquet +files directly rather than first loading into a specialized format. +Some recent highlights include [clickbench](https://benchmark.clickhouse.com/) +and the [Cloudfuse.io standalone query engines](https://www.cloudfuse.io/dashboards/standalone-engines) page. + +DataFusion is part of a longer term trend, articulated clearly by [Andy Pavlo](http://www.cs.cmu.edu/~pavlo/) in his +[2022 Databases Retrospective](https://ottertune.com/blog/2022-databases-retrospective/). +Database frameworks are proliferating and it is likely that all OLAP DBMSs and many other data-heavy applications, such as machine learning, will require a vectorized, highly performant query +engine in the next 5 years to remain relevant. +The only practical way to make such technology so widely available +without many millions of dollars of investment is +through open source engines such as DataFusion or [Velox](https://github.com/facebookincubator/velox). + +The rest of this post describes the improvements made to DataFusion +over the last three months and some hints of where we are heading. 
+ +## Community Growth + +The three months since [our last update](https://arrow.apache.org/blog/2022/10/25/datafusion-13.0.0/) again saw significant growth in the DataFusion community. +TODO quantify the growth -- e.g. XXX new contributors to the project and regularly merge YYY PRs a day. + +DataFusion is used as the engine in [many open source and commercial projects](https://github.com/apache/arrow-datafusion#known-uses), and was one of the early open source projects to provide this capability. + +Several new databases built on datafusion (synnada.ai, greptimedb, probably others) +GA of InfluxDB IOx + + +The DataFusion 16.0.0 release consists of 520 PRs from 70 distinct contributors. This does not count all the work that goes into our dependencies such as [arrow](https://crates.io/crates/arrow), [parquet](https://crates.io/crates/parquet), and [object_store](https://crates.io/crates/object_store), which much of the same community helps nurture. + + + +## Performance + +Performance and efficiency are core value propositions for +DataFusion. While there is still a performance gap between DataFusion and best-of-breed, +tightly integrated systems such as [DuckDB](https://duckdb.org) +and [Polars](https://www.pola.rs/), DataFusion is +closing the gap quickly. 
Performance highlights from the last three +months: + +* XX% faster sorting and merging using the new [Row Format](https://arrow.apache.org/blog/2022/11/07/multi-column-sorts-in-arrow-rust-part-1/) +* [Advanced predicate pushdown](https://arrow.apache.org/blog/2022/12/26/querying-parquet-with-millisecond-latency/), directly on parquet, optionally directly from object storage, enabling sub-millisecond filtering +* `IN` expression evaluation made ~20-70% faster by simplifying `InListExpr` ([#4057]) +* Sort and partition aware optimizations such as #3969 and #4691, skipping potentially expensive operations +* Basic filter selectivity analysis (#3868) + + +In the coming few months, we plan to work on: +* Improved grouping performance (TODO link) +* Bloom filtering +* investigate RLE (Run End Encoding support) (todo Arrow link) +* Enable predicate pushdown by default for all cases +* OTHERS? + +## Runtime Resource Limits + +Initially, DataFusion could potentially use unbounded amounts of memory for certain queries that included Sorts, Grouping, or Joins. + +In version 16.0.0, it is possible to limit DataFusion's memory usage for Sorting and Grouping. We are looking for help adding similar limiting for Joins as well as expanding our algorithms to spill to secondary storage, if available. See #3941 for more detail. + + +## SQL Window Functions +[SQL window functions](https://en.wikipedia.org/wiki/Window_function_(SQL)) are useful for a variety of analyses and DataFusion's implementation is close to complete now. + +- Custom window frames such as `... OVER (ORDER BY ... RANGE BETWEEN 0.2 PRECEDING AND 0.2
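The "spill to secondary storage" idea discussed in the Runtime Resource Limits section above can be sketched concretely as an external merge sort. This is a minimal, hedged illustration of the concept, not DataFusion's actual implementation; `external_sort` and `budget` are invented names for the example.

```python
import heapq
import tempfile

def external_sort(values, budget=1000):
    """Sort integers while holding at most `budget` of them in memory:
    when the budget is hit, spill a sorted run to a temp file, then
    stream-merge all runs (one value per run resident at a time)."""
    runs, buf = [], []

    def spill(chunk):
        f = tempfile.TemporaryFile(mode="w+")        # the "secondary storage"
        f.writelines(f"{v}\n" for v in sorted(chunk))
        f.seek(0)
        runs.append(f)

    for v in values:
        buf.append(v)
        if len(buf) >= budget:       # memory limit reached: spill a sorted run
            spill(buf)
            buf = []
    if buf:
        spill(buf)
    # Lazy k-way merge of the sorted runs; memory use is O(number of runs).
    return [int(line) for line in heapq.merge(*runs, key=int)]

print(external_sort([9, 3, 7, 1, 8, 2, 6, 5, 4], budget=3))
# → [1, 2, 3, 4, 5, 6, 7, 8, 9]
```

The trade-off is the classic one: peak memory stays bounded by the budget plus one buffered value per run, at the cost of extra I/O for each spilled run.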
[GitHub] [arrow-site] ianmcook commented on a diff in pull request #296: [Website] Add ADBC release post
ianmcook commented on code in PR #296: URL: https://github.com/apache/arrow-site/pull/296#discussion_r1066309697 ## _posts/2023-01-13-adbc-0.1.0-release.md: ## @@ -0,0 +1,79 @@ +--- +layout: post +title: "Apache Arrow ADBC 0.1.0 (Libraries) Release" +date: "2023-01-13 00:00:00" +author: pmc +categories: [release] +--- + + +The Apache Arrow team is pleased to announce the 0.1.0 release of the +Apache Arrow ADBC libraries. This includes [**63 resolved +issues**][1] from [**8 distinct contributors**][2]. + +This is a release of the **libraries**, which are at version 0.1.0. +The **API specification** is versioned separately and is at version +1.0.0. Review Comment: This might be a good place to add a link to the Introducing ADBC blog post for readers interested in learning more about the specification
[GitHub] [arrow-site] ianmcook commented on a diff in pull request #296: [Website] Add ADBC release post
ianmcook commented on code in PR #296: URL: https://github.com/apache/arrow-site/pull/296#discussion_r1066308344 ## _posts/2023-01-13-adbc-0.1.0-release.md: ## @@ -0,0 +1,79 @@ +--- +layout: post +title: "Apache Arrow ADBC 0.1.0 (Libraries) Release" +date: "2023-01-13 00:00:00" +author: pmc +categories: [release] +--- + + +The Apache Arrow team is pleased to announce the 0.1.0 release of the +Apache Arrow ADBC libraries. This includes [**63 resolved +issues**][1] from [**8 distinct contributors**][2]. + +This is a release of the **libraries**, which are at version 0.1.0. +The **API specification** is versioned separately and is at version +1.0.0. + +The release notes below are not exhaustive and only expose selected highlights +of the release. Many other bugfixes and improvements have been made: we refer +you to the [complete changelog][3]. + +## Release Highlights + +This initial release includes the following: + +- Driver manager libraries for C/C++, Go, Java, Python, and Ruby. +- ADBC drivers for SQLite and PostgreSQL, available in C/C++, Go, Python, and Ruby. +- ADBC drivers for Arrow Flight SQL and JDBC, available in Java. + +## Contributors + +``` +$ git shortlog -sn apache-arrow-adbc-0.1.0 Review Comment: Maybe use one of [these tricks](https://stackoverflow.com/questions/6889830/equivalence-of-git-log-exclude-author) to exclude dependabot from the output
[GitHub] [arrow-site] github-actions[bot] commented on pull request #296: [Website] Add ADBC release post
github-actions[bot] commented on PR #296: URL: https://github.com/apache/arrow-site/pull/296#issuecomment-1377705589 Thanks for opening a pull request! Could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW Then could you also rename pull request title in the following format? ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY} See also: * [Other pull requests](https://github.com/apache/arrow-site/pulls/) * [Contribution Guidelines - How to contribute patches](https://arrow.apache.org/docs/developers/contributing.html#how-to-contribute-patches)
[GitHub] [arrow-site] andygrove commented on pull request #294: [WEBSITE] DataFusion 16.0.0 blog post (WIP)
andygrove commented on PR #294: URL: https://github.com/apache/arrow-site/pull/294#issuecomment-1377545928 I will start contributing to this tomorrow
[GitHub] [arrow-site] alamb commented on a diff in pull request #294: [WEBSITE] DataFusion 16.0.0 blog post (WIP)
alamb commented on code in PR #294: URL: https://github.com/apache/arrow-site/pull/294#discussion_r1065816100 ## _posts/2023-01-07-datafusion-16.0.0.md: Review Comment: Thanks -- added in ffe2e0af210. Still needs polish
[GitHub] [arrow-site] liukun4515 commented on a diff in pull request #294: [WEBSITE] DataFusion 16.0.0 blog post (WIP)
liukun4515 commented on code in PR #294: URL: https://github.com/apache/arrow-site/pull/294#discussion_r1065740157 ## _posts/2023-01-07-datafusion-16.0.0.md:
[GitHub] [arrow-site] kou commented on a diff in pull request #295: [Website] Add links to UKV
kou commented on code in PR #295: URL: https://github.com/apache/arrow-site/pull/295#discussion_r1064587271 ## powered_by.md: ## @@ -184,6 +184,14 @@ short description of your use case. Database Connectivity (ODBC) interface. It provides the ability to return Arrow Tables and RecordBatches in addition to the Python Database API Specification 2.0. +* **[UKV][45]:** Open NoSQL binary database interface, with support for + LevelDB, RocksDB, UDisk, and in-memory Key-Value Stores. It extends + their functionality to support Document Collections, Graphs, and Vector + Search, similar to RedisJSON, RedisGraph, and RediSearch, and brings + familiar structured bindings on top, mimicking tools like Pandas and NetworkX. Review Comment: ```suggestion familiar structured bindings on top, mimicking tools like pandas and NetworkX. ``` ## use_cases.md: ## @@ -64,7 +64,9 @@ The Arrow format also defines a [C data interface]({% post_url 2020-05-04-introd which allows zero-copy data sharing inside a single process without any build-time or link-time dependency requirements. This allows, for example, [R users to access `pyarrow`-based projects]({{ site.baseurl }}/docs/r/articles/python.html) -using the `reticulate` package. +using the `reticulate` package. Similarly, it empowers [UKV](https://unum.cloud/ukv) +to forward persisted data from RocksDB, LevelDB, and UDisk, into Python +runtime and `pyarrow` without copies. Review Comment: Could you revert this? It seems that we only cover use cases of the Apache Arrow project itself here. ## use_cases.md: ## @@ -81,7 +83,8 @@ and [others]({{ site.baseurl }}/powered_by/) also use Arrow similarly. The Arrow project also defines [Flight]({% post_url 2019-09-30-introducing-arrow-flight %}), a client-server RPC framework to build rich services exchanging data according -to application-defined semantics. 
+Flight RPC is used by [UKV](https://unum.cloud/ukv) +to exchange tables, documents, and graphs between the server application and client SDKs. Review Comment: Could you revert this? We refer to the `powered_by/` page in the above paragraph. UKV is introduced on that page.
[GitHub] [arrow-site] github-actions[bot] commented on pull request #295: [Website] Add links to UKV
github-actions[bot] commented on PR #295: URL: https://github.com/apache/arrow-site/pull/295#issuecomment-1375332242 Thanks for opening a pull request! Could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW Then could you also rename pull request title in the following format? ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY} See also: * [Other pull requests](https://github.com/apache/arrow-site/pulls/) * [Contribution Guidelines - How to contribute patches](https://arrow.apache.org/docs/developers/contributing.html#how-to-contribute-patches)
[GitHub] [arrow-site] ashvardanian opened a new pull request, #295: [Website] Add links to UKV
ashvardanian opened a new pull request, #295: URL: https://github.com/apache/arrow-site/pull/295 We have been integrating Apache Arrow across all of our projects during 2022 and are hoping to share them with the broader community.
[GitHub] [arrow-site] ozankabak commented on a diff in pull request #294: [WEBSITE] DataFusion 16.0.0 blog post (WIP)
ozankabak commented on code in PR #294: URL: https://github.com/apache/arrow-site/pull/294#discussion_r1064195213 ## _posts/2023-01-07-datafusion-16.0.0.md: ## @@ -0,0 +1,289 @@ +--- +layout: post +title: "Apache Arrow DataFusion 16.0.0 Project Update" +date: "2023-01-07 00:00:00" +author: pmc +categories: [release] +--- + + +# Introduction + +[DataFusion](https://arrow.apache.org/datafusion/) is an extensible +query execution framework, written in [Rust](https://www.rust-lang.org/), +that uses [Apache Arrow](https://arrow.apache.org) as its +in-memory format. It is targeted primarily at developers creating data +intensive analytics, and offers mature +[SQL support](https://arrow.apache.org/datafusion/user-guide/sql/index.html), +a DataFrame API, and many extension points. + +DataFusion based systems perform very well on performance +benchmarks, especially considering they operate on data in parquet +files directly rather than first loading into a specialized format. +Some recent highlights include [clickbench](https://benchmark.clickhouse.com/) +and the [Cloudfuse.io standalone query engines](https://www.cloudfuse.io/dashboards/standalone-engines) page. + +DataFusion is part of a longer term trend, articulated clearly by [Andy Pavlo](http://www.cs.cmu.edu/~pavlo/) in his +[2022 Databases Retrospective](https://ottertune.com/blog/2022-databases-retrospective/). +Database frameworks are proliferating and it is likely that all OLAP DBMSs and other many data heavy applications such as machine learning, will require a vectorized, highly performant query +engine in the next 5 years to remain relevant. +The only practical way to make such technology so widely available +without many millions of dollars of investment is +though open source engine such as DataFusion or [Velox](https://github.com/facebookincubator/velox). + +The rest of this post describes the improvements made to DataFusion +over the last three months and some hints of where we are heading. 
+ +## Community Growth + +The three months since [our last update](https://arrow.apache.org/blog/2022/10/25/datafusion-13.0.0/) again saw significant growth in the DataFusion community. +TODO quantify the growth -- e.g. XXX new contributors to the project and regularly merge YYY PRs a day. + +DataFusion is used as the engine in [many open source and commercial projects](https://github.com/apache/arrow-datafusion#known-uses) and was one of the early open source projects to provide this capability. + +Several new databases built on datafusion (synnada.ai, greptimedb, probably others) Review Comment: Here is what I am aware of: Databases: greptimedb (new), IOx (GA) Data platform: Synnada (new) Use case: Backend for PRQL (relatively new?) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow-site] ozankabak commented on a diff in pull request #294: [WEBSITE] DataFusion 16.0.0 blog post (WIP)
ozankabak commented on code in PR #294: URL: https://github.com/apache/arrow-site/pull/294#discussion_r1064194546 ## _posts/2023-01-07-datafusion-16.0.0.md: ## @@ -0,0 +1,289 @@ +--- +layout: post +title: "Apache Arrow DataFusion 16.0.0 Project Update" +date: "2023-01-07 00:00:00" +author: pmc +categories: [release] +--- + + +# Introduction + +[DataFusion](https://arrow.apache.org/datafusion/) is an extensible +query execution framework, written in [Rust](https://www.rust-lang.org/), +that uses [Apache Arrow](https://arrow.apache.org) as its +in-memory format. It is targeted primarily at developers creating data +intensive analytics, and offers mature +[SQL support](https://arrow.apache.org/datafusion/user-guide/sql/index.html), +a DataFrame API, and many extension points. + +DataFusion based systems perform very well on performance +benchmarks, especially considering they operate on data in parquet +files directly rather than first loading into a specialized format. +Some recent highlights include [clickbench](https://benchmark.clickhouse.com/) +and the [Cloudfuse.io standalone query engines](https://www.cloudfuse.io/dashboards/standalone-engines) page. + +DataFusion is part of a longer term trend, articulated clearly by [Andy Pavlo](http://www.cs.cmu.edu/~pavlo/) in his +[2022 Databases Retrospective](https://ottertune.com/blog/2022-databases-retrospective/). +Database frameworks are proliferating and it is likely that all OLAP DBMSs and other many data heavy applications such as machine learning, will require a vectorized, highly performant query +engine in the next 5 years to remain relevant. +The only practical way to make such technology so widely available +without many millions of dollars of investment is +though open source engine such as DataFusion or [Velox](https://github.com/facebookincubator/velox). + +The rest of this post describes the improvements made to DataFusion +over the last three months and some hints if where we are heading. 
+ +## Community Growth + +The three months since [our last update](https://arrow.apache.org/blog/2022/10/25/datafusion-13.0.0/) again saw significant growth in the DataFusion community. +TODO quantify the growth -- e.g. XXX new contributors to the project and regularly merge YYY PRs a day. + +DataFusion is used as the engine in [many open source and commercial projects](https://github.com/apache/arrow-datafusion#known-uses) and was one of the early open source projects to provide this capability. + +Several new databases built on datafusion (synnada.ai, greptimedb, probably others) +GA of InfluxDB IOx + + +The DataFusion 16.0.0 release consists of 520 PRs from 70 distinct contributors. This does not count all the work that goes into our dependencies such as [arrow](https://crates.io/crates/arrow), [parquet](https://crates.io/crates/parquet), and [object_store](https://crates.io/crates/object_store), that much of the same community helps nurture. + + + +## Performance + +Performance and efficiency are core value propositions for +DataFusion. While there is still a performance gap between DataFusion and best-of-breed, tightly integrated systems such as [DuckDB](https://duckdb.org) +and [Polars](https://www.pola.rs/), DataFusion is +closing the gap quickly.
Performance highlights from the last three +months: + +* XX% Faster Sorting and Merging using the new [Row Format](https://arrow.apache.org/blog/2022/11/07/multi-column-sorts-in-arrow-rust-part-1/) +* [Advanced predicate pushdown](https://arrow.apache.org/blog/2022/12/26/querying-parquet-with-millisecond-latency/), applied directly on parquet, optionally straight from object storage, enabling sub-millisecond filtering +* Simplified `InListExpr`, making `IN` expressions ~20-70% faster ([#4057]) +* Sort and partition aware optimizations such as #3969 and #4691, skipping potentially expensive operations +* Basic filter selectivity analysis (#3868) + + +In the coming few months, we plan to work on: +* Improved grouping performance (TODO link) +* bloom filtering +* investigate RLE (Run End Encoding support) (todo Arrow link) +* Enable predicate pushdown by default for all cases +* OTHERS? + +## Runtime Resource Limits + +Initially, DataFusion could potentially use unbounded amounts of memory for certain queries that included Sorts, Grouping or Joins. + +In version 16.0.0, it is possible to limit DataFusion's memory usage for Sorting and Grouping. We are looking for help adding similar limiting for Joins as well as expanding our algorithms to spill to secondary storage, if available. See #3941 for more detail. + + +## SQL Window Functions +[SQL window functions](https://en.wikipedia.org/wiki/Window_function_(SQL)) are useful for a variety of analyses and DataFusion's implementation is close to complete now. + +- Custom window frames such as `... OVER (ORDER BY ... RANGE BETWEEN 0.2 PRECEDING AND 0.2
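The custom RANGE window frame quoted above can be tried outside DataFusion in any engine that supports offset frames; here is a minimal, hedged sketch using Python's stdlib sqlite3 as a stand-in for DataFusion's SQL dialect (the `readings` table and its values are made up for illustration):

```python
import sqlite3

# Toy data: timestamped readings. For each row, sum the values of all
# rows whose ts falls within +/- 0.2 of this row's ts.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE readings (ts REAL, val REAL);
    INSERT INTO readings VALUES (0.0, 1.0), (0.1, 2.0), (0.3, 4.0), (0.5, 8.0);
""")

rows = conn.execute("""
    SELECT ts,
           SUM(val) OVER (
               ORDER BY ts
               RANGE BETWEEN 0.2 PRECEDING AND 0.2 FOLLOWING
           ) AS windowed_sum
    FROM readings
    ORDER BY ts
""").fetchall()

for ts, windowed_sum in rows:
    print(ts, windowed_sum)
```

Unlike the default ROWS frame, a RANGE frame with offsets selects peers by the *value* of the ORDER BY expression, which is what makes frames like `0.2 PRECEDING` meaningful for non-integer keys.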
[GitHub] [arrow-site] andygrove commented on a diff in pull request #294: [WEBSITE] DataFusion 16.0.0 blog post (WIP)
andygrove commented on code in PR #294: URL: https://github.com/apache/arrow-site/pull/294#discussion_r1064050185 ## _posts/2023-01-07-datafusion-16.0.0.md: ## @@ -0,0 +1,289 @@ +--- +layout: post +title: "Apache Arrow DataFusion 16.0.0 Project Update" +date: "2023-01-07 00:00:00" +author: pmc +categories: [release] +--- + + +# Introduction + +[DataFusion](https://arrow.apache.org/datafusion/) is an extensible +query execution framework, written in [Rust](https://www.rust-lang.org/), +that uses [Apache Arrow](https://arrow.apache.org) as its +in-memory format. It is targeted primarily at developers creating data +intensive analytics, and offers mature +[SQL support](https://arrow.apache.org/datafusion/user-guide/sql/index.html), +a DataFrame API, and many extension points. + +DataFusion based systems perform very well on performance +benchmarks, especially considering they operate on data in parquet +files directly rather than first loading into a specialized format. +Some recent highlights include [clickbench](https://benchmark.clickhouse.com/) +and the [Cloudfuse.io standalone query engines](https://www.cloudfuse.io/dashboards/standalone-engines) page. + +DataFusion is part of a longer term trend, articulated clearly by [Andy Pavlo](http://www.cs.cmu.edu/~pavlo/) in his +[2022 Databases Retrospective](https://ottertune.com/blog/2022-databases-retrospective/). +Database frameworks are proliferating and it is likely that all OLAP DBMSs and other many data heavy applications such as machine learning, will require a vectorized, highly performant query +engine in the next 5 years to remain relevant. +The only practical way to make such technology so widely available +without many millions of dollars of investment is +though open source engine such as DataFusion or [Velox](https://github.com/facebookincubator/velox). + +The rest of this post describes the improvements made to DataFusion +over the last three months and some hints if where we are heading. 
Review Comment: ```suggestion over the last three months and some hints of where we are heading. ```
[GitHub] [arrow-site] ozankabak commented on a diff in pull request #294: [WEBSITE] DataFusion 16.0.0 blog post (WIP)
ozankabak commented on code in PR #294: URL: https://github.com/apache/arrow-site/pull/294#discussion_r1064040183
[GitHub] [arrow-site] alamb commented on a diff in pull request #294: [WEBSITE] DataFusion 16.0.0 blog post (WIP)
alamb commented on code in PR #294: URL: https://github.com/apache/arrow-site/pull/294#discussion_r1064019922 ## _posts/2023-01-07-datafusion-16.0.0.md: ## @@ -0,0 +1,289 @@ +--- +layout: post +title: "Apache Arrow DataFusion 16.0.0 Project Update" +date: "2023-01-07 00:00:00" +author: pmc +categories: [release] +--- + + +# Introduction + +[DataFusion](https://arrow.apache.org/datafusion/) is an extensible +query execution framework, written in [Rust](https://www.rust-lang.org/), +that uses [Apache Arrow](https://arrow.apache.org) as its +in-memory format. It is targeted primarily at developers creating data +intensive analytics, and offers mature +[SQL support](https://arrow.apache.org/datafusion/user-guide/sql/index.html), +a DataFrame API, and many extension points. + +DataFusion based systems perform very well on performance +benchmarks, especially considering they operate on data in parquet +files directly rather than first loading into a specialized format. +Some recent highlights include [clickbench](https://benchmark.clickhouse.com/) +and the [Cloudfuse.io standalone query engines](https://www.cloudfuse.io/dashboards/standalone-engines) page. + +DataFusion is part of a longer term trend, articulated clearly by [Andy Pavlo](http://www.cs.cmu.edu/~pavlo/) in his +[2022 Databases Retrospective](https://ottertune.com/blog/2022-databases-retrospective/). +Database frameworks are proliferating and it is likely that all OLAP DBMSs and other many data heavy applications such as machine learning, will require a vectorized, highly performant query +engine in the next 5 years to remain relevant. +The only practical way to make such technology so widely available +without many millions of dollars of investment is +though open source engine such as DataFusion or [Velox](https://github.com/facebookincubator/velox). + +The rest of this post describes the improvements made to DataFusion +over the last three months and some hints if where we are heading. 
+ +## Community Growth + +The three months since [our last update](https://arrow.apache.org/blog/2022/10/25/datafusion-13.0.0/) again saw significant growth in the DataFusion. +TODO quantify the growth -- e.g. XXX new contributors to the project and regularly merge YYY PRs a day. + +Growth of new systems based on as the engine in [many open source and commercial projects](https://github.com/apache/arrow-datafusion#known-uses) and was one of the early open source projects to provide this capability. + +Several new databases built on datafusion (synnada.ai, greptimedb, probably others) +GA of InfluxDB IOx + + +The DataFusion 16.0.0 release consists of 520 PRs from 70 distinct contributors. This does not count all the work that goes into our dependencies such as [arrow](https://crates.io/crates/arrow), [parquet](https://crates.io/crates/parquet), and [object_store](https://crates.io/crates/object_store), that much of the same community helps nurture. + + + +## Performance + +Performance and efficiency are core value propositions for +DataFusion. While there is still a performance gap between DataFusion best of +breed tightly, integrated systems such as [DuckDB](https://duckdb.org) +and [Polars](https://www.pola.rs/)https://www.pola.rs/), DataFusion is +closing the gap quickly. Performance highlights from the last three +months: + +* XX% Faster Sorting and Merging using the new [Row Format](https://arrow.apache.org/blog/2022/11/07/multi-column-sorts-in-arrow-rust-part-1/) Review Comment: @tustvold do you have any suggestions about what numbers to use here?
[GitHub] [arrow-site] alamb commented on pull request #294: [WEBSITE] DataFusion 16.0.0 blog post (WIP)
alamb commented on PR #294: URL: https://github.com/apache/arrow-site/pull/294#issuecomment-1374526999 It is a work in progress, but I think it is now coherent enough to gather some more input
[GitHub] [arrow-site] github-actions[bot] commented on pull request #294: [WEBSITE] DataFusion 16.0.0 blog post (WIP)
github-actions[bot] commented on PR #294: URL: https://github.com/apache/arrow-site/pull/294#issuecomment-1374526901 Thanks for opening a pull request! Could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW Then could you also rename pull request title in the following format? ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY} See also: * [Other pull requests](https://github.com/apache/arrow-site/pulls/) * [Contribution Guidelines - How to contribute patches](https://arrow.apache.org/docs/developers/contributing.html#how-to-contribute-patches)
[GitHub] [arrow-site] alamb opened a new pull request, #294: [WEBSITE] DataFusion 16.0.0 blog post
alamb opened a new pull request, #294: URL: https://github.com/apache/arrow-site/pull/294 Closes https://github.com/apache/arrow-datafusion/issues/4804 This blog post highlights some improvements and features in DataFusion over the last 3 releases
[GitHub] [arrow-site] lidavidm merged pull request #293: MINOR: Fix typo in ADBC post
lidavidm merged PR #293: URL: https://github.com/apache/arrow-site/pull/293
[GitHub] [arrow-site] github-actions[bot] commented on pull request #293: MINOR: Fix typo in ADBC post
github-actions[bot] commented on PR #293: URL: https://github.com/apache/arrow-site/pull/293#issuecomment-1372769751 Thanks for opening a pull request! Could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW Then could you also rename pull request title in the following format? ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY} See also: * [Other pull requests](https://github.com/apache/arrow-site/pulls/) * [Contribution Guidelines - How to contribute patches](https://arrow.apache.org/docs/developers/contributing.html#how-to-contribute-patches)
[GitHub] [arrow-site] lidavidm merged pull request #248: [Website] Add ADBC blog post
lidavidm merged PR #248: URL: https://github.com/apache/arrow-site/pull/248
[GitHub] [arrow-site] lidavidm commented on pull request #248: [Website] Add ADBC blog post
lidavidm commented on PR #248: URL: https://github.com/apache/arrow-site/pull/248#issuecomment-1372723648 I'll be publishing this in a bit. Thanks to all who reviewed!
[GitHub] [arrow-site] lidavidm commented on pull request #248: [Website] Add ADBC blog post
lidavidm commented on PR #248: URL: https://github.com/apache/arrow-site/pull/248#issuecomment-1369084955 Updated, thanks Ian! I tweaked the diagram too. Updated preview: https://dynamic-jalebi-94dd65.netlify.app/blog/2023/01/04/introducing-arrow-adbc/ (since I had to bump the date forward)
[GitHub] [arrow-site] lidavidm commented on a diff in pull request #248: [Website] Add ADBC blog post
lidavidm commented on code in PR #248: URL: https://github.com/apache/arrow-site/pull/248#discussion_r1060123304 ## img/ADBC.svg: ## @@ -0,0 +1 @@ +&lt;svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:lucid="lucid" width="800" height="600"&gt; Review Comment: I meant something like that, yeah. The database only has to implement one (columnar, Arrow-based) endpoint/protocol but can support Arrow-native and 'traditional' clients.
[GitHub] [arrow-site] ianmcook commented on a diff in pull request #248: [Website] Add ADBC blog post
ianmcook commented on code in PR #248: URL: https://github.com/apache/arrow-site/pull/248#discussion_r1060116272 ## _posts/2022-12-31-arrow-adbc.md: ## @@ -0,0 +1,217 @@ +--- +layout: post +title: "Introducing ADBC: Database Access for Apache Arrow" +date: "2022-12-31 00:00:00" +author: pmc +categories: [application] +--- + + +The Arrow community would like to introduce version 1.0.0 of the [Arrow Database Connectivity (ADBC)][adbc] specification. +ADBC is a columnar, minimal-overhead alternative to JDBC/ODBC for analytical applications. +Or in other words: **ADBC is a single API for getting Arrow data in and out of different databases**. + +## Motivation + +Applications often use API standards like [JDBC][jdbc] and [ODBC][odbc] to work with databases. +That way, they can code to the same API regardless of the underlying database, saving on development time. +Roughly speaking, when an application executes a query with these APIs: + + + + The query execution flow. + + +1. The application submits a SQL query via the JDBC/ODBC API. +2. The query is passed on to the driver. +3. The driver translates the query to a database-specific protocol and sends it to the database. +4. The database executes the query and returns the result set in a database-specific format. +5. The driver translates the result format into the JDBC/ODBC API. +6. The application iterates over the result rows using the JDBC/ODBC API. + +When columnar data comes into play, however, problems arise. +JDBC is a row-oriented API, and while ODBC can support columnar data, the type system and data representation is not a perfect match with Arrow. +So generally, columnar data must be converted to rows between steps 5 and 6, spending resources without performing "useful" work. + +This mismatch is problematic for columnar database systems, such as ClickHouse, Dremio, DuckDB, and Google BigQuery. 
+On the client side, tools such as Apache Spark and pandas would be better off getting columnar data directly, skipping that conversion. +Otherwise, they're leaving performance on the table. +At the same time, that conversion isn't always avoidable. +Row-oriented database systems like PostgreSQL aren't going away, and these clients will still want to consume data from them. + +Developers have a few options: + +- *Just use JDBC/ODBC*. + These standards are here to stay, and it makes sense for databases to support them for applications that want them. + But when both the database and the application are columnar, that means converting data into rows for JDBC/ODBC, only for the client to convert them right back into columns! + Performance suffers, and developers have to spend time implementing the conversions. +- *Use JDBC/ODBC-to-Arrow conversion libraries*. + Libraries like [Turbodbc][turbodbc] and [arrow-jdbc][arrow-jdbc] handle row-to-columnar conversions for clients. + But this doesn't fundamentally solve the problem. + Unnecessary data conversions are still required. +- *Use vendor-specific protocols*. + For some databases, applications can use a database-specific protocol or SDK to directly get Arrow data. + For example, applications could use Dremio via [Arrow Flight SQL][flight-sql]. + But client applications that want to use multiple database vendors would need to integrate with each of them. + (Look at all the [connectors](https://trino.io/docs/current/connector.html) that Trino implements.) + And databases like PostgreSQL don't offer an option supporting Arrow in the first place. + +As is, clients must choose between tedious integration work and leaving performance on the table. We can make this better. + +## Introducing ADBC + +ADBC is an Arrow-based, vendor-neutral API for interacting with databases. +Applications that use ADBC just get Arrow data.
+They don't have to do any conversions themselves, and they don't have to integrate each database's specific SDK. + +Just like JDBC/ODBC, underneath the ADBC API are drivers that translate the API for specific databases. + +* A driver for an Arrow-native database just passes Arrow data through without conversion. +* A driver for a non-Arrow-native database must convert the data to Arrow. + This saves the application from doing that, and the driver can optimize the conversion for its database. + + + + The query execution flow with two different ADBC drivers. + + +1. The application submits a SQL query via the ADBC API. +2. The query is passed on to the ADBC driver. +3. The driver translates the query to a database-specific protocol and sends the query to the database. +4. The database executes the query and returns the result set in a database-specific format, which is ideally Arrow data. +5. If needed: the driver translates the result into Arrow data. +6. The application iterates over batches of Arrow data. + +The application only deals with one API, and only works with Arrow data. + +For example, in Python, the ADBC packages offer
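The wasted round trip the post describes (a columnar result pivoted into rows by a JDBC/ODBC-style driver, then pivoted straight back into columns by a columnar client) can be sketched in a few lines of plain Python. This is a stdlib-only illustration with made-up data and helper names, not ADBC's or JDBC's actual API:

```python
# A columnar result set as a columnar database produces it: one list per column.
columnar = {"id": [1, 2, 3], "score": [0.5, 0.7, 0.9]}

def to_rows(cols):
    # What a row-oriented (JDBC/ODBC-style) driver must do: pivot columns into row tuples.
    return [tuple(vals) for vals in zip(*cols.values())]

def to_columns(names, rows):
    # What a columnar client then does: pivot the rows right back into columns.
    return {name: list(vals) for name, vals in zip(names, zip(*rows))}

rows = to_rows(columnar)                      # driver: columns -> rows
roundtrip = to_columns(list(columnar), rows)  # client: rows -> columns

# Two full conversions later, the client holds exactly the data it started with.
assert roundtrip == columnar
```

An Arrow-native driver under ADBC skips both pivots: the batches produced by the database are the batches the application iterates over.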
[GitHub] [arrow-site] ianmcook commented on a diff in pull request #248: [Website] Add ADBC blog post
ianmcook commented on code in PR #248: URL: https://github.com/apache/arrow-site/pull/248#discussion_r1060114790
## _posts/2022-12-31-arrow-adbc.md
@@ -0,0 +1,217 @@
+---
+layout: post
+title: "Introducing ADBC: Database Access for Apache Arrow"
+date: "2022-12-31 00:00:00"
+author: pmc
+categories: [application]
+---
+
+The Arrow community would like to introduce version 1.0.0 of the [Arrow Database Connectivity (ADBC)][adbc] specification.
+ADBC is a columnar, minimal-overhead alternative to JDBC/ODBC for analytical applications.
+Or in other words: **ADBC is a single API for getting Arrow data in and out of different databases**.
+
+## Motivation
+
+Applications often use API standards like [JDBC][jdbc] and [ODBC][odbc] to work with databases.
+That way, they can code to the same API regardless of the underlying database, saving on development time.
+Roughly speaking, when an application executes a query with these APIs:
+
+  The query execution flow.
+
+1. The application submits a SQL query via the JDBC/ODBC API.
+2. The query is passed on to the driver.
+3. The driver translates the query to a database-specific protocol and sends it to the database.
+4. The database executes the query and returns the result set in a database-specific format.
+5. The driver translates the result format into the JDBC/ODBC API.
+6. The application iterates over the result rows using the JDBC/ODBC API.
+
+When columnar data comes into play, however, problems arise.
+JDBC is a row-oriented API, and while ODBC can support columnar data, the type system and data representation are not a perfect match with Arrow.
+So generally, columnar data must be converted to rows between steps 5 and 6, spending resources without performing "useful" work.
+
+This mismatch is problematic for columnar database systems, such as ClickHouse, Dremio, DuckDB, and Google BigQuery.
+On the client side, tools such as Apache Spark and pandas would be better off getting columnar data directly, skipping that conversion.
+Otherwise, they're leaving performance on the table.
+At the same time, that conversion isn't always avoidable.
+Row-oriented database systems like PostgreSQL aren't going away, and these clients will still want to consume data from them.
+
+Developers have a few options:
+
+- *Just use JDBC/ODBC*.
+  These standards are here to stay, and it makes sense for databases to support them for applications that want them.
+  But when both the database and the application are columnar, that means converting data into rows for JDBC/ODBC, only for the client to convert them right back into columns!
+  Performance suffers, and developers have to spend time implementing the conversions.
+- *Use JDBC/ODBC-to-Arrow conversion libraries*.
+  Libraries like [Turbodbc][turbodbc] and [arrow-jdbc][arrow-jdbc] handle row to columnar conversions for clients.
+  But this doesn't fundamentally solve the problem.
+  Unnecessary data conversions are still required.
+- *Use vendor-specific protocols*.
+  For some databases, applications can use a database-specific protocol or SDK to directly get Arrow data.
+  For example, applications could use Dremio via [Arrow Flight SQL][flight-sql].
+  But client applications that want to use multiple database vendors would need to integrate with each of them.
+  (Look at all the [connectors](https://trino.io/docs/current/connector.html) that Trino implements.)
+  And databases like PostgreSQL don't offer an option supporting Arrow in the first place.
+
+As is, clients must choose between tedious integration work and leaving performance on the table. We can make this better.
+
+## Introducing ADBC
+
+ADBC is an Arrow-based, vendor-neutral API for interacting with databases.
+Applications that use ADBC just get Arrow data.
+They don't have to do any conversions themselves, and they don't have to integrate each database's specific SDK.
+
+Just like JDBC/ODBC, underneath the ADBC API are drivers that translate the API for specific databases.
+
+* A driver for an Arrow-native database just passes Arrow data through without conversion.
+* A driver for a non-Arrow-native database must convert the data to Arrow.
+  This saves the application from doing that, and the driver can optimize the conversion for its database.
+
+  The query execution flow with two different ADBC drivers.
+
+1. The application submits a SQL query via the ADBC API.
+2. The query is passed on to the ADBC driver.
+3. The driver translates the query to a database-specific protocol and sends the query to the database.
+4. The database executes the query and returns the result set in a database-specific format, which is ideally Arrow data.
+5. If needed: the driver translates the result into Arrow data.
+6. The application iterates over batches of Arrow data.
+
+The application only deals with one API, and only works with Arrow data.
+
+For example, in Python, the ADBC packages offer
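The row-versus-columnar handoff described in the quoted post can be sketched with a small stdlib-only toy model (these functions are invented for illustration; they are not the ADBC or JDBC APIs): a row-oriented driver hands back tuples that a columnar client must transpose in steps 5-6, while an Arrow-native path hands over columns directly.

```python
# Toy model of the row-vs-columnar handoff described in the quoted post.
# All names below are invented for illustration; this is not ADBC/JDBC code.

def row_oriented_fetch():
    """Simulate a JDBC/ODBC-style driver: the result set arrives row by row."""
    return [(1, "alice"), (2, "bob"), (3, "carol")]

def columnar_fetch():
    """Simulate an Arrow-native driver: the result set arrives column by column."""
    return {"id": [1, 2, 3], "name": ["alice", "bob", "carol"]}

# A columnar client behind a row-oriented API must transpose the rows
# back into columns -- resource use that performs no "useful" work.
rows = row_oriented_fetch()
transposed = {
    "id": [row[0] for row in rows],
    "name": [row[1] for row in rows],
}

# An Arrow-native path delivers the same columns with no transpose step.
assert transposed == columnar_fetch()
```

The transpose is pure overhead: the data is identical either way, which is the conversion cost ADBC drivers are designed to avoid (or at least push down into the driver, where it can be optimized per database).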
[GitHub] [arrow-site] ianmcook commented on a diff in pull request #248: [Website] Add ADBC blog post
ianmcook commented on code in PR #248: URL: https://github.com/apache/arrow-site/pull/248#discussion_r1060114466
## _posts/2022-12-31-arrow-adbc.md
[GitHub] [arrow-site] ianmcook commented on a diff in pull request #248: [Website] Add ADBC blog post
ianmcook commented on code in PR #248: URL: https://github.com/apache/arrow-site/pull/248#discussion_r1060113919
## img/ADBC.svg
Review Comment:
This text in the diagram seems confusing:
> The database only works with Arrow data, regardless of the actual client.
The database does not necessarily "work with" Arrow data. It (ultimately) emits Arrow data, but internally might be working with the data in some other database-specific format. Is there a better way to express what you mean by this?
-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow-site] ianmcook commented on a diff in pull request #248: [Website] Add ADBC blog post
ianmcook commented on code in PR #248: URL: https://github.com/apache/arrow-site/pull/248#discussion_r1060110516
## _posts/2022-12-31-arrow-adbc.md
[GitHub] [arrow-site] ianmcook commented on a diff in pull request #248: [Website] Add ADBC blog post
ianmcook commented on code in PR #248: URL: https://github.com/apache/arrow-site/pull/248#discussion_r1060105359
## _posts/2022-12-31-arrow-adbc.md
> +Applications that use ADBC just get Arrow data.
Review Comment:
Or "simply receive" if you want to emphasize that no conversion is necessary.
[GitHub] [arrow-site] ianmcook commented on a diff in pull request #248: [Website] Add ADBC blog post
ianmcook commented on code in PR #248: URL: https://github.com/apache/arrow-site/pull/248#discussion_r1060104986
## _posts/2022-12-31-arrow-adbc.md
> +Applications that use ADBC just get Arrow data.
Review Comment:
```suggestion
Applications that use ADBC receive Arrow data.
```
[GitHub] [arrow-site] ianmcook commented on a diff in pull request #248: [Website] Add ADBC blog post
ianmcook commented on code in PR #248: URL: https://github.com/apache/arrow-site/pull/248#discussion_r1060103981
## _posts/2022-12-31-arrow-adbc.md
> +  But client applications that want to use multiple database vendors would need to integrate with each of them.
Review Comment:
```suggestion
  But client applications that want to support multiple database vendors would need to integrate with each of them.
```
[GitHub] [arrow-site] ianmcook commented on a diff in pull request #248: [Website] Add ADBC blog post
ianmcook commented on code in PR #248: URL: https://github.com/apache/arrow-site/pull/248#discussion_r1060103099
## _posts/2022-12-31-arrow-adbc.md
> +  Libraries like [Turbodbc][turbodbc] and [arrow-jdbc][arrow-jdbc] handle row to columnar conversions for clients.
Review Comment:
```suggestion
  Libraries like [Turbodbc][turbodbc] and [arrow-jdbc][arrow-jdbc] handle row-to-columnar conversions for clients.
```
[GitHub] [arrow-site] ianmcook commented on a diff in pull request #248: [Website] Add ADBC blog post
ianmcook commented on code in PR #248: URL: https://github.com/apache/arrow-site/pull/248#discussion_r1060100843 ## _posts/2022-12-31-arrow-adbc.md: ## @@ -0,0 +1,217 @@ +--- +layout: post +title: "Introducing ADBC: Database Access for Apache Arrow" +date: "2022-12-31 00:00:00" +author: pmc +categories: [application] +--- + + +The Arrow community would like to introduce version 1.0.0 of the [Arrow Database Connectivity (ADBC)][adbc] specification. +ADBC is a columnar, minimal-overhead alternative to JDBC/ODBC for analytical applications. +Or in other words: **ADBC is a single API for getting Arrow data in and out of different databases**. + +## Motivation + +Applications often use API standards like [JDBC][jdbc] and [ODBC][odbc] to work with databases. +That way, they can code to the same API regardless of the underlying database, saving on development time. +Roughly speaking, when an application executes a query with these APIs: + + + + The query execution flow. + + +1. The application submits a SQL query via the JDBC/ODBC API. +2. The query is passed on to the driver. +3. The driver translates the query to a database-specific protocol and sends it to the database. +4. The database executes the query and returns the result set in a database-specific format. +5. The driver translates the result format into the JDBC/ODBC API. +6. The application iterates over the result rows using the JDBC/ODBC API. + +When columnar data comes into play, however, problems arise. +JDBC is a row-oriented API, and while ODBC can support columnar data, the type system and data representation is not a perfect match with Arrow. +So generally, columnar data must be converted to rows between steps 5 and 6, spending resources without performing "useful" work. Review Comment: ```suggestion So generally, columnar data must be converted to rows in step 5, spending resources without performing "useful" work. ``` -- This is an automated message from the Apache Git Service. 
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
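The review comment above concerns the six-step JDBC/ODBC flow quoted from the draft post, and the cost it calls out: a row-oriented driver hands back rows, so a columnar client must transpose them straight back into columns. A minimal stdlib-only sketch of that transpose (not Arrow or ADBC code; the function name is invented for illustration) makes the "work without useful work" concrete:

```python
# Illustrative sketch of the row-to-column shuffle the quoted post describes.
# A JDBC/ODBC-style driver returns rows (step 5), so a columnar client must
# transpose them back into columns (step 6), touching every value once more.

def rows_to_columns(rows, names):
    """Transpose row tuples into a dict mapping column name -> list of values."""
    columns = {name: [] for name in names}
    for row in rows:                          # one extra pass over every value...
        for name, value in zip(names, row):
            columns[name].append(value)       # ...just to undo the driver's row layout
    return columns

result_set = [(1, "a"), (2, "b"), (3, "c")]   # rows from a JDBC/ODBC-style API
print(rows_to_columns(result_set, ["id", "label"]))
# {'id': [1, 2, 3], 'label': ['a', 'b', 'c']}
```

If the database was columnar to begin with, this transpose (and the driver's earlier column-to-row conversion) is pure overhead, which is the mismatch the post's Motivation section is describing.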
[GitHub] [arrow-site] ianmcook commented on a diff in pull request #248: [Website] Add ADBC blog post
ianmcook commented on code in PR #248: URL: https://github.com/apache/arrow-site/pull/248#discussion_r1060099812 ## _posts/2022-12-31-arrow-adbc.md: ## @@ -0,0 +1,217 @@ +--- +layout: post +title: "Introducing ADBC: Database Access for Apache Arrow" +date: "2022-12-31 00:00:00" +author: pmc +categories: [application] +--- + + +The Arrow community would like to introduce version 1.0.0 of the [Arrow Database Connectivity (ADBC)][adbc] specification. +ADBC is a columnar, minimal-overhead alternative to JDBC/ODBC for analytical applications. +Or in other words: **ADBC is a single API for getting Arrow data in and out of different databases**. + +## Motivation + +Applications often use API standards like [JDBC][jdbc] and [ODBC][odbc] to work with databases. +That way, they can code to the same API regardless of the underlying database, saving on development time. +Roughly speaking, when an application executes a query with these APIs: + + + + The query execution flow. + + +1. The application submits a SQL query via the JDBC/ODBC API. +2. The query is passed on to the driver. +3. The driver translates the query to a database-specific protocol and sends it to the database. +4. The database executes the query and returns the result set in a database-specific format. +5. The driver translates the result format into the JDBC/ODBC API. Review Comment: Is this what this means? ```suggestion 5. The driver translates the result into the format required by the JDBC/ODBC API. ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow-site] paleolimbot commented on pull request #288: [Website] WIP: Add nanoarrow blog post
paleolimbot commented on PR #288: URL: https://github.com/apache/arrow-site/pull/288#issuecomment-1369021346 Posting to the mailing list about this shortly...just adding that a rendered version of this that's easier to read can be found at https://github.com/paleolimbot/arrow-site/blob/nanoarrow-intro-post/_posts/2022-12-14-nanoarrow.md
[GitHub] [arrow-site] lidavidm commented on pull request #248: [Website] Add ADBC blog post
lidavidm commented on PR #248: URL: https://github.com/apache/arrow-site/pull/248#issuecomment-1368096230 @ksuarez1423, @ianmcook any final comments? (Especially since I rewrote the post quite heavily.) Thanks for all your help & happy New Year's!
[GitHub] [arrow-site] eitsupi commented on issue #291: [R] version selector is broken
eitsupi commented on issue #291: URL: https://github.com/apache/arrow-site/issues/291#issuecomment-1367675878 Thanks!
[GitHub] [arrow-site] kou commented on issue #291: [R] version selector is broken
kou commented on issue #291: URL: https://github.com/apache/arrow-site/issues/291#issuecomment-1367571443 Deployed.
[GitHub] [arrow-site] kou closed issue #291: [R] version selector is broken
kou closed issue #291: [R] version selector is broken URL: https://github.com/apache/arrow-site/issues/291
[GitHub] [arrow-site] kou merged pull request #292: GH-291: [R] Update versions.json for 10.0.1
kou merged PR #292: URL: https://github.com/apache/arrow-site/pull/292
[GitHub] [arrow-site] kou commented on issue #291: [R] version selector is broken
kou commented on issue #291: URL: https://github.com/apache/arrow-site/issues/291#issuecomment-1366966516 This was fixed by https://github.com/apache/arrow/pull/14887 but it's not deployed yet. #292 will fix this.
[GitHub] [arrow-site] kou opened a new pull request, #292: GH-291: [R] Update versions.json for 10.0.1
kou opened a new pull request, #292: URL: https://github.com/apache/arrow-site/pull/292 Closes #291.
[GitHub] [arrow-site] eitsupi opened a new issue, #291: [R] version selector is broken
eitsupi opened a new issue, #291: URL: https://github.com/apache/arrow-site/issues/291 The development version is displayed on the release version site and I cannot go from the release version to the development version site. (It is possible to go to the development version site from the past version.) ![image](https://user-images.githubusercontent.com/50911393/209799756-bae159f2-194b-44d3-a39a-5c412be9f88c.png) This is due to the fact that the following lines were not updated by #273? https://github.com/apache/arrow-site/blob/6090a411a8ad51f7ec90d3b366cee19904bc03c1/docs/r/versions.json#L2-L9
[GitHub] [arrow-site] alamb merged pull request #280: [WEBSITE]: Querying Parquet with Millisecond Latency
alamb merged PR #280: URL: https://github.com/apache/arrow-site/pull/280
[GitHub] [arrow-site] alamb commented on pull request #280: [WEBSITE]: Querying Parquet with Millisecond Latency
alamb commented on PR #280: URL: https://github.com/apache/arrow-site/pull/280#issuecomment-1365316876 Per the mailing list discussion https://lists.apache.org/thread/l377q5f20kyltb37m345p287kpo22qb6 I plan to publish this later today or tomorrow
[GitHub] [arrow-site] lidavidm commented on a diff in pull request #248: [Website] Add ADBC blog post
lidavidm commented on code in PR #248: URL: https://github.com/apache/arrow-site/pull/248#discussion_r1055873826 ## _posts/2022-12-31-arrow-adbc.md: ## @@ -0,0 +1,217 @@ +--- +layout: post +title: "Introducing ADBC: Database Access for Apache Arrow" +date: "2022-12-31 00:00:00" +author: pmc +categories: [application] +--- + + +The Arrow community would like to introduce version 1.0.0 of the [Arrow Database Connectivity (ADBC)][adbc] specification. +ADBC is a columnar, minimal-overhead alternative to JDBC/ODBC for analytical applications. +Or in other words: **ADBC is a single API for getting Arrow data in and out of different databases**. + +## Motivation + +Applications often use API standards like [JDBC][jdbc] and [ODBC][odbc] to work with databases. +That way, they can code to the same API regardless of the underlying database, saving on development time. +Roughly speaking, when an application executes a query with these APIs: + + + + The query execution flow. + + +1. The application submits a SQL query via the JDBC/ODBC API. +2. The query is passed on to the driver. +3. The driver translates the query to a database-specific protocol and sends it to the database. +4. The database executes the query and returns the result set in a database-specific format. +5. The driver translates the result format into the JDBC/ODBC API. +6. The application iterates over the result rows using the JDBC/ODBC API. + +When columnar data comes into play, however, problems arise. +JDBC is a row-oriented API, and while ODBC can support columnar data, the type system and data representation is not a perfect match with Arrow. +So generally, columnar data must be converted to rows between steps 5 and 6, spending resources without performing "useful" work. + +This mismatch is problematic for columnar database systems, such as ClickHouse, Dremio, DuckDB, and Google BigQuery. 
+On the client side, tools such as Apache Spark and pandas would be better off getting columnar data directly, skipping that conversion. +Otherwise, they're leaving performance on the table. +At the same time, that conversion isn't always avoidable. +Row-oriented database systems like PostgreSQL aren't going away, and these clients will still want to consume data from them. + +Developers have a few options: + +- *Just use JDBC/ODBC*. + These standards are here to stay, and it makes sense for databases to support them for applications that want them. + But when both the database and the application are columnar, that means converting data into rows for JDBC/ODBC, only for the client to convert them right back into columns! + Performance suffers, and developers have to spend time implementing the conversions. +- *Use JDBC/ODBC-to-Arrow conversion libraries*. + Libraries like [Turbodbc][turbodbc] and [arrow-jdbc][arrow-jdbc] handle row to columnar conversions for clients. + But this doesn't fundamentally solve the problem. + Unnecessary data conversions are still required. +- *Use vendor-specific protocols*. + For some databases, applications can use a database-specific protocol or SDK to directly get Arrow data. + For example, applications could use Dremio via [Arrow Flight SQL][flight-sql]. + But client applications that want to use multiple database vendors would need to integrate with each of them. + (Look at all the [connectors](https://trino.io/docs/current/connector.html) that Trino implements.) + And databases like PostgreSQL don't offer an option supporting Arrow in the first place. + +So in the status quo, clients must choose between either tedious integration work or leaving performance on the table. Review Comment: Thanks! Updated the preview. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
[GitHub] [arrow-site] ksuarez1423 commented on a diff in pull request #248: [Website] Add ADBC blog post
ksuarez1423 commented on code in PR #248: URL: https://github.com/apache/arrow-site/pull/248#discussion_r1055763555 ## _posts/2022-12-31-arrow-adbc.md: ## @@ -0,0 +1,217 @@ +--- +layout: post +title: "Introducing ADBC: Database Access for Apache Arrow" +date: "2022-12-31 00:00:00" +author: pmc +categories: [application] +--- + + +The Arrow community would like to introduce version 1.0.0 of the [Arrow Database Connectivity (ADBC)][adbc] specification. +ADBC is a columnar, minimal-overhead alternative to JDBC/ODBC for analytical applications. +Or in other words: **ADBC is a single API for getting Arrow data in and out of different databases**. + +## Motivation + +Applications often use API standards like [JDBC][jdbc] and [ODBC][odbc] to work with databases. +That way, they can code to the same API regardless of the underlying database, saving on development time. +Roughly speaking, when an application executes a query with these APIs: + + + + The query execution flow. + + +1. The application submits a SQL query via the JDBC/ODBC API. +2. The query is passed on to the driver. +3. The driver translates the query to a database-specific protocol and sends it to the database. +4. The database executes the query and returns the result set in a database-specific format. +5. The driver translates the result format into the JDBC/ODBC API. +6. The application iterates over the result rows using the JDBC/ODBC API. + +When columnar data comes into play, however, problems arise. +JDBC is a row-oriented API, and while ODBC can support columnar data, the type system and data representation is not a perfect match with Arrow. +So generally, columnar data must be converted to rows between steps 5 and 6, spending resources without performing "useful" work. + +This mismatch is problematic for columnar database systems, such as ClickHouse, Dremio, DuckDB, and Google BigQuery. 
+On the client side, tools such as Apache Spark and pandas would be better off getting columnar data directly, skipping that conversion. +Otherwise, they're leaving performance on the table. +At the same time, that conversion isn't always avoidable. +Row-oriented database systems like PostgreSQL aren't going away, and these clients will still want to consume data from them. + +Developers have a few options: + +- *Just use JDBC/ODBC*. + These standards are here to stay, and it makes sense for databases to support them for applications that want them. + But when both the database and the application are columnar, that means converting data into rows for JDBC/ODBC, only for the client to convert them right back into columns! + Performance suffers, and developers have to spend time implementing the conversions. +- *Use JDBC/ODBC-to-Arrow conversion libraries*. + Libraries like [Turbodbc][turbodbc] and [arrow-jdbc][arrow-jdbc] handle row to columnar conversions for clients. + But this doesn't fundamentally solve the problem. + Unnecessary data conversions are still required. +- *Use vendor-specific protocols*. + For some databases, applications can use a database-specific protocol or SDK to directly get Arrow data. + For example, applications could use Dremio via [Arrow Flight SQL][flight-sql]. + But client applications that want to use multiple database vendors would need to integrate with each of them. + (Look at all the [connectors](https://trino.io/docs/current/connector.html) that Trino implements.) + And databases like PostgreSQL don't offer an option supporting Arrow in the first place. + +So in the status quo, clients must choose between either tedious integration work or leaving performance on the table. Review Comment: ```suggestion As is, clients must choose between either tedious integration work or leaving performance on the table. We can make this better. ``` I think this could be punchier as a section ender. 
[GitHub] [arrow-site] lidavidm commented on a diff in pull request #248: [Website] Add ADBC blog post
lidavidm commented on code in PR #248: URL: https://github.com/apache/arrow-site/pull/248#discussion_r1055655147 ## _posts/2022-12-31-arrow-adbc.md: ## @@ -0,0 +1,217 @@ +--- +layout: post +title: "Introducing ADBC: Database Access for Apache Arrow" +date: "2022-12-31 00:00:00" +author: pmc +categories: [application] +--- + + +The Arrow community would like to introduce version 1.0.0 of the [Arrow Database Connectivity (ADBC)][adbc] specification. +ADBC is a columnar, minimal-overhead alternative to JDBC/ODBC for analytical applications. +Or in other words: **ADBC is a single API for getting Arrow data in and out of different databases**. + +## Motivation + +Applications often use API standards like [JDBC][jdbc] and [ODBC][odbc] to work with databases. +That way, they can code to the same API regardless of the underlying database, saving on development time. +Roughly speaking, when an application executes a query with these APIs: + +1. The application submits a SQL query via the JDBC/ODBC API. +2. The query is passed on to the driver. +3. The driver translates the query to a database-specific protocol and sends it to the database. +4. The database executes the query and returns the result set in a database-specific format. +5. The driver translates the result format into the JDBC/ODBC API. +6. The application iterates over the result rows using the JDBC/ODBC API. + + + + The query execution flow. + + +When columnar data comes into play, however, problems arise. +JDBC is a row-oriented API, and while ODBC can support columnar data, the type system and data representation is not a perfect match with Arrow. +So generally, columnar data must be converted to rows between steps 5 and 6, spending resources without performing "useful" work. + +This mismatch is problematic for columnar database systems, such as ClickHouse, Dremio, DuckDB, and Google BigQuery. 
+On the client side, tools such as Apache Spark and pandas would be better off getting columnar data directly, skipping that conversion. +Otherwise, they're leaving performance on the table. +At the same time, that conversion isn't always avoidable. +Row-oriented database systems like PostgreSQL aren't going away, and these clients will still want to consume data from them. + +Developers have a few options: + +- *Just use JDBC/ODBC*. + These standards are here to stay, and it makes sense for databases to support them for applications that want them. + But when both the database and the application are columnar, that means converting data into rows for JDBC/ODBC, only for the client to convert them right back into columns! + Performance suffers, and developers have to spend time implementing the conversions. +- *Use JDBC/ODBC to Arrow conversion libraries*. + Libraries like [Turbodbc][turbodbc] and [arrow-jdbc][arrow-jdbc] handle row to columnar conversions for clients. + But this doesn't fundamentally solve the problem. + Unnecessary data conversions are still required. +- *Use vendor-specific protocols*. + For some databases, applications can use a database-specific protocol or SDK to directly get Arrow data. + For example, applications could use Dremio via [Arrow Flight SQL][flight-sql]. + But client applications that want to use multiple database vendors would need to integrate with each of them. + (Look at all the [connectors](https://trino.io/docs/current/connector.html) that Trino implements.) + And databases like PostgreSQL don't offer an option supporting Arrow in the first place. + +So in the status quo, clients must choose between either tedious integration work or leaving performance on the table. + +## Introducing ADBC + +ADBC is an Arrow-based, vendor-netural API for interacting with databases. +Applications that use ADBC just get Arrow data. 
+They don't have to do any conversions themselves, and they don't have to integrate each database's specific SDK. + +Just like JDBC/ODBC, underneath the ADBC API are drivers that translate the API for specific databases. + +* A driver for an Arrow-native database just passes Arrow data through without conversion. +* A driver for a non-Arrow-native database must convert the data to Arrow. + This saves the application from doing that, and the driver can optimize the conversion for its database. + + + + The query execution flow with two different ADBC drivers. + + Review Comment: Updated (diagram precedes list in both places) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
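The quoted draft above describes the two kinds of ADBC drivers: one that passes Arrow data through untouched, and one that converts a row-based protocol to columns inside the driver. A hypothetical stdlib-only sketch of that split (class and method names are invented for illustration, not the real ADBC API) shows why the application code stays identical either way:

```python
# Sketch of the driver split described in the quoted post: the application
# always calls the same method and always receives columnar data; only the
# driver knows whether a conversion was needed.

class ArrowNativeDriver:
    """Driver for a database that already speaks a columnar protocol."""
    def __init__(self, batches):
        self._batches = batches            # columnar data straight off the wire

    def fetch_columns(self):
        return self._batches               # pass-through: no conversion at all


class RowProtocolDriver:
    """Driver for a row-based database (think the PostgreSQL wire format)."""
    def __init__(self, rows, names):
        self._rows, self._names = rows, names

    def fetch_columns(self):
        # The one unavoidable rows-to-columns conversion happens here, once,
        # inside the driver, instead of in every application.
        columns = {name: [] for name in self._names}
        for row in self._rows:
            for name, value in zip(self._names, row):
                columns[name].append(value)
        return columns


# Application code is identical against either driver:
for driver in (ArrowNativeDriver({"id": [1, 2]}),
               RowProtocolDriver([(1,), (2,)], ["id"])):
    print(driver.fetch_columns())
```

This is the point the post makes: the driver for a non-Arrow-native database "saves the application from doing that, and the driver can optimize the conversion for its database."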
[GitHub] [arrow-site] ianmcook commented on a diff in pull request #248: [Website] Add ADBC blog post
ianmcook commented on code in PR #248: URL: https://github.com/apache/arrow-site/pull/248#discussion_r1055650626 ## _posts/2022-12-31-arrow-adbc.md: ## @@ -0,0 +1,217 @@ +--- +layout: post +title: "Introducing ADBC: Database Access for Apache Arrow" +date: "2022-12-31 00:00:00" +author: pmc +categories: [application] +--- + + +The Arrow community would like to introduce version 1.0.0 of the [Arrow Database Connectivity (ADBC)][adbc] specification. +ADBC is a columnar, minimal-overhead alternative to JDBC/ODBC for analytical applications. +Or in other words: **ADBC is a single API for getting Arrow data in and out of different databases**. + +## Motivation + +Applications often use API standards like [JDBC][jdbc] and [ODBC][odbc] to work with databases. +That way, they can code to the same API regardless of the underlying database, saving on development time. +Roughly speaking, when an application executes a query with these APIs: + +1. The application submits a SQL query via the JDBC/ODBC API. +2. The query is passed on to the driver. +3. The driver translates the query to a database-specific protocol and sends it to the database. +4. The database executes the query and returns the result set in a database-specific format. +5. The driver translates the result format into the JDBC/ODBC API. +6. The application iterates over the result rows using the JDBC/ODBC API. + + + + The query execution flow. + + +When columnar data comes into play, however, problems arise. +JDBC is a row-oriented API, and while ODBC can support columnar data, the type system and data representation is not a perfect match with Arrow. +So generally, columnar data must be converted to rows between steps 5 and 6, spending resources without performing "useful" work. + +This mismatch is problematic for columnar database systems, such as ClickHouse, Dremio, DuckDB, and Google BigQuery. 
+On the client side, tools such as Apache Spark and pandas would be better off getting columnar data directly, skipping that conversion. +Otherwise, they're leaving performance on the table. +At the same time, that conversion isn't always avoidable. +Row-oriented database systems like PostgreSQL aren't going away, and these clients will still want to consume data from them. + +Developers have a few options: + +- *Just use JDBC/ODBC*. + These standards are here to stay, and it makes sense for databases to support them for applications that want them. + But when both the database and the application are columnar, that means converting data into rows for JDBC/ODBC, only for the client to convert them right back into columns! + Performance suffers, and developers have to spend time implementing the conversions. +- *Use JDBC/ODBC to Arrow conversion libraries*. + Libraries like [Turbodbc][turbodbc] and [arrow-jdbc][arrow-jdbc] handle row to columnar conversions for clients. + But this doesn't fundamentally solve the problem. + Unnecessary data conversions are still required. +- *Use vendor-specific protocols*. + For some databases, applications can use a database-specific protocol or SDK to directly get Arrow data. + For example, applications could use Dremio via [Arrow Flight SQL][flight-sql]. + But client applications that want to use multiple database vendors would need to integrate with each of them. + (Look at all the [connectors](https://trino.io/docs/current/connector.html) that Trino implements.) + And databases like PostgreSQL don't offer an option supporting Arrow in the first place. + +So in the status quo, clients must choose between either tedious integration work or leaving performance on the table. + +## Introducing ADBC + +ADBC is an Arrow-based, vendor-netural API for interacting with databases. +Applications that use ADBC just get Arrow data. 
+They don't have to do any conversions themselves, and they don't have to integrate each database's specific SDK. + +Just like JDBC/ODBC, underneath the ADBC API are drivers that translate the API for specific databases. + +* A driver for an Arrow-native database just passes Arrow data through without conversion. +* A driver for a non-Arrow-native database must convert the data to Arrow. + This saves the application from doing that, and the driver can optimize the conversion for its database. + + + + The query execution flow with two different ADBC drivers. + + Review Comment: Here the numbered list follows the diagram, whereas above, the numbered list precedes the diagram. It'd probably be best to have the order the same in both places. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow-site] lidavidm commented on a diff in pull request #248: [Website] Add ADBC blog post
lidavidm commented on code in PR #248: URL: https://github.com/apache/arrow-site/pull/248#discussion_r1055647370 ## _posts/2022-12-31-arrow-adbc.md: ## @@ -0,0 +1,217 @@ +--- +layout: post +title: "Introducing ADBC: Database Access for Apache Arrow" +date: "2022-12-31 00:00:00" +author: pmc +categories: [application] +--- + + +The Arrow community would like to introduce version 1.0.0 of the [Arrow Database Connectivity (ADBC)][adbc] specification. +ADBC is a columnar, minimal-overhead alternative to JDBC/ODBC for analytical applications. +Or in other words: **ADBC is a single API for getting Arrow data in and out of different databases**. + +## Motivation + +Applications often use API standards like [JDBC][jdbc] and [ODBC][odbc] to work with databases. +That way, they can code to the same API regardless of the underlying database, saving on development time. +Roughly speaking, when an application executes a query with these APIs: + +1. The application submits a SQL query via the JDBC/ODBC API. +2. The query is passed on to the driver. +3. The driver translates the query to a database-specific protocol and sends it to the database. +4. The database executes the query and returns the result set in a database-specific format. +5. The driver translates the result format into the JDBC/ODBC API. +6. The application iterates over the result rows using the JDBC/ODBC API. + + + + The query execution flow. + + +When columnar data comes into play, however, problems arise. +JDBC is a row-oriented API, and while ODBC can support columnar data, the type system and data representation is not a perfect match with Arrow. +So generally, columnar data must be converted to rows between steps 5 and 6, spending resources without performing "useful" work. + +This mismatch is problematic for columnar database systems, such as ClickHouse, Dremio, DuckDB, and Google BigQuery. 
+On the client side, tools such as Apache Spark and pandas would be better off getting columnar data directly, skipping that conversion. +Otherwise, they're leaving performance on the table. +At the same time, that conversion isn't always avoidable. +Row-oriented database systems like PostgreSQL aren't going away, and these clients will still want to consume data from them. + +Developers have a few options: + +- *Just use JDBC/ODBC*. + These standards are here to stay, and it makes sense for databases to support them for applications that want them. + But when both the database and the application are columnar, that means converting data into rows for JDBC/ODBC, only for the client to convert them right back into columns! + Performance suffers, and developers have to spend time implementing the conversions. +- *Use JDBC/ODBC to Arrow conversion libraries*. Review Comment: thanks! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow-site] ianmcook commented on a diff in pull request #248: [Website] Add ADBC blog post
ianmcook commented on code in PR #248: URL: https://github.com/apache/arrow-site/pull/248#discussion_r1055645668 ## _posts/2022-12-31-arrow-adbc.md: ## @@ -0,0 +1,217 @@ +--- +layout: post +title: "Introducing ADBC: Database Access for Apache Arrow" +date: "2022-12-31 00:00:00" +author: pmc +categories: [application] +--- + + +The Arrow community would like to introduce version 1.0.0 of the [Arrow Database Connectivity (ADBC)][adbc] specification. +ADBC is a columnar, minimal-overhead alternative to JDBC/ODBC for analytical applications. +Or in other words: **ADBC is a single API for getting Arrow data in and out of different databases**. + +## Motivation + +Applications often use API standards like [JDBC][jdbc] and [ODBC][odbc] to work with databases. +That way, they can code to the same API regardless of the underlying database, saving on development time. +Roughly speaking, when an application executes a query with these APIs: + +1. The application submits a SQL query via the JDBC/ODBC API. +2. The query is passed on to the driver. +3. The driver translates the query to a database-specific protocol and sends it to the database. +4. The database executes the query and returns the result set in a database-specific format. +5. The driver translates the result format into the JDBC/ODBC API. +6. The application iterates over the result rows using the JDBC/ODBC API. + + + + The query execution flow. + + +When columnar data comes into play, however, problems arise. +JDBC is a row-oriented API, and while ODBC can support columnar data, the type system and data representation is not a perfect match with Arrow. +So generally, columnar data must be converted to rows between steps 5 and 6, spending resources without performing "useful" work. + +This mismatch is problematic for columnar database systems, such as ClickHouse, Dremio, DuckDB, and Google BigQuery. 
+On the client side, tools such as Apache Spark and pandas would be better off getting columnar data directly, skipping that conversion. +Otherwise, they're leaving performance on the table. +At the same time, that conversion isn't always avoidable. +Row-oriented database systems like PostgreSQL aren't going away, and these clients will still want to consume data from them. + +Developers have a few options: + +- *Just use JDBC/ODBC*. + These standards are here to stay, and it makes sense for databases to support them for applications that want them. + But when both the database and the application are columnar, that means converting data into rows for JDBC/ODBC, only for the client to convert them right back into columns! + Performance suffers, and developers have to spend time implementing the conversions. +- *Use JDBC/ODBC to Arrow conversion libraries*. Review Comment: reads more clearly with hyphens ```suggestion - *Use JDBC/ODBC-to-Arrow conversion libraries*. ```
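The rows-to-columns conversion the quoted draft describes (between steps 5 and 6 of the JDBC/ODBC flow) can be sketched with Python's standard-library `sqlite3` module standing in for a row-oriented JDBC/ODBC driver — a minimal illustration of the mismatch, not the post's actual benchmark:

```python
import sqlite3

# A row-oriented API (Python's DB-API here, standing in for JDBC/ODBC):
# the driver hands results back as tuples, one per row.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?)", [(1, "a"), (2, "b"), (3, "c")])

rows = conn.execute("SELECT id, name FROM t ORDER BY id").fetchall()

# A columnar consumer must now transpose those rows back into columns --
# the conversion the post calls resource spend without "useful" work.
ids = [r[0] for r in rows]
names = [r[1] for r in rows]
print(ids)    # [1, 2, 3]
print(names)  # ['a', 'b', 'c']
```

If the database stored the data columnarly to begin with, this transpose (and the driver's earlier columns-to-rows step) is pure overhead on both ends.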
[GitHub] [arrow-site] lidavidm commented on a diff in pull request #248: [Website] Add ADBC blog post
lidavidm commented on code in PR #248: URL: https://github.com/apache/arrow-site/pull/248#discussion_r1053762319 ## _posts/2022-12-31-arrow-adbc.md: ## @@ -0,0 +1,252 @@ +--- +layout: post +title: "Introducing ADBC: Database Access for Apache Arrow" +date: "2022-12-31 00:00:00" +author: pmc +categories: [application] +--- + + +The Arrow community would like to introduce version 1.0.0 of the [Arrow Database Connectivity (ADBC)][adbc] specification. +**ADBC aims to be an columnar, minimal-overhead alternative JDBC/ODBC for analytical applications**. +It defines vendor-agnostic and Arrow-based APIs for common database tasks, like executing queries and getting basic metadata. +These APIs are available, either directly or via bindings, in C/C++, Go, Java, Python, Ruby, and soon R. + +With ADBC, developers get both the benefits of using columnar Arrow data and having generic API abstractions. +Like [JDBC][jdbc]/[ODBC][odbc], ADBC defines database-independent interaction APIs, and relies on drivers to implement those APIs for particular databases. +ADBC aims to bring all of these together under a single API: + +- Vendor-specific Arrow-native protocols, like [Arrow Flight SQL][flight-sql] or those offered by ClickHouse or Google BigQuery; +- Non-columnar protocols, like the PostgreSQL wire format; +- Non-columnar API abstractions, like JDBC/ODBC. + +In other words: **ADBC is a single API for getting Arrow data in and out of databases**. +Underneath, ADBC driver implementations take care of bridging the actual system: + +- Databases with Arrow-native protocols can directly pass data through without conversion. +- Otherwise, drivers can be built for specific row-based protocols, optimizing conversions to and from Arrow data as best as possible for particular databases. +- As a fallback, drivers can be built that convert data from JDBC/ODBC, bridging existing databases into an Arrow-native API. 
+ +In all cases, the application is saved the trouble of wrapping APIs and doing data conversions. + +## Motivation + +Applications often use API standards like JDBC and ODBC to work with databases. +This lets them use the same API regardless of the underlying database, saving on development time. +Roughly speaking, when an application executes a query with these APIs: + +1. The application submits a SQL query via the JDBC/ODBC APIs. +2. The query is passed on to the driver. +3. The driver translates the query to a database-specific protocol and sends it to the database. +4. The database executes the query and returns the result set in a database-specific format. +5. The driver translates the result format into the JDBC/ODBC API. + + + + The query execution flow. + + +When columnar data comes into play, however, problems arise. +JDBC is a row-oriented API, and while ODBC can support columnar data, the type system and data representation is not a perfect match with Arrow. +In both cases, this leads to data conversions around steps 4–5, spending resources without performing "useful" work. + +This mismatch is important for columnar database systems, such as ClickHouse, Dremio, DuckDB, Google BigQuery, and others. +Clients, such as Apache Spark and pandas, would like to get columnar data directly from these systems. +Meanwhile, traditional database systems aren't going away, and these clients still want to consume data from them. + +In response, we've seen a few solutions: + +- *Just provide JDBC/ODBC drivers*. + These standards are here to stay, and it makes sense to provide these interfaces for applications that want them. + But if both sides are columnar, that means converting data into rows for JDBC/ODBC, only for the client to convert them back into columns! +- *Provide converters from JDBC/ODBC to Arrow*. + Some examples include [Turbodbc][turbodbc] and [arrow-jdbc][arrow-jdbc]. 
+ This approach reduces the burden on client applications, but doesn't fundamentally solve the problem. + Unnecessary data conversions are still required. +- *Provide special SDKs*. + All of the columnar systems listed above do offer ways to get Arrow data, such as via [Arrow Flight SQL][flight-sql]. + But client applications need to spend time to integrate with each of them. + (Just look at all the [connectors](https://trino.io/docs/current/connector.html) that Trino implements.) + And not every system offers this option. + +ADBC combines the advantages of the latter two solutions under one API. +In other words, ADBC provides a set of API definitions that client applications code to. +These API definitions are Arrow-based. +The application then links to or loads drivers for the actual database, which implement the API definitions. +If the database is Arrow-native, the driver can just pass the data through without conversion. +Otherwise, the driver converts the data to Arrow format first. + + + + The query execution flow with two different ADBC drivers. + + +1. The application submits
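The driver model the quoted draft describes — applications code to one columnar API, and each driver either passes Arrow-native data through or converts a row-based protocol once, inside the driver — can be sketched conceptually. All names below are invented for illustration; the real ADBC APIs differ:

```python
from abc import ABC, abstractmethod

class Driver(ABC):
    """Hypothetical stand-in for a database driver interface the
    application codes against; results are columnar either way."""

    @abstractmethod
    def execute(self, query: str) -> dict:
        """Return a columnar result: column name -> list of values."""

class ArrowNativeDriver(Driver):
    # Database speaks a columnar protocol: pass data through unchanged.
    def execute(self, query):
        return self._fetch_columnar(query)

    def _fetch_columnar(self, query):
        return {"id": [1, 2], "name": ["a", "b"]}  # stand-in for wire data

class RowProtocolDriver(Driver):
    # Database returns rows (e.g. a PostgreSQL-style wire format):
    # the driver converts to columns once, hidden from the application.
    def execute(self, query):
        rows = self._fetch_rows(query)
        return {"id": [r[0] for r in rows], "name": [r[1] for r in rows]}

    def _fetch_rows(self, query):
        return [(1, "a"), (2, "b")]  # stand-in for wire data

# The application sees the same columnar result from either driver.
for driver in (ArrowNativeDriver(), RowProtocolDriver()):
    print(driver.execute("SELECT id, name FROM t"))
```

The point of the sketch is the placement of the conversion: it happens at most once, inside the driver that needs it, rather than in every client application.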
[GitHub] [arrow-site] lidavidm commented on a diff in pull request #248: [Website] Add ADBC blog post
lidavidm commented on code in PR #248: URL: https://github.com/apache/arrow-site/pull/248#discussion_r1053730160 ## _posts/2022-12-31-arrow-adbc.md: ## @@ -0,0 +1,252 @@ +--- +layout: post +title: "Introducing ADBC: Database Access for Apache Arrow" +date: "2022-12-31 00:00:00" +author: pmc +categories: [application] +--- + + +The Arrow community would like to introduce version 1.0.0 of the [Arrow Database Connectivity (ADBC)][adbc] specification. +**ADBC aims to be an columnar, minimal-overhead alternative JDBC/ODBC for analytical applications**. +It defines vendor-agnostic and Arrow-based APIs for common database tasks, like executing queries and getting basic metadata. +These APIs are available, either directly or via bindings, in C/C++, Go, Java, Python, Ruby, and soon R. + +With ADBC, developers get both the benefits of using columnar Arrow data and having generic API abstractions. +Like [JDBC][jdbc]/[ODBC][odbc], ADBC defines database-independent interaction APIs, and relies on drivers to implement those APIs for particular databases. +ADBC aims to bring all of these together under a single API: + +- Vendor-specific Arrow-native protocols, like [Arrow Flight SQL][flight-sql] or those offered by ClickHouse or Google BigQuery; +- Non-columnar protocols, like the PostgreSQL wire format; +- Non-columnar API abstractions, like JDBC/ODBC. + +In other words: **ADBC is a single API for getting Arrow data in and out of databases**. +Underneath, ADBC driver implementations take care of bridging the actual system: + +- Databases with Arrow-native protocols can directly pass data through without conversion. +- Otherwise, drivers can be built for specific row-based protocols, optimizing conversions to and from Arrow data as best as possible for particular databases. +- As a fallback, drivers can be built that convert data from JDBC/ODBC, bridging existing databases into an Arrow-native API. 
+ +In all cases, the application is saved the trouble of wrapping APIs and doing data conversions. + +## Motivation + +Applications often use API standards like JDBC and ODBC to work with databases. +This lets them use the same API regardless of the underlying database, saving on development time. +Roughly speaking, when an application executes a query with these APIs: + +1. The application submits a SQL query via the JDBC/ODBC APIs. +2. The query is passed on to the driver. +3. The driver translates the query to a database-specific protocol and sends it to the database. +4. The database executes the query and returns the result set in a database-specific format. +5. The driver translates the result format into the JDBC/ODBC API. + + + + The query execution flow. + + +When columnar data comes into play, however, problems arise. +JDBC is a row-oriented API, and while ODBC can support columnar data, the type system and data representation is not a perfect match with Arrow. +In both cases, this leads to data conversions around steps 4–5, spending resources without performing "useful" work. + +This mismatch is important for columnar database systems, such as ClickHouse, Dremio, DuckDB, Google BigQuery, and others. +Clients, such as Apache Spark and pandas, would like to get columnar data directly from these systems. +Meanwhile, traditional database systems aren't going away, and these clients still want to consume data from them. + +In response, we've seen a few solutions: + +- *Just provide JDBC/ODBC drivers*. + These standards are here to stay, and it makes sense to provide these interfaces for applications that want them. + But if both sides are columnar, that means converting data into rows for JDBC/ODBC, only for the client to convert them back into columns! +- *Provide converters from JDBC/ODBC to Arrow*. + Some examples include [Turbodbc][turbodbc] and [arrow-jdbc][arrow-jdbc]. 
+ This approach reduces the burden on client applications, but doesn't fundamentally solve the problem. + Unnecessary data conversions are still required. +- *Provide special SDKs*. + All of the columnar systems listed above do offer ways to get Arrow data, such as via [Arrow Flight SQL][flight-sql]. + But client applications need to spend time to integrate with each of them. + (Just look at all the [connectors](https://trino.io/docs/current/connector.html) that Trino implements.) + And not every system offers this option. + +ADBC combines the advantages of the latter two solutions under one API. +In other words, ADBC provides a set of API definitions that client applications code to. +These API definitions are Arrow-based. +The application then links to or loads drivers for the actual database, which implement the API definitions. +If the database is Arrow-native, the driver can just pass the data through without conversion. +Otherwise, the driver converts the data to Arrow format first. + + + + The query execution flow with two different ADBC drivers. + + +1. The application submits
[GitHub] [arrow-site] ksuarez1423 commented on a diff in pull request #248: [Website] Add ADBC blog post
ksuarez1423 commented on code in PR #248: URL: https://github.com/apache/arrow-site/pull/248#discussion_r1053672404 ## _posts/2022-12-31-arrow-adbc.md: ## @@ -0,0 +1,252 @@ +--- +layout: post +title: "Introducing ADBC: Database Access for Apache Arrow" +date: "2022-12-31 00:00:00" +author: pmc +categories: [application] +--- + + +The Arrow community would like to introduce version 1.0.0 of the [Arrow Database Connectivity (ADBC)][adbc] specification. +**ADBC aims to be an columnar, minimal-overhead alternative JDBC/ODBC for analytical applications**. +It defines vendor-agnostic and Arrow-based APIs for common database tasks, like executing queries and getting basic metadata. +These APIs are available, either directly or via bindings, in C/C++, Go, Java, Python, Ruby, and soon R. + +With ADBC, developers get both the benefits of using columnar Arrow data and having generic API abstractions. +Like [JDBC][jdbc]/[ODBC][odbc], ADBC defines database-independent interaction APIs, and relies on drivers to implement those APIs for particular databases. +ADBC aims to bring all of these together under a single API: + +- Vendor-specific Arrow-native protocols, like [Arrow Flight SQL][flight-sql] or those offered by ClickHouse or Google BigQuery; +- Non-columnar protocols, like the PostgreSQL wire format; +- Non-columnar API abstractions, like JDBC/ODBC. + +In other words: **ADBC is a single API for getting Arrow data in and out of databases**. +Underneath, ADBC driver implementations take care of bridging the actual system: + +- Databases with Arrow-native protocols can directly pass data through without conversion. +- Otherwise, drivers can be built for specific row-based protocols, optimizing conversions to and from Arrow data as best as possible for particular databases. +- As a fallback, drivers can be built that convert data from JDBC/ODBC, bridging existing databases into an Arrow-native API. 
+ +In all cases, the application is saved the trouble of wrapping APIs and doing data conversions. + +## Motivation + +Applications often use API standards like JDBC and ODBC to work with databases. +This lets them use the same API regardless of the underlying database, saving on development time. +Roughly speaking, when an application executes a query with these APIs: + +1. The application submits a SQL query via the JDBC/ODBC APIs. +2. The query is passed on to the driver. +3. The driver translates the query to a database-specific protocol and sends it to the database. +4. The database executes the query and returns the result set in a database-specific format. +5. The driver translates the result format into the JDBC/ODBC API. + + + + The query execution flow. + + +When columnar data comes into play, however, problems arise. +JDBC is a row-oriented API, and while ODBC can support columnar data, the type system and data representation is not a perfect match with Arrow. +In both cases, this leads to data conversions around steps 4–5, spending resources without performing "useful" work. + +This mismatch is important for columnar database systems, such as ClickHouse, Dremio, DuckDB, Google BigQuery, and others. +Clients, such as Apache Spark and pandas, would like to get columnar data directly from these systems. +Meanwhile, traditional database systems aren't going away, and these clients still want to consume data from them. + +In response, we've seen a few solutions: + +- *Just provide JDBC/ODBC drivers*. + These standards are here to stay, and it makes sense to provide these interfaces for applications that want them. + But if both sides are columnar, that means converting data into rows for JDBC/ODBC, only for the client to convert them back into columns! +- *Provide converters from JDBC/ODBC to Arrow*. + Some examples include [Turbodbc][turbodbc] and [arrow-jdbc][arrow-jdbc]. 
+ This approach reduces the burden on client applications, but doesn't fundamentally solve the problem. + Unnecessary data conversions are still required. +- *Provide special SDKs*. + All of the columnar systems listed above do offer ways to get Arrow data, such as via [Arrow Flight SQL][flight-sql]. + But client applications need to spend time to integrate with each of them. + (Just look at all the [connectors](https://trino.io/docs/current/connector.html) that Trino implements.) + And not every system offers this option. + +ADBC combines the advantages of the latter two solutions under one API. +In other words, ADBC provides a set of API definitions that client applications code to. +These API definitions are Arrow-based. +The application then links to or loads drivers for the actual database, which implement the API definitions. +If the database is Arrow-native, the driver can just pass the data through without conversion. +Otherwise, the driver converts the data to Arrow format first. + + + + The query execution flow with two different ADBC drivers. + + +1. The application
[GitHub] [arrow-site] lidavidm commented on a diff in pull request #248: [Website] Add ADBC blog post
lidavidm commented on code in PR #248: URL: https://github.com/apache/arrow-site/pull/248#discussion_r1053653049 ## _posts/2022-12-31-arrow-adbc.md: ## @@ -0,0 +1,252 @@ +--- +layout: post +title: "Introducing ADBC: Database Access for Apache Arrow" +date: "2022-12-31 00:00:00" +author: pmc +categories: [application] +--- + + +The Arrow community would like to introduce version 1.0.0 of the [Arrow Database Connectivity (ADBC)][adbc] specification. +**ADBC aims to be an columnar, minimal-overhead alternative JDBC/ODBC for analytical applications**. +It defines vendor-agnostic and Arrow-based APIs for common database tasks, like executing queries and getting basic metadata. +These APIs are available, either directly or via bindings, in C/C++, Go, Java, Python, Ruby, and soon R. + +With ADBC, developers get both the benefits of using columnar Arrow data and having generic API abstractions. +Like [JDBC][jdbc]/[ODBC][odbc], ADBC defines database-independent interaction APIs, and relies on drivers to implement those APIs for particular databases. +ADBC aims to bring all of these together under a single API: + +- Vendor-specific Arrow-native protocols, like [Arrow Flight SQL][flight-sql] or those offered by ClickHouse or Google BigQuery; +- Non-columnar protocols, like the PostgreSQL wire format; +- Non-columnar API abstractions, like JDBC/ODBC. + +In other words: **ADBC is a single API for getting Arrow data in and out of databases**. +Underneath, ADBC driver implementations take care of bridging the actual system: + +- Databases with Arrow-native protocols can directly pass data through without conversion. +- Otherwise, drivers can be built for specific row-based protocols, optimizing conversions to and from Arrow data as best as possible for particular databases. +- As a fallback, drivers can be built that convert data from JDBC/ODBC, bridging existing databases into an Arrow-native API. 
+ +In all cases, the application is saved the trouble of wrapping APIs and doing data conversions. + +## Motivation + +Applications often use API standards like JDBC and ODBC to work with databases. +This lets them use the same API regardless of the underlying database, saving on development time. +Roughly speaking, when an application executes a query with these APIs: + +1. The application submits a SQL query via the JDBC/ODBC APIs. +2. The query is passed on to the driver. +3. The driver translates the query to a database-specific protocol and sends it to the database. +4. The database executes the query and returns the result set in a database-specific format. +5. The driver translates the result format into the JDBC/ODBC API. + + + + The query execution flow. + + +When columnar data comes into play, however, problems arise. +JDBC is a row-oriented API, and while ODBC can support columnar data, the type system and data representation is not a perfect match with Arrow. +In both cases, this leads to data conversions around steps 4–5, spending resources without performing "useful" work. + +This mismatch is important for columnar database systems, such as ClickHouse, Dremio, DuckDB, Google BigQuery, and others. +Clients, such as Apache Spark and pandas, would like to get columnar data directly from these systems. +Meanwhile, traditional database systems aren't going away, and these clients still want to consume data from them. + +In response, we've seen a few solutions: + +- *Just provide JDBC/ODBC drivers*. + These standards are here to stay, and it makes sense to provide these interfaces for applications that want them. + But if both sides are columnar, that means converting data into rows for JDBC/ODBC, only for the client to convert them back into columns! +- *Provide converters from JDBC/ODBC to Arrow*. + Some examples include [Turbodbc][turbodbc] and [arrow-jdbc][arrow-jdbc]. 
+ This approach reduces the burden on client applications, but doesn't fundamentally solve the problem. + Unnecessary data conversions are still required. +- *Provide special SDKs*. + All of the columnar systems listed above do offer ways to get Arrow data, such as via [Arrow Flight SQL][flight-sql]. + But client applications need to spend time to integrate with each of them. + (Just look at all the [connectors](https://trino.io/docs/current/connector.html) that Trino implements.) + And not every system offers this option. + +ADBC combines the advantages of the latter two solutions under one API. +In other words, ADBC provides a set of API definitions that client applications code to. +These API definitions are Arrow-based. +The application then links to or loads drivers for the actual database, which implement the API definitions. +If the database is Arrow-native, the driver can just pass the data through without conversion. +Otherwise, the driver converts the data to Arrow format first. + + + + The query execution flow with two different ADBC drivers. + + +1. The application submits
[GitHub] [arrow-site] lidavidm commented on a diff in pull request #248: [Website] Add ADBC blog post
lidavidm commented on code in PR #248: URL: https://github.com/apache/arrow-site/pull/248#discussion_r1053650752 ## _posts/2022-12-31-arrow-adbc.md: ## @@ -0,0 +1,252 @@ +--- +layout: post +title: "Introducing ADBC: Database Access for Apache Arrow" +date: "2022-12-31 00:00:00" +author: pmc +categories: [application] +--- + + +The Arrow community would like to introduce version 1.0.0 of the [Arrow Database Connectivity (ADBC)][adbc] specification. +**ADBC aims to be an columnar, minimal-overhead alternative JDBC/ODBC for analytical applications**. +It defines vendor-agnostic and Arrow-based APIs for common database tasks, like executing queries and getting basic metadata. +These APIs are available, either directly or via bindings, in C/C++, Go, Java, Python, Ruby, and soon R. + +With ADBC, developers get both the benefits of using columnar Arrow data and having generic API abstractions. +Like [JDBC][jdbc]/[ODBC][odbc], ADBC defines database-independent interaction APIs, and relies on drivers to implement those APIs for particular databases. +ADBC aims to bring all of these together under a single API: + +- Vendor-specific Arrow-native protocols, like [Arrow Flight SQL][flight-sql] or those offered by ClickHouse or Google BigQuery; +- Non-columnar protocols, like the PostgreSQL wire format; +- Non-columnar API abstractions, like JDBC/ODBC. + +In other words: **ADBC is a single API for getting Arrow data in and out of databases**. Review Comment: Right, three classes within a single API. I'll think about rewording this a bit.
[GitHub] [arrow-site] ianmcook merged pull request #290: [Website] Add ADBC to Subprojects menu
ianmcook merged PR #290: URL: https://github.com/apache/arrow-site/pull/290
[GitHub] [arrow-site] github-actions[bot] commented on pull request #290: [Website] Add ADBC to Subprojects menu
github-actions[bot] commented on PR #290: URL: https://github.com/apache/arrow-site/pull/290#issuecomment-1359609916 Thanks for opening a pull request! Could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW Then could you also rename pull request title in the following format? ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY} See also: * [Other pull requests](https://github.com/apache/arrow-site/pulls/) * [Contribution Guidelines - How to contribute patches](https://arrow.apache.org/docs/developers/contributing.html#how-to-contribute-patches)