This is an automated email from the ASF dual-hosted git repository.
raulcd pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/arrow-site.git
The following commit(s) were added to refs/heads/main by this push:
new 28c3cd4ffc7 Website: Add blog post for 16.0.0 (#500)
28c3cd4ffc7 is described below
commit 28c3cd4ffc7552c973e59eebd365962db72fe77a
Author: Raúl Cumplido <[email protected]>
AuthorDate: Mon Apr 29 15:31:34 2024 +0200
Website: Add blog post for 16.0.0 (#500)
Co-authored-by: David Li <[email protected]>
Co-authored-by: Curt Hagenlocher <[email protected]>
Co-authored-by: Matt Topol <[email protected]>
Co-authored-by: Alenka Frim <[email protected]>
Co-authored-by: Dane Pitkin <[email protected]>
Co-authored-by: Bryce Mecum <[email protected]>
Co-authored-by: Sutou Kouhei <[email protected]>
---
_posts/2024-04-20-16.0.0-release.md | 299 ++++++++++++++++++++++++++++++++++++
1 file changed, 299 insertions(+)
diff --git a/_posts/2024-04-20-16.0.0-release.md b/_posts/2024-04-20-16.0.0-release.md
new file mode 100644
index 00000000000..44fe02c6290
--- /dev/null
+++ b/_posts/2024-04-20-16.0.0-release.md
@@ -0,0 +1,299 @@
+---
+layout: post
+title: "Apache Arrow 16.0.0 Release"
+date: "2024-04-20 00:00:00"
+author: pmc
+categories: [release]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+
+The Apache Arrow team is pleased to announce the 16.0.0 release. This covers
+over 3 months of development work and includes [**385 resolved issues**][1]
+on [**586 distinct commits**][2] from [**119 distinct contributors**][2].
+See the [Install Page](https://arrow.apache.org/install/)
+to learn how to get the libraries for your platform.
+
+The release notes below are not exhaustive and only expose selected highlights
+of the release. Many other bugfixes and improvements have been made: we refer
+you to the [complete changelog][3].
+
+## Community
+
+Since the 15.0.0 release, Jeffrey Vo, Jay Zhan, Bryce Mecum, Joel Lubinitsky,
+and Sarah Gilmore have been invited to be committers.
+No new members have joined the Project Management Committee (PMC).
+
+Thanks for your contributions and participation in the project!
+
+## C Data Interface notes
+
+- Added `RegisterDeviceMemoryManager` and `GetDeviceMemoryManager` for managing
mappings from a device type and id to a memory manager
([GH-40698](https://github.com/apache/arrow/issues/40698)).
+- Added `RegisterCUDADevice` to register CUDA devices
([GH-40698](https://github.com/apache/arrow/issues/40698)).
+- Added `ImportFromChunkedArray` and `ExportChunkedArray` for handling Chunked
Arrays in the C Stream Interface
([GH-38717](https://github.com/apache/arrow/issues/38717)).
+- Fixed an issue where string and nested types weren’t being correctly
imported with DeviceArray
([GH-39769](https://github.com/apache/arrow/issues/39769)).
+- Added support for copying Arrays and RecordBatches between memory types
([GH-39771](https://github.com/apache/arrow/issues/39771)).
+
+## Arrow Flight RPC notes
+
+- Session variable RPCs were added
([GH-34865](https://github.com/apache/arrow/issues/34865))
+- Go: cookies can be copied to another connection to reuse existing
credentials ([GH-39837](https://github.com/apache/arrow/issues/39837))
+- Go: enable PollFlightInfo for Flight SQL clients/servers
([GH-39574](https://github.com/apache/arrow/issues/39574))
+- Java: the JDBC driver now tries all locations the server sends it
([GH-38573](https://github.com/apache/arrow/issues/38573))
+- Java: tweak some options to give better performance
([GH-40475](https://github.com/apache/arrow/issues/40745),
[GH-40039](https://github.com/apache/arrow/issues/40039))
+
+## C++ notes
+For C++ notes refer to the full changelog.
+
+## Highlights
+
+- Initial support for Azure Blob Storage has been added
([GH-18014](https://github.com/apache/arrow/issues/18014)).
+- Arrow C++ can now be built with Emscripten
([GH-37821](https://github.com/apache/arrow/pull/37821)) which lays the
foundation for running Arrow C++ under WASM runtimes and eventually
[PyArrow](https://github.com/apache/arrow/pull/37822) as well.
+- Arrow's filesystem modules have been separated out into individual libraries
and this change enables writing and registering custom filesystem
implementations ([GH-38309](https://github.com/apache/arrow/issues/38309)).
+- Conversion from `Table` and `RecordBatch` to a `Tensor` (not the same as a
+[tensor extension
array](https://arrow.apache.org/docs/dev/format/CanonicalExtensions.html#official-list))
+is under development. The work is tracked in an umbrella issue
([GH-40058](https://github.com/apache/arrow/issues/40058)),
+and the issues covering the `RecordBatch` conversion are included in this
release
+([GH-40059](https://github.com/apache/arrow/issues/40059),
+[GH-40357](https://github.com/apache/arrow/issues/40357),
+[GH-40297](https://github.com/apache/arrow/issues/40297),
+[GH-40060](https://github.com/apache/arrow/issues/40060),
+[GH-40061](https://github.com/apache/arrow/issues/40061) and
+[GH-40866](https://github.com/apache/arrow/issues/40866)). As a result, a
`RecordBatch` can now be
+converted to a column-major or row-major two-dimensional structure.
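
To illustrate the difference between the two layouts, here is a minimal pure-Python sketch. It does not use Arrow, and `columns_to_tensor` and `element_at` are hypothetical helper names chosen for this example. It flattens equal-length columns into a row-major or column-major buffer with shape `(n_rows, n_cols)`:

```python
def columns_to_tensor(columns, row_major=True):
    """Flatten equal-length columns into one buffer plus a (rows, cols) shape."""
    n_rows = len(columns[0])
    if any(len(col) != n_rows for col in columns):
        raise ValueError("all columns must have the same length")
    n_cols = len(columns)
    if row_major:
        # Values of one row are adjacent in memory.
        data = [col[r] for r in range(n_rows) for col in columns]
    else:
        # Values of one column are adjacent in memory.
        data = [value for col in columns for value in col]
    return data, (n_rows, n_cols)


def element_at(data, shape, row, col, row_major=True):
    """Look up a single element in the flattened buffer."""
    n_rows, n_cols = shape
    index = row * n_cols + col if row_major else col * n_rows + row
    return data[index]


columns = [[1, 2, 3], [10, 20, 30]]
row_data, shape = columns_to_tensor(columns, row_major=True)
col_data, _ = columns_to_tensor(columns, row_major=False)
```

Both buffers describe the same 3x2 structure; only the order in which elements are stored differs.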
+
+## Breaking Changes
+
+- `Function::is_impure` has been renamed to `is_pure`
([GH-40607](https://github.com/apache/arrow/issues/40607)).
+
+## Compute
+
+### Bug Fixes
+
+- Fixed a potential crash when accessing the `true_count` property on a
BooleanArray ([GH-41016](https://github.com/apache/arrow/issues/41016)).
+
+### Performance improvements
+
+- Significantly improved performance of the take kernel on certain types of
inputs ([GH-40207](https://github.com/apache/arrow/issues/40207)).
+
+### Enhancements
+
+- Support for casting to and from half-float (float16) has been added
([GH-20213](https://github.com/apache/arrow/issues/20213)).
+- Added support for residual predicates to Swiss Join implementation
([GH-20339](https://github.com/apache/arrow/issues/20339)).
+- Expanded the primitive filter implementation to support all fixed-width
primitive types, and the take filter implementation to support all well-known
fixed-width types ([GH-39740](https://github.com/apache/arrow/issues/39740)).
+- Added support for calling the `binary_slice` kernel on Fixed-Size Binary
Arrays ([GH-39231](https://github.com/apache/arrow/issues/39231)).
+- The cast kernel now supports casting from LargeString, Binary, and
LargeBinary to Dictionary
([GH-39463](https://github.com/apache/arrow/issues/39463)).
+- Fields of different decimal precision can now be used together in arithmetic
operations without an explicit cast beforehand
([GH-40126](https://github.com/apache/arrow/issues/40126)).
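
As a rough illustration of what a round trip through half-float involves, the following sketch uses Python's standard `struct` module rather than Arrow's cast kernel. The helper name `to_float16_and_back` is made up for this example. It narrows a value to 16 bits and widens it again, which is lossless only for values that half precision can represent exactly:

```python
import struct


def to_float16_and_back(value: float) -> float:
    """Narrow a Python float to IEEE 754 half precision and widen it again."""
    packed = struct.pack("<e", value)  # "e" is the 2-byte half-float format
    return struct.unpack("<e", packed)[0]


exact = to_float16_and_back(0.5)   # 0.5 is exactly representable
approx = to_float16_and_back(0.1)  # 0.1 is not, so the result is rounded
```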
+
+## Datasets
+
+- Improved backpressure handling in the Dataset Writer which can significantly
reduce memory usage for some use cases
([GH-40722](https://github.com/apache/arrow/pull/40722)).
+
+## Parquet
+
+- Byte stream split encoding support has been added for FIXED_LEN_BYTE_ARRAY,
INT32, and INT64 which enables this encoding for half-float (float16) and
fixed-width decimal ([GH-39978](https://github.com/apache/arrow/issues/39978)).
+- Decoding boolean values has been made faster for a variety of cases
([GH-40872](https://github.com/apache/arrow/issues/40872)).
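
The idea behind byte stream split can be sketched in a few lines of pure Python. This illustrates the byte layout only, not the Arrow or Parquet implementation, and the function names are hypothetical: byte `i` of every fixed-width value is gathered into stream `i`, and the streams are then concatenated.

```python
import struct


def byte_stream_split_encode(raw: bytes, width: int) -> bytes:
    """Gather byte i of every `width`-byte value into stream i, then concatenate."""
    if len(raw) % width != 0:
        raise ValueError("input length must be a multiple of the value width")
    return b"".join(raw[i::width] for i in range(width))


def byte_stream_split_decode(encoded: bytes, width: int) -> bytes:
    """Interleave the streams again to recover the original values."""
    count = len(encoded) // width
    out = bytearray(len(encoded))
    for i in range(width):
        out[i::width] = encoded[i * count:(i + 1) * count]
    return bytes(out)


# Two little-endian INT32 values: 04 03 02 01 and 14 13 12 11.
raw = struct.pack("<2i", 0x01020304, 0x11121314)
encoded = byte_stream_split_encode(raw, 4)
decoded = byte_stream_split_decode(encoded, 4)
```

Grouping bytes of the same significance tends to help a general-purpose compressor applied afterwards, which is why the encoding is attractive for fixed-width numeric data.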
+
+## Filesystems
+
+### New Features
+
+- In addition to building the individual filesystem implementations as
separate modules, users can now write and register custom filesystem
implementations ([GH-38309](https://github.com/apache/arrow/issues/38309)).
+- A new environment variable, `AWS_ENDPOINT_URL_S3`, has been added which
allows separately overriding the endpoint for S3 operations alone
([GH-38663](https://github.com/apache/arrow/issues/38663)).
+
+### Bug Fixes
+
+- Fixed a bug in the S3 filesystem implementation that could cause a crash
when deleting an object having duplicate forward slashes in its name
([GH-38821](https://github.com/apache/arrow/issues/38821)).
+- Fixed a bug where `hash_mean` could silently overflow
([GH-38833](https://github.com/apache/arrow/issues/38833)).
+
+### Improvements
+
+- The S3 implementation now sets the content-type of directory-like objects to
application/x-directory to improve compatibility with other S3 tools
([GH-38794](https://github.com/apache/arrow/issues/38794)).
+- Repeated S3Client initialization is now roughly an order of magnitude faster
([GH-40299](https://github.com/apache/arrow/pull/40299)).
+- The MemoryPoolStats implementation has been reworked to re-order loads and
stores which may be an improvement for some allocation-heavy, multi-threaded
applications ([GH-40783](https://github.com/apache/arrow/issues/40783)).
+
+### Substrait
+
+- Support has been added to Substrait for a variety of Arrow types
([GH-40695](https://github.com/apache/arrow/issues/40695)).
+- substrait-cpp has been upgraded to 0.44
([GH-40695](https://github.com/apache/arrow/issues/40695)).
+
+## Development
+
+- Added support for the mold and lld linkers for building Arrow C++
([GH-40394](https://github.com/apache/arrow/issues/40394),
[GH-40400](https://github.com/apache/arrow/issues/40400)).
+
+### Miscellaneous
+
+- Upgraded ORC to 2.0.0
([GH-40507](https://github.com/apache/arrow/issues/40507)).
+- Upgraded zstd to 1.5.6
([GH-40837](https://github.com/apache/arrow/pull/40837)).
+- Upgraded google benchmark to 1.8.3
([GH-39863](https://github.com/apache/arrow/issues/39863)).
+- Upgraded zlib to 1.3.1
([GH-39876](https://github.com/apache/arrow/issues/39876)).
+- Various ToString methods now support an optional `show_metadata` argument
which will print metadata that may exist in nested types
([GH-39864](https://github.com/apache/arrow/issues/39864)).
+
+## C# notes
+- IPC record batch compression has been implemented
[GH-24834](https://github.com/apache/arrow/issues/24834)
+- Optional materialization of C# string arrays is now supported
[GH-41047](https://github.com/apache/arrow/issues/41047)
+- A memory leak in the C Data interface has been fixed
[GH-40898](https://github.com/apache/arrow/issues/40898)
+- Various other bug fixes and improvements.
+
+
+## Go Notes
+
+* The Golang Arrow and Parquet libraries now require Go 1.21+
([GH-40733](https://github.com/apache/arrow/issues/40733))
+
+### Bug Fixes
+
+#### Arrow
+
+* FlightSQL Driver will now properly handle concurrent result sets instead of
pulling the entire result into memory
([GH-40089](https://github.com/apache/arrow/issues/40089))
+* FlightSQL driver will now correctly respect the `DriverConfig.TLSEnabled`
field ([GH-40097](https://github.com/apache/arrow/issues/40097))
+* Fixed a panic on 32-bit architectures
([GH-40672](https://github.com/apache/arrow/issues/40672))
+* Corrected a precision loss for Decimal types when converting to JSON
([GH-40693](https://github.com/apache/arrow/issues/40693))
+* Fixed an issue with `array.RecordBuilder` when using a NullType column
([GH-40719](https://github.com/apache/arrow/issues/40719))
+
+#### Parquet
+
+* Fixed a panic when writing a DeltaBinaryPacked column containing only nulls
([GH-35718](https://github.com/apache/arrow/issues/35718))
+* Fixed a panic when writing a ListOf(DeltaBinaryPacked) field with no data
([GH-39309](https://github.com/apache/arrow/issues/39309))
+* Arrow DATE64 types will now be properly coerced into Parquet DATE[32-bit]
logical type ([GH-39456](https://github.com/apache/arrow/issues/39456))
+* Fixed the timezone semantics for timestamp conversion from Arrow to Parquet
([GH-39466](https://github.com/apache/arrow/issues/39466))
+* Corrected an inaccuracy with `RowGroupTotalCompressedBytes` and
`RowGroupTotalBytesWritten` for Parquet file writer
([GH-39870](https://github.com/apache/arrow/issues/39870))
+* Fixed the `TotalCompressedBytes` count when falling back to plain encoding
if a dictionary is too large
([GH-39921](https://github.com/apache/arrow/issues/39921))
+* Fixed a bug when reslicing a nullable dictionary in the chunked writer
([GH-39925](https://github.com/apache/arrow/issues/39925))
+
+### Enhancements
+
+#### Arrow
+
+* Users can now access the underlying `MemoTable` of a dictionary builder
([GH-38988](https://github.com/apache/arrow/issues/38988))
+* Added an option to provide a string replacer for CSV writing
([GH-39552](https://github.com/apache/arrow/issues/39552))
+* Flight: Cookies can be copied to another connection to reuse existing
credentials ([GH-39837](https://github.com/apache/arrow/issues/39837))
+* Flight: enable PollFlightInfo for Flight SQL clients/servers
([GH-39574](https://github.com/apache/arrow/issues/39574))
+* Added the ability to create a PreparedStatement from persisted data and
provided access for FlightSQL users to the PreparedStatement handle property
([GH-39774](https://github.com/apache/arrow/issues/39774),
[GH-39910](https://github.com/apache/arrow/issues/39910))
+* FlightRPC Session management extensions have been implemented
([GH-40155](https://github.com/apache/arrow/issues/40155))
+
+#### Parquet
+
+* New compression codecs can now be registered for Parquet
([GH-40113](https://github.com/apache/arrow/issues/40113))
+* Parquet footers can be incrementally written without closing the file
([GH-40630](https://github.com/apache/arrow/issues/40630))
+
+## Java notes
+- A breaking change to support Java 9 modules has been implemented in this
release. [GH-39001](https://github.com/apache/arrow/issues/39001)
+- A new Float16 type has been added.
[GH-39680](https://github.com/apache/arrow/issues/39680)
+- Java 22 is supported.
[GH-40680](https://github.com/apache/arrow/issues/40680)
+- Various bug fixes and improvements.
+
+
+## JavaScript notes
+
+* Dates are now stored as TimestampMillisecond
+ ([GH-40892](https://github.com/apache/arrow/pull/40892))
+* Vectors created from typed arrays are now correctly not nullable and null
+ counts are now correct
+ ([GH-40852](https://github.com/apache/arrow/pull/40852))
+
+## Python notes
+
+Compatibility notes:
+* The umbrella issue for ensuring PyArrow compatibility with NumPy 2.0 has
been closed ([GH-39532](https://github.com/apache/arrow/issues/39532)), with
the last related issues included in the 16.0.0 release
([GH-41098](https://github.com/apache/arrow/issues/41098),
[GH-39848](https://github.com/apache/arrow/issues/39848) and
[GH-40376](https://github.com/apache/arrow/issues/40376)).
+* PyArrow no longer uses pandas internals to create Block objects and instead
uses the new pandas API when running with pandas version 3
[GH-35081](https://github.com/apache/arrow/issues/35081)
+* Pandas compatibility code has been simplified because old pandas and Python
versions are no longer supported
[GH-40720](https://github.com/apache/arrow/issues/40720)
+* Deprecated `pyarrow.filesystem` legacy implementations have been removed
[GH-20127](https://github.com/apache/arrow/issues/20127)
+
+New features:
+* Converting Arrow `Table` and `RecordBatch` to a `Tensor` (not the same as
[tensor extension
array](https://arrow.apache.org/docs/dev/format/CanonicalExtensions.html#official-list))
is being developed in Arrow C++ with bindings in Python. Umbrella issue:
([GH-40058](https://github.com/apache/arrow/issues/40058)). In current release
the option to convert a `RecordBatch` to `Tensor` with
`pyarrow.RecordBatch.to_tensor(...)` is added returning a row or column major
tensor with an option of [...]
+* `ListView` and `LargeListView` array formats are now supported by PyArrow
([GH-39812](https://github.com/apache/arrow/issues/39812),
[GH-39855](https://github.com/apache/arrow/issues/39855),
[GH-40205](https://github.com/apache/arrow/issues/40205),
[GH-41039](https://github.com/apache/arrow/issues/41039),
[GH-40266](https://github.com/apache/arrow/issues/40266))
+* `BinaryView` and `StringView` are now supported in PyArrow
([GH-39651](https://github.com/apache/arrow/issues/39651),
[GH-39852](https://github.com/apache/arrow/issues/39852),
[GH-40092](https://github.com/apache/arrow/issues/40092))
+* Final support for Run-End Encoded arrays in PyArrow has been included
(conversion to numpy and pandas
[GH-40659](https://github.com/apache/arrow/issues/40659), construction in
`pa.array(...)` [GH-40273](https://github.com/apache/arrow/issues/40273))
+* `AsofJoinNode` C++ functionality is now exposed in Python as a `join_asof`
[GH-34235](https://github.com/apache/arrow/issues/34235)
+* Minimal Python bindings have been added for AzureFilesystem
[GH-39968](https://github.com/apache/arrow/issues/39968)
+* `FixedSizeTensorScalar` class is added
[GH-37484](https://github.com/apache/arrow/issues/37484)
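
For readers unfamiliar with the run-end encoded layout mentioned above, here is a minimal pure-Python sketch. The helpers are hypothetical and no PyArrow is involved. Each run is described by a value and by the logical index at which the run ends:

```python
def run_end_encode(items):
    """Return (run_ends, values); run_ends[i] is the exclusive end index of run i."""
    run_ends, values = [], []
    for index, item in enumerate(items):
        if values and values[-1] == item:
            # Same value as the current run: extend the run.
            run_ends[-1] = index + 1
        else:
            # A new value starts a new run.
            values.append(item)
            run_ends.append(index + 1)
    return run_ends, values


def run_end_decode(run_ends, values):
    """Expand (run_ends, values) back into the full sequence."""
    out = []
    start = 0
    for end, value in zip(run_ends, values):
        out.extend([value] * (end - start))
        start = end
    return out


run_ends, values = run_end_encode([1, 1, 2, 2, 2, 3])
restored = run_end_decode(run_ends, values)
```

Storing cumulative run ends rather than run lengths means the run covering a given logical index can be located with a binary search.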
+
+Other improvements:
+* Add ChunkedArray import/export to/from C
[GH-39984](https://github.com/apache/arrow/issues/39984)
+* `pyarrow.Field` and `pyarrow.ChunkedArray` can now be constructed from
objects supporting the PyCapsule Arrow C Data Interface
[GH-38010](https://github.com/apache/arrow/issues/38010)
+* `requested_schema` is now supported in `__arrow_c_stream__` implementations
[GH-40066](https://github.com/apache/arrow/issues/40066)
+* Add low-level bindings for exporting/importing the C Device Interface
+ [GH-39979](https://github.com/apache/arrow/issues/39979)
+* A function to download and extract the timezone database on a Windows
machine has been added [GH-37328](https://github.com/apache/arrow/issues/37328)
+* Missing methods are added to `pyarrow.RecordBatch`
[GH-30915](https://github.com/apache/arrow/issues/30915)
+* Dictionary is now also accepted in `pyarrow.record_batch` factory function
(as in `pyarrow.table`) [GH-40291](https://github.com/apache/arrow/issues/40291)
+* Usage of scalar legacy cast has been removed
[GH-40023](https://github.com/apache/arrow/issues/40023)
+* The missing `byte_width` attribute has been added to all DataType classes
[GH-39277](https://github.com/apache/arrow/issues/39277)
+* `FileInfo` instances can now be used to construct Dataset objects
[GH-40142](https://github.com/apache/arrow/issues/40142)
+* Support hashing for `FileMetaData` and `ParquetSchema`
[GH-39780](https://github.com/apache/arrow/issues/39780)
+* `force_virtual_addressing` is exposed in PyArrow
[GH-39779](https://github.com/apache/arrow/issues/39779)
+
+Relevant bug fixes:
+* Calling `pyarrow.dataset.ParquetFileFormat.make_write_options` as a class
method now returns a warning
[GH-39440](https://github.com/apache/arrow/issues/39440)
+* `ScalarMemoTable` is now initialized only when deduplication is enabled,
which avoids large memory consumption when it is disabled
[GH-40316](https://github.com/apache/arrow/issues/40316)
+* Slicing an array backwards beyond the start now correctly includes the first
item ([GH-38768](https://github.com/apache/arrow/issues/38768) and
[GH-40642](https://github.com/apache/arrow/issues/40642))
+* A memory leak when creating an Arrow array from a Python list of dicts has
been fixed [GH-37989](https://github.com/apache/arrow/issues/37989)
+* `FixedSizeListType` was not previously considered a nested type and has now
been added to `_NESTED_TYPES`
[GH-40171](https://github.com/apache/arrow/issues/40171)
+* `max_chunksize` is now validated in `Table.to_batches`
[GH-39788](https://github.com/apache/arrow/issues/39788)
+* Raising `ValueError` on `_ensure_partitioning` in Dataset is fixed
[GH-39579](https://github.com/apache/arrow/issues/39579)
+
+* Python stacktrace is now attached to errors in `ConvertPyError`
[GH-37164](https://github.com/apache/arrow/issues/37164)
+
+## R notes
+
+* Arrow IPC streams (i.e., `write_ipc_stream`) can now be written to socket
+connections ([GH-38897](https://github.com/apache/arrow/pull/38897))
+* The `print()` output for `Dataset` and `Table` objects has been improved so
+it now shows dimensions and truncates its output in the case of wide schemas
+([GH-38917](https://github.com/apache/arrow/pull/38917))
+* Various improvements and fixes to documentation, package build, and CI
systems
+
+For more on what’s in the 16.0.0 R package, see the [R changelog][4].
+
+## Ruby and C GLib notes
+
+### Ruby
+
+- Added support for customizing timestamp parsers.
+ [GH-40590](https://github.com/apache/arrow/issues/40590)
+
+### C GLib
+
+- Added support for time zone in `GArrowTimestampDataType`.
+ [GH-39702](https://github.com/apache/arrow/issues/39702)
+- Added missing compute function options.
+ [GH-40402](https://github.com/apache/arrow/issues/40402)
+ - `GArrowSplitPatternOptions`
+ - `GArrowStrftimeOptions`
+ - `GArrowStrptimeOptions`
+ - `GArrowStructFieldOptions`
+- Changed documentation generator to GI-DocGen from GTK-Doc.
+ [GH-39935](https://github.com/apache/arrow/issues/39935)
+- Added `GArrowTimestampParser`.
+ [GH-40438](https://github.com/apache/arrow/issues/40438)
+- Added support for customizing timestamp parsers.
+ [GH-40590](https://github.com/apache/arrow/issues/40590)
+
+## Rust notes
+
+The Rust projects have moved to separate repositories outside the
+main Arrow monorepo. For notes on the latest release of the Rust
+implementation, see the latest [Arrow Rust changelog][5].
+
+[1]: https://github.com/apache/arrow/milestone/59?closed=1
+[2]: {{ site.baseurl }}/release/16.0.0.html#contributors
+[3]: {{ site.baseurl }}/release/16.0.0.html#changelog
+[4]: {{ site.baseurl }}/docs/r/news/
+[5]: https://github.com/apache/arrow-rs/tags