sgilmore10 commented on code in PR #668:
URL: https://github.com/apache/arrow-site/pull/668#discussion_r2210812417

##########
_posts/2025-07-16-21.0.0-release.md:
##########
@@ -0,0 +1,214 @@
+---
+layout: post
+title: "Apache Arrow 21.0.0 Release"
+date: "2025-07-16 00:00:00"
+author: pmc
+categories: [release]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+The Apache Arrow team is pleased to announce the 21.0.0 release. This release
+covers over 2 months of development work and includes [**1234 resolved
+issues**][1] on [**1234 distinct commits**][2] from [**1234 distinct
+contributors**][2]. See the [Install Page](https://arrow.apache.org/install/) to
+learn how to get the libraries for your platform.
+
+The release notes below are not exhaustive and only expose selected highlights
+of the release. Many other bugfixes and improvements have been made: we refer
+you to the [complete changelog][3].
+
+## Community
+
+Since the 20.0.0 release, Alenka Frim has been invited to join the Project
+Management Committee (PMC).
+
+Thanks for your contributions and participation in the project!
+
+## Release Highlights
+
+## Columnar Format
+
+## Arrow Flight RPC Notes
+
+## C++ Notes
+
+### Acero
+
+### Compute
+
+The Cast function is now able to reorder fields when casting from one
+struct type to another; the fields are matched by name, not by index (GH-45028).
+
+Many compute kernels have been moved into a separate, optional, shared library
+(GH-25025). This allows reducing the Arrow C++ distribution size when the
+compute functionality is not being used by the application. Note that some
+compute functions, such as the Cast function, will still be built for internal
+use in various Arrow components.
+
+Better half-float support has been added to some compute functions:
+`is_nan`, `is_inf`, `is_finite`, `negate`, `negate_checked`, `sign` (GH-45083);
+`if_else`, `case_when`, `coalesce`, `choose`, `replace_with_mask`, `fill_null_forward`,
+`fill_null_backward` (GH-37027); `run_end_encode`, `run_end_decode` (GH-46285).
+
+Better decimal32 and decimal64 support has been added to some compute functions:
+`run_end_encode`, `run_end_decode` (GH-46285).
+
+A new function `utf8_zero_fill` acts like Python's `str.zfill` method by providing
+a left-padding function that preserves the optional leading plus/minus sign (GH-46683).
+
+Decimal sum aggregation now produces a decimal result with an increased precision
+in order to reduce the risk of overflowing the result type (GH-35166).
+
+### CSV
+
+Reading Duration columns is now supported (GH-40278).
+
+### Dataset
+
+It is now possible to preserve order when writing a dataset multi-threaded.
+The feature is disabled by default (GH-26818).
+
+### Filesystems
+
+The S3 filesystem can optionally be built into a separate DLL (GH-40343).
+
+### Parquet
+
+#### Encryption
+
+A new `SecureString` class must now be used to communicate sensitive data (such as
+secret keys) with Parquet encryption APIs. This class automatically wipes its
+contents from memory when destroyed, unlike regular `std::string` (GH-31603).
+
+#### Type support
+
+The new VARIANT logical type is supported at a low level, and an extension type
+`parquet.variant` is added to reflect such columns when reading them to Arrow
+(GH-45937).
+
+The UUID logical type is automatically converted to/from the `arrow.uuid`
+canonical extension type when reading or writing Parquet data, respectively.
+
+The GEOMETRY and GEOGRAPHY logical types are supported (GH-45522). They are
+automatically converted to/from the corresponding GeoArrow extension type,
+if it has been registered by GeoArrow. Geospatial column statistics are also
+supported.
+
+It is now possible to read BYTE_ARRAY columns directly as LargeBinary or BinaryView,
+without any intermediate conversion from Binary. Similarly, those types can be
+written directly to Parquet (GH-43041). This allows bypassing the 2 GiB data
+per chunk limitation of the Binary type, and can also improve performance.
+This also applies to String types when a Parquet column has the STRING logical type.
+
+Similarly, LIST columns can now be read directly as LargeList rather than List.
+This allows bypassing the 2^31 values per chunk limitation of regular List types
+(GH-46676).
+
+#### Other Parquet improvements
+
+A new feature named Content-Defined Chunking improves deduplication of Parquet
+files with mostly identical contents, by choosing data page boundaries based on
+actual contents rather than a number of values. For that, it uses a rolling hash
+function, and the min and max chunk size can be chosen. The feature is disabled by
+default and can be enabled on a per-file basis in the Parquet `WriterProperties`
+(GH-45750).
+
+The `EncodedStatistics` of a column chunk are publicly exposed in `ColumnChunkMetaData`
+and can be read faster than if decoded as `Statistics` (GH-46462).
+
+SIMD optimizations for the BYTE_STREAM_SPLIT have been improved (GH-46788).
+
+Reading FIXED_LEN_BYTE_ARRAY data has been made significantly faster (up to 3x
+faster on some benchmarks). This benefits logical types such as FLOAT16 (GH-43891).
+
+### Miscellaneous C++ changes
+
+The `ARROW_USE_PRECOMPILED_HEADERS` build option was removed, as `CMAKE_UNITY_BUILD`
+usually provides more benefits while requiring less maintenance.
+
+New data creation helpers `ArrayFromJSONString`, `ChunkedArrayFromJSONString`,
+`DictArrayFromJSONString`, `ScalarFromJSONString` and `DictScalarFromJSONString`
+are now exposed publicly. While not as high-performing as `BufferBuilder` and
+the concrete `ArrayBuilder` subclasses, they allow easy creation of test
+or example data, for example:
+```c++
+  ARROW_ASSIGN_OR_RAISE(
+      auto string_array,
+      arrow::ArrayFromJSONString(arrow::utf8(), R"(["Hello", "World", null])"));
+  ARROW_ASSIGN_OR_RAISE(
+      auto list_array,
+      arrow::ArrayFromJSONString(arrow::list(arrow::int32()),
+                                 "[[1, null, 2], [], [3]]"));
+```
+
+Some APIs were changed to accept `std::string_view` instead of `const std::string&`.
+Most uses of those APIs should not be affected (GH-46551).
+
+A new pretty-print option allows limiting element size when printing string
+or binary data (GH-46403).
+
+It is now possible to export `Tensor` data using
+[DLPack](https://dmlc.github.io/dlpack/latest/) (GH-39294).
+
+Half-float arrays can be properly diff'ed and pretty-printed (GH-36753).
+
+Some header files in `arrow/util` that were not supposed to be exposed are
+now made internal (GH-46459).
+
+## C# Notes
+
+- TODO: Note about future repo move plan. "This is the final release of the C# implementation of Arrow as part of the monorepo...etc.
+
+## Java, JavaScript, Go, and Rust Notes
+
+The Java, JavaScript, Go, and Rust projects have moved to separate repositories outside
+the main Arrow [monorepo](https://github.com/apache/arrow).
+
+- For notes on the latest release of the [Java
+implementation](https://github.com/apache/arrow-java), see the latest [Arrow
+Java changelog][7].
+- For notes on the latest release of the [JavaScript
+implementation](https://github.com/apache/arrow-js), see the latest [Arrow
+JavaScript changelog][8].
+- For notes on the latest release of the [Rust
+  implementation](https://github.com/apache/arrow-rs) see the latest [Arrow Rust
+  changelog][5].
+- For notes on the latest release of the [Go
+implementation](https://github.com/apache/arrow-go), see the latest [Arrow Go
+changelog][6].
+
+## Linux Packaging Notes
+
+## Python Notes

Review Comment:
```suggestion
## MATLAB Notes

### New Features

Added support for creating an `arrow.tabular.Table` from a list of `arrow.tabular.RecordBatch` instances ([GH-46877](https://github.com/apache/arrow/issues/46877))

### Packaging

The MLTBX available in apache/arrow's GitHub Releases area was built against MATLAB R2025a.

## Python Notes
```