[GitHub] [arrow-site] thisisnic commented on a diff in pull request #382: [Website] Version 13.0.0 blog post

via GitHub Fri, 21 Jul 2023 03:43:30 -0700


thisisnic commented on code in PR #382:
URL: https://github.com/apache/arrow-site/pull/382#discussion_r1270533036



##########
_posts/2023-07-17-13.0.0-release.md:
##########
@@ -0,0 +1,315 @@
+---
+layout: post
+title: "Apache Arrow 13.0.0 Release"
+date: "2023-07-17 00:00:00"
+author: pmc
+categories: [release]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+
+The Apache Arrow team is pleased to announce the 13.0.0 release. This covers
+over 3 months of development work and includes [**XXX resolved issues**][1]
+from [**YYY distinct contributors**][2]. See the [Install 
Page](https://arrow.apache.org/install/)
+to learn how to get the libraries for your platform.
+
+The release notes below are not exhaustive and only expose selected highlights
+of the release. Many other bugfixes and improvements have been made: we refer
+you to the [complete changelog][3].
+
+## Community
+
+Since the 12.0.0 release, Marco Neumann, Gang Wu, Mehmet Ozan Kabak and Kevin 
Gurney
+have been invited to be committers.
+Matt Topol, Jie Wen, Ben Baumgold and Dewey Dunnington have joined the
+Project Management Committee (PMC).
+
+Thanks for your contributions and participation in the project!
+
+## Columnar Format Notes
+
+The [run-end encoded 
layout](https://arrow.apache.org/docs/dev/format/Columnar.html#run-end-encoded-layout)
+has been added.  This layout can allow data with long runs of duplicate values 
to be encoded and processed
+efficiently.  Initial support has been added for 
[C++](https://github.com/apache/arrow/pull/14179) and
+[Go](https://github.com/apache/arrow/pull/14223).
+
+### C Device Data Interface
+
+An **experimental** new specification, the
+[C Device Data 
Interface](https://arrow.apache.org/docs/dev/format/CDeviceDataInterface.html),
+has been accepted for inclusion (GH-34971). It builds on the existing
+C Data Interface to provide a runtime-agnostic zero-copy sharing mechanism
+for Arrow data residing on non-CPU devices.
+
+Reference implementations of the C Device Data Interface will progressively
+be added to the standard Arrow libraries after the 13.0.0 release.
+
+## Arrow Flight RPC notes
+
+Support for flagging ordered result sets to clients is now added. 
([#34852](https://github.com/apache/arrow/issues/34852))
+
+gRPC 1.30 is now the minimum supported version in C++/Python/R/etc. 
([#34679](https://github.com/apache/arrow/issues/36479)) 
+
+In C++, various methods now receive a full `ServerCallContext` 
([#35442](https://github.com/apache/arrow/issues/35442), 
[#35377](https://github.com/apache/arrow/issues/35377)) and the context now 
exposes headers sent by the client 
([#35375](https://github.com/apache/arrow/issues/35375)).
+
+## C++ notes
+
+### Building
+
+CMake 3.16 or later is now required for building Arrow C++ (GH-34921).
+
+Optimizations are not disabled anymore when the `RelWithDebInfo` build type
+is selected (GH-35850). Furthermore, compiler flags can now properly be
+customized per-build type using `ARROW_C_FLAGS_DEBUG`, `ARROW_CXX_FLAGS_DEBUG`
+and related variables (GH-35870).
+
+### Acero
+
+Handling of unaligned buffers is input nodes can be configured programmatically
+or by setting the environment variable `ACERO_ALIGNMENT_HANDLING`. The default
+behavior is to warn when an unaligned buffer is detected (GH-35498).
+
+### Compute
+
+Several new functions have been added:
+* aggregate functions "first", "last", "first_last" (GH-34911);
+* vector functions "cumulative_prod", "cumulative_min", "cumulative_max" 
(GH-32190);
+* vector function "pairwise_diff" (GH-35786).
+
+Sorting now works on dictionary arrays, with a much better performance than
+the naive approach of sorting the decoded dictionary (GH-29887). Sorting also
+works on struct arrays, and nested sort keys are supported using `FieldRed` 
(GH-33206).
+
+The `check_overflow` option has been removed from `CumulativeSumOptions` as
+it was redundant with the availability of two different functions:
+"cumulative_sum" and "cumulative_sum_checked" (GH-35789).
+
+Run-end encoded filters are efficiently supported (GH-35749).
+
+Duration types are supported with the "is_in" and "index_in" functions 
(GH-36047).
+They can be multiplied with all integer types (GH-36128).
+
+"is_in" and "index_in" now cast their inputs more flexibly: they first attempt
+to cast the value set to the input type, then in the other direction if the
+former fails (GH-36203).
+
+Multiple bugs have been fixed in "utf8_slice_codeunits" when the `stop` option
+is omitted (GH-36311).
+
+### Dataset
+
+A custom schema can now be passed when writing a dataset (GH-35730). The custom
+schema can alter nullability or metadata information, but is not allowed to
+change the datatypes written.
+
+### Filesystems
+
+The S3 filesystem now writes files in equal-sized chunks, for compatibility 
with
+Cloudflare's "R2" Storage (GH-34363).
+
+A long-standing issue where S3 support could crash at shutdown because of 
resources
+still being alive after S3 finalization has been fixed (GH-36346). Now, 
attempts
+to use S3 resources (such as making filesystem calls) after S3 finalization 
should
+result in a clean error.
+
+The GCS filesystem accepts a new option to set the project id (GH-36227).
+
+### IPC
+
+Nullability and metadata information for sub-fields of map types is now 
preserved
+when deserializing Arrow IPC (GH-35297).
+
+### Orc
+
+The Orc adapter now maps Arrow field metadata to Orc type attributes when 
writing,
+and vice-versa when reading (GH-35304).
+
+### Parquet
+
+It is now possible to write additional metadata while a `ParquetFileWriter` is
+open (GH-34888).
+
+Writing a page index can be enabled selectively per-column (GH-34949).
+In addition, page header statistics are not written anymore if the page
+index is enabled for the given column (GH-34375), as the information would
+be redundant and less efficiently accessed.
+
+Parquet writer properties allow specifying the sorting columns (GH-35331).
+The user is responsible for ensuring that the data written to the file
+actually complies with the given sorting.
+
+CRC computation has been implemented for v2 data pages (GH-35171).
+It was already implemented for v1 data pages.
+
+Writing compliant nested types is now enabled by default (GH-29781). This
+should not have any negative implication.
+
+Attempting to load a subset of an Arrow extension type is now forbidden
+(GH-20385). Previously, if an extension type's storage is nested (for example
+a "Point" extension type backed by a `struct<x: float64, y: float64>`),
+it was possible to load selectively some of the columns of the storage type.
+
+### Substrait
+
+Support for various functions has been added: "stddev", "variance", "first",
+"last" (GH-35247, GH-35506).
+
+Deserializing sorts is now supported (GH-32763). However, some features,
+such as clustered sort direction or custom sort functions, are not
+implemented.
+
+### Miscellaneous
+
+`FieldRef` sports additional methods to get a flattened version of nested
+fields (GH-14946). Compared to their non-flattened counterparts,
+the methods `GetFlattened`, `GetAllFlattened`, `GetOneFlattened` and
+`GetOneOrNoneFlattened` combine a child's null bitmap with its ancestors'
+null bitmaps such as to compute the field's overall logical validity bitmap.
+
+In other words, given the struct array `[null, {'x': null}, {'x': 5}]`,
+`FieldRef("x")::Get` might return `[0, null, 5]`
+while `FieldRef("y")::GetFlattened` will *always* return `[null, null, 5]`.
+
+`Scalar::hash()` has been fixed for sliced nested arrays (GH-35360).
+
+A new floating-point to decimal conversion algorithm exhibits much better
+precision (GH-35576).
+
+It is now possible to cast between scalars of different list-like types
+(GH-36309).
+
+## C# notes
+
+## Go notes
+
+### Enhancements
+
+#### Arrow
+
+* Compute arithmetic functions are now available for Float16 
([GH-35162](https://github.com/apache/arrow/issues/35162))
+* Float16, Large* and Fixed types are all now supported by the CSV 
reader/writer ([GH-36105](https://github.com/apache/arrow/issues/36105) and 
[GH-36141](https://github.com/apache/arrow/issues/36141))
+* CSV Reader uses `AppendValueFromString` for extension types and properly 
reads empty values as null 
([GH-35188](https://github.com/apache/arrow/issues/35188) and 
[GH-35190](https://github.com/apache/arrow/issues/35190))
+* [Substrait](https://github.com/substrait-io/substrait-go) expressions can 
now be executed using the Compute library 
([GH-35652](https://github.com/apache/arrow/issues/35652))
+* You can now read back values from Dictionary Builders before finishing the 
array ([GH-35711](https://github.com/apache/arrow/issues/35711))
+* `MapType.ValueField` and `MapType.ValueType` are now deprecated in favor of 
`MapType.Elem().(*StructType)` 
([GH-35909](https://github.com/apache/arrow/issues/35909))
+* Multiple equality functions which have been deprecated since v9 have  now 
been removed (Such as `array.ArraySliceEqual` in favor of `array.SliceEqual`) 
([GH-36198](https://github.com/apache/arrow/issues/36198))
+* `ValueStr` method on Timestamp arrays now includes the zone in the output 
([GH-36568](https://github.com/apache/arrow/issues/36568))
+* *BREAKING CHANGE* `FixedSizeListBuilder.AppendNull` no longer requires 
manually appending nulls to the underlying list  
([GH-35482](https://github.com/apache/arrow/issues/35482))
+
+#### Flight
+
+* FlightSQL driver supports non-prepared queries now 
([GH-35136](https://github.com/apache/arrow/issues/35136))
+
+#### Parquet
+
+* Error messages in row group writer have been improved 
([GH-36319](https://github.com/apache/arrow/issues/36319))
+
+### Bug Fixes
+
+* Cross architecture build failures with v12.0.1 have been fixed 
([GH-36052](https://github.com/apache/arrow/issues/36052))
+
+#### Arrow
+
+* It is now possible to build the Arrow Go lib using tinygo for building 
smaller WASM binaries ([GH-32832](https://github.com/apache/arrow/issues/32832))
+* `Fields` method for Schema and StructType now returns a copy of the slice to 
ensure immutability ([GH-35306](https://github.com/apache/arrow/issues/35306) 
and [GH-35866](https://github.com/apache/arrow/issues/35866))
+* `array.ApproxEqual` for Maps now allows entries for a given element to be 
presented in any order 
([GH-35828](https://github.com/apache/arrow/issues/35828))
+* Fix issues with decimal256 arrays 
([GH-35911](https://github.com/apache/arrow/issues/35911), 
[GH-35965](https://github.com/apache/arrow/issues/35965), and 
[GH-35975](https://github.com/apache/arrow/issues/35975))
+* StructType now allows duplicate field names correctly 
([GH-36014](https://github.com/apache/arrow/issues/36014))
+
+#### Flight
+
+* Fix crash in client middleware 
([GH-35240](https://github.com/apache/arrow/issues/35240))
+
+#### Parquet
+
+* Various memory leaks addressed in pqarrow package 
([GH-35015](https://github.com/apache/arrow/issues/35015))
+* Fixed panic for `ListOf(types)` if null 
([GH-35684](https://github.com/apache/arrow/issues/35684))
+
+
+## Java notes
+
+The JNI bindings for Arrow Dataset now support execute 
[Substrait](https://substrait.io/) plans via the 
[Acero](https://arrow.apache.org/docs/dev/cpp/streaming_execution.html) query 
engine. ([#34223](https://github.com/apache/arrow/issues/34223))
+
+Arrow packages that depend on Netty (most notably, `arrow-memory-netty`, but 
also Arrow Flight) now require Netty 4.1.94.Final at a minimum. 
([#36209](https://github.com/apache/arrow/issues/36209))
+
+`VectorSchemaRoot#slice` now always makes a copy, including when the slice 
covers all rows (previously it did not make a copy in this case). This is a 
potentially-breaking change if your application depended on the old behavior. 
([#35275](https://github.com/apache/arrow/issues/35275))
+
+Debug info for allocations is no longer automatically enabled when assertions 
are enabled (e.g. when running unit tests). Instead, support must be explicitly 
enabled. This is not quite a breaking change, but may be surprising if you are 
used to using this information while debugging tests. However, performance 
should be greatly improved while running tests. 
([#34338](https://github.com/apache/arrow/issues/34338))
+
+Support for the upcoming Java 21 was added, though we do not yet test this in 
CI ([#5053](https://github.com/apache/arrow/issues/35053)). The JNI bindings 
for Arrow Dataset now expose JSON support 
([#36421](https://github.com/apache/arrow/issues/36421)). Dictionary 
replacement is now supported when writing the IPC stream format 
([#18547](https://github.com/apache/arrow/issues/18547)).
+
+## JavaScript notes
+
+## Python notes
+
+Compatibility notes:
+
+* The default format version for Parquet has been bumped from 2.4 to 2.6 
[GH-35746](https://github.com/apache/arrow/issues/35746)
+* Support for Python 3.7 is dropped 
[GH-34788](https://github.com/apache/arrow/issues/34788)
+
+New features:
+
+* Conversion to non-nano datetime64 for pandas >= 2.0 is now supported 
[GH-33321](https://github.com/apache/arrow/issues/33321)
+* Write page index is now supported 
[GH-36284](https://github.com/apache/arrow/issues/36284)
+* Bindings for reading JSON format in Dataset are added 
[GH-34216](https://github.com/apache/arrow/issues/34216)
+* `keys_sorted` property of MapType is now exposed 
[GH-35112](https://github.com/apache/arrow/issues/35112)
+
+Other improvements:
+
+* Common python functionality between `Table` and `RecordBatch` classes has 
been consolidated ( [GH-36129](https://github.com/apache/arrow/issues/36129), 
[GH-35415](https://github.com/apache/arrow/issues/35415), 
[GH-35390](https://github.com/apache/arrow/issues/35390), 
[GH-34979](https://github.com/apache/arrow/issues/34979), 
[GH-34868](https://github.com/apache/arrow/issues/34868), 
[GH-31868](https://github.com/apache/arrow/issues/31868))
+* Some functionality for `FixedShapeTensorType` has been improved 
(`__reduce__` [GH-36038](https://github.com/apache/arrow/issues/36038), 
picklability [GH-35599](https://github.com/apache/arrow/issues/35599))
+* Pyarrow scalars can now be accepted in the `array` constructor 
[GH-21761](https://github.com/apache/arrow/issues/21761)
+* DataFrame Interchange Protocol implementation and usage is now documented 
[GH-33980](https://github.com/apache/arrow/issues/33980)
+* Conversion between Arrow and Pandas for map/pydict now has enhanced support 
[GH-34729](https://github.com/apache/arrow/issues/34729)
+* Usability of `pc.map_lookup` / `MapLookupOptions` is improved 
[GH-36045](https://github.com/apache/arrow/issues/36045)
+* `zero_copy_only` keyword can now also be accepted in 
`ChunkedArray.to_numpy()` 
[GH-34787](https://github.com/apache/arrow/issues/34787)
+* Python C++ codebase now has linter support in Archery and the CI 
[GH-35485](https://github.com/apache/arrow/issues/35485)
+
+Relevant bug fixes:
+
+* `__array__` numpy conversion for Table and RecordBatch is now corrected so 
that `np.asarray(pa.Table)` doesn’t return a transposed result 
[GH-34886](https://github.com/apache/arrow/issues/34886)
+* `parquet.write_to_dataset` doesn't create empty files for non-observed 
dictionary (category) values anymore 
[GH-23870](https://github.com/apache/arrow/issues/23870)
+* Dataset writer now also correctly follows default Parquet version of 2.6 
[GH-36537](https://github.com/apache/arrow/issues/36537)
+* Comparing `pyarrow.dataset.Partitioning` with other type is now correctly 
handled [GH-36659](https://github.com/apache/arrow/issues/36659)
+* Pickling of pyarrow.dataset PartitioningFactory objects is now supported 
[GH-34884](https://github.com/apache/arrow/issues/34884)
+* None schema is now disallowed in parquet writer 
[GH-35858](https://github.com/apache/arrow/issues/35858)
+* `pa.FixedShapeTensorArray.to_numpy_ndarray` is not failing on sliced arrays 
[GH-35573](https://github.com/apache/arrow/issues/35573)
+* Halffloat type is now supported in the conversion from Arrow list to pandas 
[GH-36168](https://github.com/apache/arrow/issues/36168)
+* `__from_arrow__` is now also implemented for `Array.to_pandas` for pandas 
extension data types [GH-36096](https://github.com/apache/arrow/issues/36096)
+
+
+## R notes

Review Comment:
   ```suggestion
   ## R notes
   
   New features:
   
   * `open_dataset()` now works with ND-JSON files 
[GH-35055](https://github.com/apache/arrow/issues/35055)
   * Calling `schema()` on multiple Arrow objects now returns the object's 
schema [GH-35543](https://github.com/apache/arrow/issues/35543)
   * dplyr `.by`/`by` argument now supported in arrow implementation of dplyr 
verbs  [GH-35667](https://github.com/apache/arrow/issues/35667)
   
   Other improvements:
   * Convenience function `arrow_array()` can be used to create Arrow Arrays 
[GH-36381](https://github.com/apache/arrow/issues/36381)
   * Convenience function `scalar()` can be used to create Arrow Scalars  
[GH-36265](https://github.com/apache/arrow/issues/36265)
   * Prevent crashed when passing data between arrow and duckdb by always 
calling `RecordBatchReader::ReadNext()` from DuckDB from the main R thread 
[GH-36307](https://github.com/apache/arrow/issues/36307)
   * Issue a warning for `set_io_thread_count()` with `num_threads` < 2 
[GH-36304](https://github.com/apache/arrow/issues/36304)
   * Ensure missing grouping variables are added to the beginning of the 
variable list [GH-36305](https://github.com/apache/arrow/issues/36305)
   * CSV File reader options class objects can print the selected values 
[GH-35955](https://github.com/apache/arrow/issues/35955)
   * Schema metadata can be set as a named character vector 
[GH-35954](https://github.com/apache/arrow/issues/35954)
   * Ensure that the RStringViewer helper class does not own any Array 
references [GH-35812](https://github.com/apache/arrow/issues/35812)
   * `strptime()` in arrow will return a timezone-aware timestamp if `%z` is 
part of the format string 
[GH-35671](https://github.com/apache/arrow/issues/35671)
   * Column ordering when combining `group_by()` and `across()` now matches 
dplyr [GH-35473](https://github.com/apache/arrow/issues/35473)
   * Link to correct version of OpenSSL when using autobrew 
[GH-36551](https://github.com/apache/arrow/issues/36551)
   * Require cmake 3.16 in bundled build script 
[GH-36321](https://github.com/apache/arrow/issues/36321)
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow-site] thisisnic commented on a diff in pull request #382: [Website] Version 13.0.0 blog post

Reply via email to