amol- commented on a change in pull request #153:
URL: https://github.com/apache/arrow-site/pull/153#discussion_r741716257
##########
File path: _posts/2021-10-22-6.0.0-release.md
##########
@@ -0,0 +1,174 @@
+---
+layout: post
+title: "Apache Arrow 6.0.0 Release"
+date: "2021-10-22 00:00:00 -0600"
+author: pmc
+categories: [release]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+<!--
+
+To use this template:
+
+* Update all "XX" values with the appropriate numbers (you can get the resolved issues and contributors count from `_release/6.0.0.md`)
+* Fill in the various sections below. Note that the audience is the broader user community, not Arrow developers, so please write clearly using terms they will understand and care about. Delete any sections that don't have any content (as in, there are no changes to announce)
+* Delete this introductory comment
+
+ -->
+
+
+The Apache Arrow team is pleased to announce the 6.0.0 release. This covers
+over XX months of development work and includes [**XX resolved issues**][1]
+from [**XX distinct contributors**][2]. See the Install Page to learn how to
+get the libraries for your platform.
+
+The release notes below are not exhaustive and only expose selected highlights
+of the release. Many other bugfixes and improvements have been made: we refer
+you to the [complete changelog][3].
+
+## Community
+
+## Columnar Format Notes
+
+## Arrow Flight RPC notes
+
+GLib and Ruby have added bindings for Arrow Flight.
+
+While not part of the release, work is ongoing on Arrow Flight SQL, which defines a protocol for clients to communicate with SQL databases using Arrow Flight. For those interested in the project, please reach out on the [mailing list](https://arrow.apache.org/community/).
+## C++ notes
+
+The month-day-nano interval type has been added (ARROW-13628).
+
+Various APIs, including extension types and scalars, are no longer experimental (ARROW-5244).
+
+Support for Visual Studio 2015 was dropped (ARROW-14070).
+
+### Compute Layer
+
+A basic in-memory query engine has been implemented and is accessible from the R bindings. Operations including filter, project, sort, equality joins, and various aggregations are supported.
+
+The following compute functions have been added:
+
+* aggregate functions: `approximate_median`, `count_distinct`, `max`, `min`, `product`
+* hash aggregate functions: `hash_all`, `hash_any`, `hash_approximate_median`, `hash_count_distinct`, `hash_distinct`, `hash_max`, `hash_mean`, `hash_min`, `hash_product`, `hash_stdev`, `hash_variance`
+* scalar arithmetic functions: `logb`, `round`, `round_to_multiple`
+* scalar string functions: `ascii_capitalize`, `ascii_swapcase`, `ascii_title`, `utf8_capitalize`, `utf8_swapcase`, `utf8_title`
+* scalar temporal functions: `assume_timezone`, `day_time_interval_between`, `days_between`, `hours_between`, `microseconds_between`, `milliseconds_between`, `minutes_between`, `month_day_nano_interval_between`, `month_interval_between`, `nanoseconds_between`, `quarters_between`, `seconds_between`, `strftime`, `us_week`, `week`, `weeks_between`, `years_between`
+* other scalar functions: `choose`, `list_element`
+* vector functions: `drop_null`, `select_k_unstable`
+
+In general, type support has been improved for most of the compute functions, but work here is ongoing, particularly around
+ decimal support.
+
+Crashes have been fixed in particular cases for `take`, `filter`, `unique`, and `value_counts` (ARROW-13474, ARROW-13509, ARROW-14129).
+
+Hash aggregations (i.e. group by) support scalar and array values (ARROW-13737, ARROW-14027).
+
+Temporal functions are now timezone-aware (e.g. when extracting the hour of a timestamp) (ARROW-12980).
+
+`count` can optionally count all values, not just null or non-null values (ARROW-13574).
+
+`fill_null` has been replaced by the more general `coalesce` (ARROW-7179).
+
+`is_null` can optionally consider NaN as null (ARROW-12959).
+
+Sorting has been optimized (ARROW-10898, ARROW-14165). Also, null values can now be sorted at either the beginning or the end (ARROW-12063).
+
+### CSV
+
+The CSV reader can read time32 and time64 types, and will infer time32 values for columns in the format "hh:mm" and "hh:mm:ss" (ARROW-11243).
+
+The decimal point can be customized when reading (ARROW-13421).
+
+The streaming reader will not unintentionally infer null-typed columns when using the various skip options (ARROW-13441).
+
+If a row has an incorrect number of columns, the row can now be skipped instead of raising an error (ARROW-12673).
+
+The option `quoted_strings_can_be_null` applies to all column types now, not just strings (ARROW-13580). When quoting is disabled entirely, the reader now takes advantage of this to improve performance (ARROW-14150).
+
+A CSVWriter object is now exposed, allowing for incremental writing (ARROW-11828). Dates can now be written (ARROW-12540).
+
+### Dataset Layer
+
+The dataset writer was refactored, and now supports more options, including a limit on the number of files open at once, compatibility with the async scanner, a limit on the number of rows written per file, and control over what to do when files already exist in the target directory (ARROW-13650). Additionally, the query engine can feed into the dataset writer as a sink (ARROW-13542).
+
+The asynchronous scanner now properly respects backpressure (ARROW-13611, ARROW-14192), as does the writer (ARROW-14191).
+
+ORC datasets are supported (ARROW-13572) with support for column projection pushdown (ARROW-13797).
+
+The Parquet/IPC format readers now respect the batch_size scanner option (ARROW-14024). Also, the Parquet reader now properly implements readahead for better performance (ARROW-14026).
+
+### IO and Filesystem Layer
+
+The retry strategy of S3FileSystem can be customized (ARROW-13508). When writing to an existing bucket as a user with limited permissions, Arrow will no longer emit a spurious "Access Denied" error (ARROW-13685).
+
+On macOS with NFS mounts, a "[errno 25] Inappropriate ioctl for device" error was fixed (ARROW-13983).
+
+The basics of a Google Cloud Storage filesystem have been added; work is in progress for full support (ARROW-8147, ARROW-14222, ARROW-14223, ARROW-14232, ARROW-14236, ARROW-14345, ARROW-14157).
+
+### JSON
+
+A crash was fixed when duplicate keys were present (ARROW-14109).
+
+### Parquet
+
+Written min/max and null_count statistics for dictionary types were corrected (ARROW-11634, ARROW-12513).
+
+An error with large files when built with Thrift 0.14 was fixed (ARROW-13655).
+
+The ParquetVersion enum was updated with more values to support finer-grained Parquet format version selection (ARROW-13794).
+
+Writer performance was improved by avoiding repeated dynamic casts (ARROW-13965).
+## C# notes
+
+## Go notes
+
+## Java notes
+
+## JavaScript notes
+
+## Python notes
+* Many new `pyarrow.compute` functions are available (see the C++ notes above for more details), and introspection of the functions was improved so that they look more like standard Python functions.
+* All Python functions and classes should now have documented parameters in the API reference.
+* SIMD optimization is now enabled in M1 wheels.
+* Wheels are now built for more Python versions on M1 systems.
+* PyArrow is now compatible with Python 3.10.
+* Creating Arrow arrays now supports more than just numpy arrays as masks.
+* Printing Tables now previews the values in the columns.
+* `copy_files` is now available in Python.
+* Datasets now support ORC files.
+* Sets are now supported when building arrays or converting from pandas.
+* 39 bugs have been fixed.
+
+## R notes
+
+## Ruby and C GLib notes
+

Review comment:
@mrkn friendly reminder to update the Ruby release notes, as I think it's the last part pending for the blog post to be complete :)

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org