This is an automated email from the ASF dual-hosted git repository. timsaucer pushed a commit to branch main in repository https://gitbox.apache.org/repos/asf/datafusion-site.git
The following commit(s) were added to refs/heads/main by this push:
     new 4df2e5d  datafusion-python 46.0.0 announcement (#65)
4df2e5d is described below

commit 4df2e5de3bb6c6ff9482e8acf90adf28b81d5daa
Author: Tim Saucer <timsau...@gmail.com>
AuthorDate: Mon Apr 7 08:07:10 2025 -0400

    datafusion-python 46.0.0 announcement (#65)

    * datafusion-python-44 announcement

    * Update title

    * Updating blog post with links, but still needs to add text. Also need to update author list near end

    * Make all links at least work so CI will pass

    * Adding additional text to the release announcement

    * Heading level wrong

    * Minor formatting to match other posts, fixed one link

    * Respond to suggestions from code review

    * Remove codehilite since it doesn't play well with hilight.js and the latter has broader language support
---
 .../blog/2025-03-30-datafusion-python-46.0.0.md    | 300 +++++++++++++++++++++
 .../python-datafusion-46.0.0/html_rendering.png    | Bin 0 -> 189126 bytes
 pelicanconf.py                                     |   1 -
 3 files changed, 300 insertions(+), 1 deletion(-)

diff --git a/content/blog/2025-03-30-datafusion-python-46.0.0.md b/content/blog/2025-03-30-datafusion-python-46.0.0.md
new file mode 100644
index 0000000..8252bbd
--- /dev/null
+++ b/content/blog/2025-03-30-datafusion-python-46.0.0.md
@@ -0,0 +1,300 @@
+---
+layout: post
+title: Apache DataFusion Python 46.0.0 Released
+date: 2025-03-30
+author: timsaucer
+categories: [release]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.
You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+
+We are happy to announce that [datafusion-python 46.0.0] has been released. This release
+brings in all of the new features of the core [DataFusion 46.0.0] library. Since the last
+blog post for [datafusion-python 43.1.0], a large number of improvements have been made
+that can be found in the [changelogs].
+
+We highly recommend reviewing the upstream [DataFusion 46.0.0] announcement.
+
+[DataFusion 46.0.0]: https://datafusion.apache.org/blog/2025/03/24/datafusion-46.0.0
+[datafusion-python 43.1.0]: https://datafusion.apache.org/blog/2024/12/14/datafusion-python-43.1.0/
+[datafusion-python 46.0.0]: https://pypi.org/project/datafusion/46.0.0/
+[changelogs]: https://github.com/apache/datafusion-python/tree/main/dev/changelog
+
+## Easier file reading
+
+In these releases we have introduced two new ways to more easily read files into
+DataFrames.
+
+PR [#982] introduced a series of easier read functions for Parquet, JSON, CSV, and
+Avro files. This introduces a concept of a global context that is available by
+default when using these methods. Now instead of creating a default session
+context and then calling its read methods, you can simply import these
+alternative read functions and begin working with your DataFrames. Below is an example of
+how easy this new approach is to use.
+
+```python
+from datafusion.io import read_parquet
+df = read_parquet(path="./examples/tpch/data/customer.parquet")
+```
+
+PR [#980] adds a method for setting up a session context to use URL tables.
With
+this enabled, you can use a path to a local file as a table name. An example
+of how to use this is demonstrated in the following snippet.
+
+```python
+import datafusion
+ctx = datafusion.SessionContext().enable_url_table()
+df = ctx.table("./examples/tpch/data/customer.parquet")
+```
+
+[#982]: https://github.com/apache/datafusion-python/pull/982
+[#980]: https://github.com/apache/datafusion-python/pull/980
+
+## Registering Table Views
+
+DataFusion supports registering a logical plan as a view with a session context. This
+allows creating views in one part of your workflow and passing the session
+context to other places where that logical plan can be reused. This is a useful
+feature for building up complex workflows and for code clarity. PR [#1016] enables this
+feature in `datafusion-python`.
+
+For example, supposing you have a DataFrame called `df1`, you could use this code snippet
+to register the view and then use it in another place:
+
+```python
+ctx.register_view("view1", df1)
+```
+
+And then in another portion of your code which has access to the same session context
+you can retrieve the DataFrame with:
+
+```python
+df2 = ctx.table("view1")
+```
+
+[#1016]: https://github.com/apache/datafusion-python/pull/1016
+
+## Asynchronous Iteration of Record Batches
+
+Retrieving a `RecordBatch` from a `RecordBatchStream` was a synchronous call, which would
+require the end user's code to wait for the data retrieval. This is described in
+[Issue 974]. We continue to support this as a synchronous iterator, but we have also added
+the ability to retrieve the `RecordBatch` using the Python asynchronous `anext`
+function.
+
+[Issue 974]: https://github.com/apache/datafusion-python/issues/974
+
+## Default ZSTD Compression for Parquet files
+
+With PR [#981], we change the saving of Parquet files to use zstd compression by default.
+Previously the default was uncompressed, causing excessive disk storage.
Zstd is an
+excellent compression scheme that balances speed and compression ratio. Users can still
+save their Parquet files uncompressed by passing in the appropriate value to the
+`compression` argument when calling `DataFrame.write_parquet`.
+
+[#981]: https://github.com/apache/datafusion-python/pull/981
+
+## UDF Decorators
+
+In PRs [#1040] and [#1061] we add methods to make creating user defined functions
+easier and take advantage of Python decorators. With these PRs you can skip the separate
+step of defining a function and then defining a UDF from that function. Instead you can
+simply add the appropriate `udf` decorator. Similar methods exist for aggregate
+and window user defined functions.
+
+```python
+@udf([pa.int64(), pa.int64()], pa.bool_(), "stable")
+def my_custom_function(
+    age: pa.Array,
+    favorite_number: pa.Array,
+) -> pa.Array:
+    pass
+```
+
+[#1040]: https://github.com/apache/datafusion-python/pull/1040
+[#1061]: https://github.com/apache/datafusion-python/pull/1061
+
+
+## `uv` package management
+
+[uv] is an extremely fast Python package manager, written in Rust. In the previous version
+of `datafusion-python` we had a combination of settings for PyPI and Conda. Instead, we
+switch to using [uv] as our primary method for dependency management.
+
+For most users of DataFusion, this change will be transparent. You can still install
+via `pip` or `conda`. For developers, the instructions in the repository have been updated.
+
+[uv]: https://github.com/astral-sh/uv
+
+## Code cleanup
+
+In an effort to improve our code cleanliness and ensure we are following Python best
+practices, we use [ruff] to perform Python linting. Until now we had enabled only a portion
+of the available linters. In PRs [#1055] and [#1062], we enabled many more
+of these linters and made code improvements to ensure we are following these
+recommendations.
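
To make "enabling more linters" a little more concrete, a ruff rule selection in `pyproject.toml` might look like the fragment below. This is only a sketch: it assumes ruff is configured through `pyproject.toml`, and the rule families and ignored codes are chosen for illustration, not taken from the PRs mentioned above.

```toml
# pyproject.toml (illustrative only; not the project's actual configuration)
[tool.ruff.lint]
# Opt in to several rule families rather than a small hand-picked subset:
# pycodestyle errors (E), Pyflakes (F), flake8-bugbear (B), isort (I),
# pyupgrade (UP), and pydocstyle (D).
select = ["E", "F", "B", "I", "UP", "D"]
# Individual rules that do not suit the code base can still be switched off.
ignore = ["D203", "D213"]
```

With a selection like this in place, running `ruff check .` from the repository root reports any violations of the chosen rules.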
+
+[ruff]: https://docs.astral.sh/ruff/
+[#1055]: https://github.com/apache/datafusion-python/pull/1055
+[#1062]: https://github.com/apache/datafusion-python/pull/1062
+
+## Improved Jupyter Notebook rendering
+
+Since PR [#839] in DataFusion 41.0.0 we have been able to render DataFrames using html in
+[jupyter] notebooks. This is a big improvement over the `show` command when we have the
+ability to render tables. In PR [#1036] we went a step further and added in a variety
+of features.
+
+- Now html tables are scrollable, vertically and horizontally.
+- When data are truncated, we report this to the user.
+- Instead of showing a small number of rows, we collect up to 2 megabytes of data to
+display. Since we have scrollable tables, we are able to make more data available
+to the user without sacrificing notebook usability.
+- We report explicitly when the DataFrame is empty. Previously we would not output
+anything for an empty table. This indicator is helpful to users to ensure their plans
+are written correctly. Sometimes a non-output can be overlooked.
+- For long output of data, we generate a collapsed view of the data with an option
+for the user to click on it to expand the data.
+
+In the below view you can see an example of some of these features such as the
+expandable text and scroll bars.
+
+<figure style="text-align: center;">
+  <img
+    src="/blog/images/python-datafusion-46.0.0/html_rendering.png"
+    width="100%"
+    class="img-responsive"
+    alt="Fig 1: Example html rendering in a jupyter notebook."
+  >
+  <figcaption>
+    <b>Figure 1</b>: With the html rendering enhancements, tables are more easily
+    viewable in jupyter notebooks.
+</figcaption>
+</figure>
+
+[jupyter]: https://jupyter.org/
+[#839]: https://github.com/apache/datafusion-python/pull/839
+[#1036]: https://github.com/apache/datafusion-python/pull/1036
+
+## Extension Documentation
+
+We have recently added [Extension Documentation] to the DataFusion in Python website.
We
+have received many requests for help understanding how to integrate DataFusion
+in Python with other Rust libraries. To address these questions we wrote an article about
+some of the difficulties that we encounter when using Rust libraries in Python and our
+approach to addressing them.
+
+[Extension Documentation]: https://datafusion.apache.org/python/contributor-guide/ffi.html
+
+## Migration Guide
+
+During the upgrade from [DataFusion 43.0.0] to [DataFusion 44.0.0] as our upstream core
+dependency, we discovered a few changes were necessary within our repository and our
+unit tests. These notes serve to help guide users who may encounter similar issues when
+upgrading.
+
+- `RuntimeConfig` is now deprecated in favor of `RuntimeEnvBuilder`. The migration is
+fairly straightforward, and the corresponding classes have been marked as deprecated. For
+end users it should be simply a matter of changing the class name.
+- If you perform a `concat` of a `string_view` and `string`, it will now return a
+`string_view` instead of a `string`. This likely only impacts unit tests that are validating
+return types. In general, it is recommended to switch to using `string_view` whenever
+possible. You can see the blog articles [String View Pt 1] and [Pt 2] for more information
+on these performance improvements.
+- The function `date_part` now returns an `int32` instead of a `float64`. This is likely
+only impactful to unit tests.
+- We have upgraded the Python minimum version to 3.9 since 3.8 is no longer officially
+supported.
+
+[DataFusion 43.0.0]: https://github.com/apache/datafusion/blob/main/dev/changelog/43.0.0.md
+[DataFusion 44.0.0]: https://github.com/apache/datafusion/blob/main/dev/changelog/44.0.0.md
+[String View Pt 1]: https://datafusion.apache.org/blog/2024/09/13/string-view-german-style-strings-part-1/
+[Pt 2]: https://datafusion.apache.org/blog/2024/09/13/string-view-german-style-strings-part-2/
+
+## Coming Soon
+
+There is a lot of excitement around the upcoming work. This list is not comprehensive, but
+a glimpse into some of the upcoming work includes:
+
+- Reusable DataFusion UDFs: The way user defined functions are currently written in
+`datafusion-python` is slightly different from those written for the upstream Rust
+`datafusion`. The core ideas are usually the same, but it means it takes effort for users
+to re-implement functions already written for Rust projects to be usable in Python. Issue
+[#1017] addresses this topic. Work is well underway to make it easier to expose these
+user functions through the FFI boundary. This means that the work that already exists in
+repositories such as those found in the [datafusion-contrib] project can be easily
+re-used in Python. This will provide a low effort way to expose significant functionality
+to the DataFusion in Python community.
+- Additional table providers: We have work well underway to provide a host of table providers
+to `datafusion-python` including: sqlite, duckdb, postgres, odbc, and mysql! In
+[datafusion-contrib #279] we track the progress of this excellent work. Once complete, users
+will be able to `pip install` this library and get easy access to all of these table
+providers. This is another way we are leveraging the FFI work to greatly expand the usability
+of `datafusion-python` with relatively low effort.
+- External catalog and schema providers: For users who wish to go beyond table providers
+and have an entire custom catalog with schema, Issue [#1091] tracks the progress of exposing
+this in Python. With this work, if you have already written a Rust based table catalog you
+will be able to interface it in Python similar to the work described for the table
+providers above.
+
+This is only a sample of the great work that is being done. If there are features you would
+love to see, we encourage you to open an issue and join us as we build something wonderful.
+
+[#1017]: https://github.com/apache/datafusion-python/issues/1017
+[datafusion-contrib #279]: https://github.com/datafusion-contrib/datafusion-table-providers/issues/279
+[#1091]: https://github.com/apache/datafusion-python/issues/1091
+[datafusion-contrib]: https://github.com/datafusion-contrib
+
+## Appreciation
+
+We would like to thank everyone who has helped with these releases through their helpful
+conversations, code review, issue descriptions, and code authoring. We would especially
+like to thank the following authors of PRs who made these releases possible, listed in
+alphabetical order by username: [@chenkovsky], [@CrystalZhou0529], [@ion-elgreco],
+[@jsai28], [@kevinjqliu], [@kylebarron], [@kosiew], [@nirnayroy], and [@Spaarsh].
+
+Thank you!
+
+[@chenkovsky]: https://github.com/chenkovsky
+[@CrystalZhou0529]: https://github.com/CrystalZhou0529
+[@ion-elgreco]: https://github.com/ion-elgreco
+[@jsai28]: https://github.com/jsai28
+[@kevinjqliu]: https://github.com/kevinjqliu
+[@kylebarron]: https://github.com/kylebarron
+[@kosiew]: https://github.com/kosiew
+[@nirnayroy]: https://github.com/nirnayroy
+[@Spaarsh]: https://github.com/Spaarsh
+
+## Get Involved
+
+The DataFusion Python team is an active and engaging community and we would love
+to have you join us and help the project.
+
+Here are some ways to get involved:
+
+* Learn more by visiting the [DataFusion Python project] page.
+
+* Try out the project and provide feedback, file issues, and contribute code.
+
+* Join us on [ASF Slack] or the [Arrow Rust Discord Server].
+
+[DataFusion Python project]: https://datafusion.apache.org/python/index.html
+[ASF Slack]: https://s.apache.org/slack-invite
+[Arrow Rust Discord Server]: https://discord.gg/Qw5gKqHxUM
diff --git a/content/images/python-datafusion-46.0.0/html_rendering.png b/content/images/python-datafusion-46.0.0/html_rendering.png
new file mode 100644
index 0000000..c2a2363
Binary files /dev/null and b/content/images/python-datafusion-46.0.0/html_rendering.png differ
diff --git a/pelicanconf.py b/pelicanconf.py
index 82e9132..de91d11 100644
--- a/pelicanconf.py
+++ b/pelicanconf.py
@@ -67,7 +67,6 @@ FEED_RSS = "feed.xml"
 MARKDOWN = {
     'extension_configs': {
-        'markdown.extensions.codehilite': {'linenums': False},
         'markdown.extensions.fenced_code': {},
     },
     'output_format': 'html5',