(datafusion-site) branch main updated: datafusion-python 46.0.0 announcement (#65)

timsaucer Mon, 07 Apr 2025 05:08:31 -0700

This is an automated email from the ASF dual-hosted git repository.

timsaucer pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/datafusion-site.git



The following commit(s) were added to refs/heads/main by this push:
     new 4df2e5d  datafusion-python 46.0.0 announcement (#65)
4df2e5d is described below

commit 4df2e5de3bb6c6ff9482e8acf90adf28b81d5daa
Author: Tim Saucer <timsau...@gmail.com>
AuthorDate: Mon Apr 7 08:07:10 2025 -0400

    datafusion-python 46.0.0 announcement (#65)
    
    * datafusion-python-44 announcement
    
    * Update title
    
    * Updating blog post with links, but still needs to add text. Also need to 
update author list near end
    
    * Make all links at least work so CI will pass
    
    * Adding additional text to the release announcement
    
    * Heading level wrong
    
    * Minor formatting to match other posts, fixed one link
    
    * Respond to suggestions from code review
    
    * Remove codehilite since it doesn't play well with hilight.js and the 
latter has broader language support
---
 .../blog/2025-03-30-datafusion-python-46.0.0.md    | 300 +++++++++++++++++++++
 .../python-datafusion-46.0.0/html_rendering.png    | Bin 0 -> 189126 bytes
 pelicanconf.py                                     |   1 -
 3 files changed, 300 insertions(+), 1 deletion(-)

diff --git a/content/blog/2025-03-30-datafusion-python-46.0.0.md 
b/content/blog/2025-03-30-datafusion-python-46.0.0.md
new file mode 100644
index 0000000..8252bbd
--- /dev/null
+++ b/content/blog/2025-03-30-datafusion-python-46.0.0.md
@@ -0,0 +1,300 @@
+---
+layout: post
+title: Apache DataFusion Python 46.0.0 Released
+date: 2025-03-30
+author: timsaucer
+categories: [release]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+
+We are happy to announce that [datafusion-python 46.0.0] has been released. 
This release
+brings in all of the new features of the core [DataFusion 46.0.0] library. 
Since the last
+blog post for [datafusion-python 43.1.0], a large number of improvements have 
been made
+that can be found in the [changelogs].
+
+We highly recommend reviewing the upstream [DataFusion 46.0.0] announcement.
+
+[DataFusion 46.0.0]: 
https://datafusion.apache.org/blog/2025/03/24/datafusion-46.0.0
+[datafusion-python 43.1.0]: 
https://datafusion.apache.org/blog/2024/12/14/datafusion-python-43.1.0/
+[datafusion-python 46.0.0]: https://pypi.org/project/datafusion/46.0.0/
+[changelogs]: 
https://github.com/apache/datafusion-python/tree/main/dev/changelog
+
+## Easier file reading
+
+In these releases we have introduced two new ways to more easily read files 
into
+DataFrames.
+
+PR [#982] introduced a series of easier read functions for Parquet, JSON, CSV, 
and
+AVRO files. This introduces a concept of a global context that is available by
+default when using these methods. Now instead of creating a default Session
+Context and then calling the read methods, you can simply import these
+alternative read methods and begin working with your DataFrames. The example
+below shows how simple this approach is.
+
+```python
+from datafusion.io import read_parquet
+df = read_parquet(path="./examples/tpch/data/customer.parquet")
+```
+
+PR [#980] adds a method for setting up a session context to use URL tables. 
With
+this enabled, you can use a path to a local file as a table name. An example
+of how to use this is demonstrated in the following snippet.
+
+```python
+import datafusion
+ctx = datafusion.SessionContext().enable_url_table()
+df = ctx.table("./examples/tpch/data/customer.parquet")
+```
+
+[#982]: https://github.com/apache/datafusion-python/pull/982
+[#980]: https://github.com/apache/datafusion-python/pull/980
+
+## Registering Table Views
+
+DataFusion supports registering a logical plan as a view with a session 
context. This
+allows creating views in one part of your workflow and passing the session
+context to other places where that logical plan can be reused. This is a 
useful
+feature for building up complex workflows and for code clarity. PR [#1016] 
enables this
+feature in `datafusion-python`.
+
+For example, supposing you have a DataFrame called `df1`, you could use this 
code snippet
+to register the view and then use it in another place:
+
+```python
+ctx.register_view("view1", df1)
+```
+
+And then in another portion of your code which has access to the same session 
context
+you can retrieve the DataFrame with:
+
+```python
+df2 = ctx.table("view1")
+```
+
+[#1016]: https://github.com/apache/datafusion-python/pull/1016
+
+## Asynchronous Iteration of Record Batches
+
+Retrieving a `RecordBatch` from a `RecordBatchStream` was a synchronous call, 
which would
+require the end user's code to wait for the data retrieval. This is described 
in
+[Issue 974]. We continue to support this as a synchronous iterator, but we 
have also added
+in the ability to retrieve the `RecordBatch` using the Python asynchronous 
`anext`
+function.
+
+[Issue 974]: https://github.com/apache/datafusion-python/issues/974
+
+## Default ZSTD Compression for Parquet files
+
+With PR [#981], we change the saving of Parquet files to use zstd compression 
by default.
+Previously the default was uncompressed, causing excessive disk storage. Zstd 
is an
+excellent compression scheme that balances speed and compression ratio. Users 
can still
+save their Parquet files uncompressed by passing in the appropriate value to 
the
+`compression` argument when calling `DataFrame.write_parquet`.
+
+[#981]: https://github.com/apache/datafusion-python/pull/981
+
+## UDF Decorators
+
+In PRs [#1040] and [#1061] we add methods to make creating user defined 
functions
+easier and take advantage of Python decorators. With these PRs you can save a 
step
+from defining a method and then defining a udf of that method. Instead you can
+simply add the appropriate `udf` decorator. Similar methods exist for aggregate
+and window user defined functions.
+
+```python
+@udf([pa.int64(), pa.int64()], pa.bool_(), "stable")
+def my_custom_function(
+    age: pa.Array,
+    favorite_number: pa.Array,
+) -> pa.Array:
+    pass
+```
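
For readers less familiar with decorators that take arguments, the sketch below shows the general pattern of defining and registering a function in a single step. The `udf` here is a hypothetical stand-in written for illustration. It is not the `datafusion-python` implementation, and it uses plain strings and lists in place of `pyarrow` types and arrays.

```python
import functools


def udf(input_types, return_type, volatility):
    """Hypothetical stand-in showing the decorator-with-arguments pattern."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args):
            return func(*args)

        # Attach the metadata that would otherwise be supplied in a separate step.
        wrapper.input_types = input_types
        wrapper.return_type = return_type
        wrapper.volatility = volatility
        return wrapper

    return decorator


@udf(["int64", "int64"], "bool", "stable")
def is_favorite(age, favorite_number):
    # Element-wise comparison over two plain Python lists.
    return [a == f for a, f in zip(age, favorite_number)]


result = is_favorite([30, 41], [30, 7])
```

The decorated name `is_favorite` refers to the wrapped object directly, so there is no separate call that turns the function into a UDF afterwards.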
+
+[#1040]: https://github.com/apache/datafusion-python/pull/1040
+[#1061]: https://github.com/apache/datafusion-python/pull/1061
+
+
+## `uv` package management
+
+[uv] is an extremely fast Python package manager, written in Rust. In the 
previous version
+of `datafusion-python` we had a combination of settings for PyPI and Conda. 
Instead, we
+now use [uv] as our primary method for dependency management.
+
+For most users of DataFusion, this change will be transparent. You can still 
install
+via `pip` or `conda`. For developers, the instructions in the repository have 
been updated.
+
+[uv]: https://github.com/astral-sh/uv
+
+## Code cleanup
+
+In an effort to improve our code cleanliness and ensure we are following 
Python best
+practices, we use [ruff] to perform Python linting. Until now we enabled only 
a portion
+of the available linters. In PRs [#1055] and [#1062], we enabled many 
more
+of these linters and made code improvements to ensure we are following these
+recommendations.
+
+[ruff]: https://docs.astral.sh/ruff/
+[#1055]: https://github.com/apache/datafusion-python/pull/1055
+[#1062]: https://github.com/apache/datafusion-python/pull/1062
+
+## Improved Jupyter Notebook rendering
+
+Since PR [#839] in DataFusion 41.0.0 we have been able to render DataFrames 
using html in
+[jupyter] notebooks. This is a big improvement over the `show` command when we 
have the
+ability to render tables. In PR [#1036] we went a step further and added in a 
variety
+of features.
+
+- Now html tables are scrollable, vertically and horizontally.
+- When data are truncated, we report this to the user.
+- Instead of showing a small number of rows, we collect up to 2 megabytes of 
data to
+display. Since we have scrollable tables, we are able to make more data 
available
+to the user without sacrificing notebook usability.
+- We report explicitly when the DataFrame is empty. Previously we would not 
output
+anything for an empty table. This indicator is helpful to users to ensure 
their plans
+are written correctly. Sometimes a non-output can be overlooked.
+- For long output of data, we generate a collapsed view of the data with an 
option
+for the user to click on it to expand the data.
+
+In the view below you can see an example of some of these features, such as the
+expandable text and scroll bars.
+
+<figure style="text-align: center;">
+  <img 
+    src="/blog/images/python-datafusion-46.0.0/html_rendering.png" 
+    width="100%"
+    class="img-responsive"
+    alt="Fig 1: Example html rendering in a jupyter notebook."
+  >
+  <figcaption>
+   <b>Figure 1</b>: With the html rendering enhancements, tables are more 
easily
+   viewable in jupyter notebooks.
+</figcaption>
+</figure>
+
+[jupyter]: https://jupyter.org/
+[#839]: https://github.com/apache/datafusion-python/pull/839
+[#1036]: https://github.com/apache/datafusion-python/pull/1036
+
+## Extension Documentation
+
+We have recently added [Extension Documentation] to the DataFusion in Python 
website. We
+have received many requests about how to better understand how to integrate 
DataFusion
+in Python with other Rust libraries. To address these questions we wrote an 
article about
+some of the difficulties that we encounter when using Rust libraries in Python 
and our
+approach to addressing them.
+
+[Extension Documentation]: 
https://datafusion.apache.org/python/contributor-guide/ffi.html
+
+## Migration Guide
+
+During the upgrade from [DataFusion 43.0.0] to [DataFusion 44.0.0] as our 
upstream core
+dependency, we discovered a few changes were necessary within our repository 
and our
+unit tests. These notes serve to help guide users who may encounter similar 
issues when
+upgrading.
+
+- `RuntimeConfig` is now deprecated in favor of `RuntimeEnvBuilder`. The 
migration is
+fairly straightforward, and the corresponding classes have been marked as 
deprecated. For
+end users it should be simply a matter of changing the class name.
+- If you perform a `concat` of a `string_view` and `string`, it will now 
return a
+`string_view` instead of a `string`. This likely only impacts unit tests that 
are validating
+return types. In general, it is recommended to switch to using `string_view` 
whenever 
+possible. You can see the blog articles [String View Pt 1] and [Pt 2] for more 
information
+on these performance improvements.
+- The function `date_part` now returns an `int32` instead of a `float64`. This 
is likely
+only impactful to unit tests.
+- We have upgraded the Python minimum version to 3.9 since 3.8 is no longer 
officially
+supported.
+
+[DataFusion 43.0.0]: 
https://github.com/apache/datafusion/blob/main/dev/changelog/43.0.0.md
+[DataFusion 44.0.0]: 
https://github.com/apache/datafusion/blob/main/dev/changelog/44.0.0.md
+[String View Pt 1]: 
https://datafusion.apache.org/blog/2024/09/13/string-view-german-style-strings-part-1/
+[Pt 2]: 
https://datafusion.apache.org/blog/2024/09/13/string-view-german-style-strings-part-2/
+
+## Coming Soon
+
+There is a lot of excitement around the upcoming work. This list is not 
comprehensive, but
+a glimpse into some of the upcoming work includes:
+
+- Reusable DataFusion UDFs: The way user defined functions are currently 
written in
+`datafusion-python` is slightly different from those written for the upstream 
Rust
+`datafusion`. The core ideas are usually the same, but it means it takes 
effort for users
+to re-implement functions already written for Rust projects to be usable in 
Python. Issue
+[#1017] addresses this topic. Work is well underway to make it easier to 
expose these
+user functions through the FFI boundary. This means that the work that already 
exists in
+repositories such as those found in the [datafusion-contrib] project can be 
easily
+re-used in Python. This will provide a low effort way to expose significant 
functionality
+to the DataFusion in Python community.
+- Additional table providers: We have work well underway to provide a host of 
table providers
+to `datafusion-python` including: sqlite, duckdb, postgres, odbc, and mysql! In
+[datafusion-contrib #279] we track the progress of this excellent work. Once 
complete, users
+will be able to `pip install` this library and get easy access to all of these 
table
+providers. This is another way we are leveraging the FFI work to greatly 
expand the usability
+of `datafusion-python` with relatively low effort.
+- External catalog and schema providers: For users who wish to go beyond table 
providers
+and have an entire custom catalog with schema, Issue [#1091] tracks the 
progress of exposing
+this in Python. With this work, if you have already written a Rust based table 
catalog you
+will be able to interface it in Python similar to the work described for the 
table
+providers above.
+
+This is only a sample of the great work that is being done. If there are 
features you would
+love to see, we encourage you to open an issue and join us as we build 
something wonderful.
+
+[#1017]: https://github.com/apache/datafusion-python/issues/1017
+[datafusion-contrib #279]: 
https://github.com/datafusion-contrib/datafusion-table-providers/issues/279
+[#1091]: https://github.com/apache/datafusion-python/issues/1091
+[datafusion-contrib]: https://github.com/datafusion-contrib
+
+## Appreciation
+
+We would like to thank everyone who has helped with these releases through 
their helpful
+conversations, code review, issue descriptions, and code authoring. We would 
especially
+like to thank the following authors of PRs who made these releases possible, 
listed in
+alphabetical order by username: [@chenkovsky], [@CrystalZhou0529], 
[@ion-elgreco],
+[@jsai28], [@kevinjqliu], [@kosiew], [@kylebarron], [@nirnayroy], and 
[@Spaarsh].
+
+Thank you!
+
+[@chenkovsky]: https://github.com/chenkovsky
+[@CrystalZhou0529]: https://github.com/CrystalZhou0529
+[@ion-elgreco]: https://github.com/ion-elgreco
+[@jsai28]: https://github.com/jsai28
+[@kevinjqliu]: https://github.com/kevinjqliu
+[@kylebarron]: https://github.com/kylebarron
+[@kosiew]: https://github.com/kosiew
+[@nirnayroy]: https://github.com/nirnayroy
+[@Spaarsh]: https://github.com/Spaarsh
+
+## Get Involved
+
+The DataFusion Python team is an active and engaging community and we would 
love
+to have you join us and help the project.
+
+Here are some ways to get involved:
+
+* Learn more by visiting the [DataFusion Python project] page.
+
+* Try out the project and provide feedback, file issues, and contribute code.
+
+* Join us on [ASF Slack] or the [Arrow Rust Discord Server].
+
+[DataFusion Python project]: https://datafusion.apache.org/python/index.html
+[ASF Slack]: https://s.apache.org/slack-invite
+[Arrow Rust Discord Server]: https://discord.gg/Qw5gKqHxUM
diff --git a/content/images/python-datafusion-46.0.0/html_rendering.png 
b/content/images/python-datafusion-46.0.0/html_rendering.png
new file mode 100644
index 0000000..c2a2363
Binary files /dev/null and 
b/content/images/python-datafusion-46.0.0/html_rendering.png differ
diff --git a/pelicanconf.py b/pelicanconf.py
index 82e9132..de91d11 100644
--- a/pelicanconf.py
+++ b/pelicanconf.py
@@ -67,7 +67,6 @@ FEED_RSS = "feed.xml"
 
 MARKDOWN = {
     'extension_configs': {
-        'markdown.extensions.codehilite': {'linenums': False},
         'markdown.extensions.fenced_code': {},
     },
     'output_format': 'html5',


