This is an automated email from the ASF dual-hosted git repository.
agrove pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/datafusion-site.git
The following commit(s) were added to refs/heads/main by this push:
new 690c888 DataFusion Python 43.1.0 announcement (#43)
690c888 is described below
commit 690c88824e0cc9a1ab02da9cf0c7b7f70d6dfd6b
Author: Tim Saucer <[email protected]>
AuthorDate: Fri Dec 20 20:30:31 2024 -0500
DataFusion Python 43.1.0 announcement (#43)
* Initial commit for df-python 43.1 announcement
* Change the date on the post
* Update font size on blog index page
* Update content/blog/2024-12-14-datafusion-python-43.1.0.md
Co-authored-by: Andy Grove <[email protected]>
---------
Co-authored-by: Tim Saucer <[email protected]>
Co-authored-by: Andy Grove <[email protected]>
---
.gitignore | 3 +-
.../blog/2024-12-14-datafusion-python-43.1.0.md | 199 +++++++++++++++++++++
content/css/blog_index.css | 10 +-
3 files changed, 208 insertions(+), 4 deletions(-)
diff --git a/.gitignore b/.gitignore
index 58a38dc..c4087ad 100644
--- a/.gitignore
+++ b/.gitignore
@@ -6,5 +6,6 @@ _site
vendor
.DS_Store
output
-blog
+/blog/
+!*/blog/
diff --git a/content/blog/2024-12-14-datafusion-python-43.1.0.md b/content/blog/2024-12-14-datafusion-python-43.1.0.md
new file mode 100644
index 0000000..18bfa2b
--- /dev/null
+++ b/content/blog/2024-12-14-datafusion-python-43.1.0.md
@@ -0,0 +1,199 @@
+---
+layout: post
+title: Apache DataFusion Python 43.1.0 Released
+date: 2024-12-14
+author: timsaucer
+categories: [release]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+We are happy to announce that [datafusion-python 43.1.0] has been released. This release
+brings in all of the new features of the core [DataFusion 43.0.0] library. Since the last
+blog post for [datafusion-python 40.1.0], a large number of improvements have been made
+that can be found in the [changelogs].
+
+We would like to point out four features that are particularly noteworthy.
+
+- Arrow PyCapsule import and export
+- User-Defined Window Functions
+- Foreign Table Providers
+- String View performance enhancements
+
+[DataFusion 43.0.0]: https://github.com/apache/datafusion/blob/main/dev/changelog/43.0.0.md
+[datafusion-python 43.1.0]: https://pypi.org/project/datafusion/43.1.0/
+[datafusion-python 40.1.0]: https://datafusion.apache.org/blog/2024/08/20/python-datafusion-40.0.0/
+[changelogs]: https://github.com/apache/datafusion-python/tree/main/dev/changelog
+
+## Arrow PyCapsule import and export
+
+Arrow has a stable C interface for moving data between different libraries, but difficulties
+sometimes arise when different Python libraries expose this interface through different
+methods, requiring developers to write function calls for each library they are attempting
+to work with. A better approach is to use the [Arrow PyCapsule Interface], which gives a
+consistent method for exposing these data structures across libraries.
+
+In [PR #825], we introduced support for both importing and exporting Arrow data in
+`datafusion-python`. With this improvement, you can now use a single function call to import
+a table from **any** Python library that implements the [Arrow PyCapsule Interface].
+Many popular libraries, such as [Pandas](https://pandas.pydata.org/) and [Polars](https://pola.rs/),
+already support these interfaces.
+
+Suppose you have Pandas and Polars DataFrames named `df_pandas` and `df_polars`, respectively:
+
+```python
+from datafusion import SessionContext
+
+ctx = SessionContext()
+df_dfn1 = ctx.from_arrow(df_pandas)
+df_dfn1.show()
+
+df_dfn2 = ctx.from_arrow(df_polars)
+df_dfn2.show()
+```
+
+One great thing about using this interface is that any new library that adopts
+these stable interfaces will work out of the box with DataFusion!
+
+Additionally, DataFusion DataFrames can be exported via the PyCapsule interface. For example,
+to convert a DataFrame to a PyArrow table, simply call `pa.table`:
+```python
+import pyarrow as pa
+table = pa.table(df)
+```
+
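As a rough illustration of what "implements the Arrow PyCapsule Interface" means in practice, the sketch below checks an object for the dunder methods named in the PyCapsule specification. The helper `arrow_capsule_methods` and the class `FakeStreamSource` are hypothetical names invented for this example; they are not part of `datafusion-python` or any Arrow library.

```python
# Dunder methods defined by the Arrow PyCapsule Interface specification.
ARROW_CAPSULE_METHODS = (
    "__arrow_c_schema__",
    "__arrow_c_array__",
    "__arrow_c_stream__",
)


def arrow_capsule_methods(obj):
    """Return the names of the PyCapsule protocol methods that `obj` exposes."""
    return [
        name
        for name in ARROW_CAPSULE_METHODS
        if callable(getattr(obj, name, None))
    ]


class FakeStreamSource:
    """Hypothetical stand-in for a library object that exports an Arrow stream."""

    def __arrow_c_stream__(self, requested_schema=None):
        # A real implementation would return a PyCapsule wrapping an
        # ArrowArrayStream; this placeholder only marks the method as present.
        raise NotImplementedError("illustration only")


print(arrow_capsule_methods(FakeStreamSource()))  # ['__arrow_c_stream__']
print(arrow_capsule_methods(object()))            # []
```

A consumer only needs to look for these methods rather than special-casing each producing library, which is why new libraries that adopt the interface can interoperate without extra glue code.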
+[Arrow PyCapsule Interface]: https://arrow.apache.org/docs/format/CDataInterface/PyCapsuleInterface.html
+[PR #825]: https://github.com/apache/datafusion-python/pull/825
+
+## User-Defined Window Functions
+
+In `datafusion-python 42.0.0` we released support for User-Defined Window Functions in [PR #880].
+For a detailed description of how these work, please see the online documentation for
+all [user-defined functions]. Additionally, the [examples folder] contains a complete
+example demonstrating the four different modes of operation of window functions
+within DataFusion.
+
+[PR #880]: https://github.com/apache/datafusion-python/pull/880
+[user-defined functions]: https://datafusion.apache.org/python/user-guide/common-operations/udf-and-udfa.html
+[examples folder]: https://github.com/apache/datafusion-python/tree/main/examples
+
+## Foreign Table Providers
+
+In the core [DataFusion 43.0.0] release, support was added for a Foreign Function
+Interface to table providers. This creates a stable way of sharing functionality
+across different libraries, similar to how the [Arrow C data interface] operates. This
+enables libraries such as [delta lake] and [datafusion-contrib] to write their own
+table providers in Rust and expose them in Python without requiring a Rust dependency
+on `datafusion-python`. This is important because it allows these libraries to
+operate with `datafusion-python` regardless of which version of `datafusion` they
+were built against.
+
+Implementing this feature in a table provider is quite simple. There is a complete
+example in the [examples folder], but the relevant code is here, exposed as a
+Python function via [pyo3]:
+
+```rust
+    fn __datafusion_table_provider__<'py>(
+        &self,
+        py: Python<'py>,
+    ) -> PyResult<Bound<'py, PyCapsule>> {
+        let name = CString::new("datafusion_table_provider").unwrap();
+
+        let provider = self
+            .create_table()
+            .map_err(|e| PyRuntimeError::new_err(e.to_string()))?;
+        let provider = FFI_TableProvider::new(Arc::new(provider), false);
+
+        PyCapsule::new_bound(py, provider, Some(name.clone()))
+    }
+```
+
+That's it! All of the work of converting the table provider to use the FFI interface
+is performed by the core library.
+
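On the Python side, an object exposing `__datafusion_table_provider__` can then be handed to a session context. The sketch below assumes the context offers a `register_table_provider(name, provider)` method, as used in the project's FFI example; `register_foreign_table`, `RecordingContext`, and `FakeProvider` are hypothetical names invented here so the snippet runs without a compiled Rust extension.

```python
def register_foreign_table(ctx, name, provider):
    """Register `provider` under `name` after checking it exposes the FFI hook."""
    if not callable(getattr(provider, "__datafusion_table_provider__", None)):
        raise TypeError(
            f"{type(provider).__name__} does not define __datafusion_table_provider__"
        )
    # Assumption: `ctx` behaves like datafusion.SessionContext here.
    ctx.register_table_provider(name, provider)


class RecordingContext:
    """Hypothetical stand-in for SessionContext that only records registrations."""

    def __init__(self):
        self.tables = {}

    def register_table_provider(self, name, provider):
        self.tables[name] = provider


class FakeProvider:
    """Hypothetical stand-in for a Rust-backed provider built with pyo3."""

    def __datafusion_table_provider__(self):
        # A real provider returns a PyCapsule named "datafusion_table_provider".
        raise NotImplementedError("illustration only")


ctx = RecordingContext()
register_foreign_table(ctx, "my_table", FakeProvider())
print(sorted(ctx.tables))  # ['my_table']
```

With a real `SessionContext` and a real provider, the registered table could then be queried like any other table in the context.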
+[Arrow C data interface]: https://arrow.apache.org/docs/format/CDataInterface.html
+[PR #921]: https://github.com/apache/datafusion-python/pull/921
+[delta lake]: https://delta.io/docs/
+[datafusion-contrib]: https://github.com/datafusion-contrib/datafusion-table-providers
+[pyo3]: https://pyo3.rs/
+
+## String View performance enhancements
+
+In the core [DataFusion 43.0.0] release, StringView was enabled by default.
+This leads to some significant performance enhancements, but it *may*
+require some changes for users of `datafusion-python`.
+
+To learn more about the excellent work on this feature, please read [part 1] and [part 2]
+of the blog post describing how these enhancements can lead to 20-200% performance
+gains in some tests.
+
+During our testing we identified some cases where we needed to adjust workflows to
+account for the fact that StringView is now the default type for string-based operations.
+In particular, when performing manipulations on string objects there is a performance loss
+when needing to cast from string to string view or vice versa. To reap the best performance,
+ideally all of your string type data will use StringView. For most users this should be
+transparent. However, if you specify a schema for reading or creating data, then you
+likely need to change from `pa.string()` to `pa.string_view()`. In our testing, this
+was primarily needed during data loading operations and in unit tests.
+
+[part 1]: https://datafusion.apache.org/blog/2024/09/13/string-view-german-style-strings-part-1/
+[part 2]: https://datafusion.apache.org/blog/2024/09/13/string-view-german-style-strings-part-2/
+
+If you wish to disable StringView as the default type to retain the old approach,
+you can do so following this example:
+
+```python
+from datafusion import SessionContext
+from datafusion import SessionConfig
+config = SessionConfig({"datafusion.execution.parquet.schema_force_view_types": "false"})
+ctx = SessionContext(config=config)
+```
+
+## Appreciation
+
+We would like to thank everyone who has helped with these releases through their helpful
+conversations, code review, issue descriptions, and code authoring. We would especially
+like to thank the following authors of PRs who made these releases possible, listed in
+alphabetical order by username: [@andygrove], [@drauschenbach], [@emgeee], [@ion-elgreco],
+[@jcrist], [@kosiew], [@mesejo], [@Michael-J-Ward], and [@sir-sigurd].
+
+Thank you!
+
+[@andygrove]: https://github.com/andygrove
+[@drauschenbach]: https://github.com/drauschenbach
+[@emgeee]: https://github.com/emgeee
+[@ion-elgreco]: https://github.com/ion-elgreco
+[@jcrist]: https://github.com/jcrist
+[@kosiew]: https://github.com/kosiew
+[@mesejo]: https://github.com/mesejo
+[@Michael-J-Ward]: https://github.com/Michael-J-Ward
+[@sir-sigurd]: https://github.com/sir-sigurd
+
+## Get Involved
+
+The DataFusion Python team is an active and engaged community, and we would love
+to have you join us and help the project.
+
+Here are some ways to get involved:
+
+* Learn more by visiting the [DataFusion Python project]
+page.
+
+* Try out the project and provide feedback, file issues, and contribute code.
+
+[DataFusion Python project]: https://datafusion.apache.org/python/index.html
diff --git a/content/css/blog_index.css b/content/css/blog_index.css
index cc36581..546a4e5 100644
--- a/content/css/blog_index.css
+++ b/content/css/blog_index.css
@@ -4,13 +4,17 @@ Otherwise the headers appear too large.
*/
h1 {
- font-size: calc(1.325rem + .9vw); /* Article title */
+ font-size: calc(1.3rem + .6vw); /* Article title */
}
h2 {
- font-size: calc(1.3rem + .6vw); /* Main headers within article */
+ font-size: calc(1.275rem + .3vw); /* Main headers within article */
}
h3 {
- font-size: calc(1.275rem + .3vw);
+ font-size: calc(1.25rem);
+}
+
+h4 {
+ font-size: calc(1rem);
}