This is an automated email from the ASF dual-hosted git repository.

agrove pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/datafusion-site.git


The following commit(s) were added to refs/heads/main by this push:
     new d183ceb  Blog post of datafusion-python updates 40.1.0 (#8)
d183ceb is described below

commit d183ceb2890901acec50fa8bf1d4e00087c0d028
Author: Tim Saucer <[email protected]>
AuthorDate: Tue Aug 20 09:27:12 2024 -0400

    Blog post of datafusion-python updates 40.1.0 (#8)
    
    * Initial commit of blog post of datafusion python updates
    
    * Add readability updates
    
    * Adding section on named structs
    
    * Updating version numbers
    
    * Add thank you section
    
    * Update _posts/2024-08-20-python-datafusion-40.0.0.md
    
    * Update _posts/2024-08-20-python-datafusion-40.0.0.md
    
    ---------
    
    Co-authored-by: Tim Saucer <[email protected]>
    Co-authored-by: Andy Grove <[email protected]>
---
 _posts/2024-08-20-python-datafusion-40.0.0.md      | 226 +++++++++++++++++++++
 .../pylance_error_checking.png                     | Bin 0 -> 39119 bytes
 .../vscode_hover_tooltip.png                       | Bin 0 -> 87320 bytes
 3 files changed, 226 insertions(+)

diff --git a/_posts/2024-08-20-python-datafusion-40.0.0.md 
b/_posts/2024-08-20-python-datafusion-40.0.0.md
new file mode 100644
index 0000000..7ced578
--- /dev/null
+++ b/_posts/2024-08-20-python-datafusion-40.0.0.md
@@ -0,0 +1,226 @@
+---
+layout: post
+title: "Apache DataFusion Python 40.1.0 Released, Significant usability 
updates"
+date: "2024-08-20 00:00:00"
+author: timsaucer
+categories: [release]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+## Introduction
+
+We are happy to announce that [DataFusion in Python 40.1.0] has been released. 
In addition to
+bringing in all of the new features of the core [DataFusion 40.0.0] package, 
this release
+contains *significant* updates to the user interface and documentation. We 
listened to the python
+user community to create a more *pythonic* experience. If you have not used 
the python interface to
+DataFusion before, this is an excellent time to give it a try!
+
+[DataFusion 40.0.0]: 
https://datafusion.apache.org/blog/2024/07/24/datafusion-40.0.0/
+[DataFusion in Python 40.1.0]: https://pypi.org/project/datafusion/40.1.0/
+
+## Background
+
+Until now, the python bindings for DataFusion have primarily been a thin layer 
to expose the
+underlying Rust functionality. This has been worked well for early adopters to 
use DataFusion
+within their Python projects, but some users have found it difficult to work 
with. As compared to
+other DataFrame libraries, these issues were raised:
+
+1. Most of the functions had little or no documentation. Users often had to 
refer to the Rust
+documentation or code to learn how to use DataFusion. This alienated some 
python users.
+2. Users could not take advantage of modern IDE features such as type hinting. 
These are valuable
+tools for rapid testing and development.
+3. Some of the interfaces felt “clunky” to users since some Python concepts do 
not always map well
+to their Rust counterparts.
+
+This release aims to bring a better user experience to the DataFusion Python 
community.
+
+## What's Changed
+
+The most significant difference is that we have added wrapper functions and 
classes for most of the
+user facing interface. These wrappers, written in Python, contain both 
documentation and type
+annotations.
+
+This documenation is now available on the [DataFusion in 
Python](https://datafusion.apache.org/python/api.html)
+website. There you can browse the available functions and classes to see the 
breadth of available
+functionality.
+
+Modern IDEs use language servers such as
+[Pylance](https://marketplace.visualstudio.com/items?itemName=ms-python.vscode-pylance)
 or
+[Jedi](https://jedi.readthedocs.io/en/latest/) to perform analysis of python 
code, provide useful
+hints, and identify usage errors. These are major tools in the python user 
community. With this
+release, users can fully use these tools in their workflow.
+
+<figure style="text-align: center;">
+  <img 
+    src="{{ site.baseurl 
}}/img/python-datafusion-40.0.0/vscode_hover_tooltip.png" 
+    width="100%"
+    class="img-responsive"
+    alt="Fig 1: Enhanced tooltips in an IDE."
+  >
+  <figcaption>
+   <b>Figure 1</b>: With the enhanced python wrappers, users can see helpful 
tool tips with
+   type annotations directly in modern IDEs.
+</figcaption>
+</figure>
+
+By having the type annotations, these IDEs can also identify quickly when a 
user has incorrectly
+used a function's arguments as shown in Figure 2.
+
+<figure style="text-align: center;">
+  <img 
+    src="{{ site.baseurl 
}}/img/python-datafusion-40.0.0/pylance_error_checking.png"
+    width="100%"
+    class="img-responsive"
+    alt="Fig 2: Error checking in static analysis"
+  >
+  <figcaption>
+   <b>Figure 2</b>: Modern Python language servers can perform static analysis 
and quickly find
+   errors in the arguments to functions.
+</figcaption>
+</figure>
+
+In addition to these wrapper libraries, we have enhancements to some of the 
functions to feel more
+easy to use.
+
+### Improved DataFrame filter arguments
+
+You can now apply multiple `filter` statements in a single step. When using 
`DataFrame.filter` you
+can pass in multiple arguments, separated by a comma. These will act as a 
logical `AND` of all of
+the filter arguments. The following two statements are equivalent:
+
+```python
+df.filter(col("size") < col("max_size")).filter(col("color") == lit("green"))
+df.filter(col("size") < col("max_size"), col("color") == lit("green"))
+```
+
+### Comparison against literal values
+
+It is very common to write DataFrame operations that compare an expression to 
some fixed value.
+For example, filtering a DataFrame might have an operation such as 
`df.filter(col("size") < lit(16))`.
+To make these common operations more ergonomic, you can now simply use 
`df.filter(col("size") < 16)`.
+
+For the right hand side of the comparison operator, you can now use any Python 
value that can be
+coerced into a `Literal`. This gives an easy to ready expression. For example, 
consider these few
+lines from one of the
+[TPC-H 
examples](https://github.com/apache/datafusion-python/tree/main/examples/tpch) 
provided in
+the DataFusion Python repository.
+
+```python
+df = (
+    df_lineitem.filter(col("l_shipdate") >= lit(date))
+    .filter(col("l_discount") >= lit(DISCOUNT) - lit(DELTA))
+    .filter(col("l_discount") <= lit(DISCOUNT) + lit(DELTA))
+    .filter(col("l_quantity") < lit(QUANTITY))
+)
+```
+
+The above code mirrors closely how these filters would need to be applied in 
rust. With this new
+release, the user can simplify these lines. Also shown in the example below is 
that `filter()`
+now accepts a variable number of arguments and filters on all such arguments 
(boolean AND).
+
+```python
+df = df_lineitem.filter(
+    col("l_shipdate") >= date,
+    col("l_discount") >= DISCOUNT - DELTA,
+    col("l_discount") <= DISCOUNT + DELTA,
+    col("l_quantity") < QUANTITY,
+)
+```
+
+### Select columns by name
+
+It is very common for users to perform `DataFrame` selection where they simply 
want a column. For
+this we have had the function `select_columns("a", "b")` or the user could 
perform
+`select(col("a"), col("b"))`. In the new release, we accept either full 
expressions in `select()`
+or strings of the column names. You can mix these as well.
+
+Where before you may have to do an operation like
+
+```python
+df_subset = df.select(col("a"), col("b"), f.abs(col("c")))
+```
+
+You can now simplify this to
+
+```python
+df_subset = df.select("a", "b", f.abs(col("c")))
+```
+
+### Creating named structs
+
+Creating a `struct` with named fields was previously difficult to use and 
allowed for potential
+user errors when specifying the name of each field. Now we have a cleaner 
interface where the
+user passes a list of tuples containing the name of the field and the 
expression to create.
+
+```python
+df.select(f.named_struct([
+  ("a", col("a")),
+  ("b", col("b"))
+]))
+```
+
+## Next Steps
+
+While most of the user facing classes and functions have been exposed, there 
are a few that require
+exposure. Namely the classes in `datafusion.object_store` and the logical 
plans used by
+`datafusion.substrait`. The team is working on
+[these issues](https://github.com/apache/datafusion-python/issues/767).
+
+Additionally, in the next release of DataFusion there have been improvements 
made to the user-defined
+aggregate and window functions to make them easier to use. We plan on
+[bringing these 
enhancements](https://github.com/apache/datafusion-python/issues/780) to this 
project.
+
+## Thank You
+
+We would like to thank the following members for their very helpful 
discussions regarding these
+updates: [@andygrove], [@max-muoto], [@slyons], [@Throne3d], 
[@Michael-J-Ward], [@datapythonista],
+[@austin362667], [@kylebarron], [@simicd]. The [primary PR (#750)] that 
includes these updates
+had an extensive conversation, leading to a significantly improved end 
product. Again, thank you
+to all who provided input!
+
+We would like to give an special thank you to [@3ok] who created the initial 
version of the wrapper
+definitions. The work they did was time consuming and required exceptional 
attention to detail. It
+provided enormous value to starting this project. Thank you!
+
+## Get Involved
+
+The DataFusion Python team is an active and engaging community and we would 
love
+to have you join us and help the project.
+
+Here are some ways to get involved:
+
+* Learn more by visiting the [DataFusion Python 
project](https://datafusion.apache.org/python/index.html)
+page.
+
+* Try out the project and provide feedback, file issues, and contribute code.
+
+[mailing list]: https://lists.apache.org/[email protected]
+
+[@andygrove]: https://github.com/andygrove
+[@max-muoto]: https://github.com/max-muoto
+[@slyons]: https://github.com/slyons
+[@Throne3d]: https://github.com/Throne3d
+[@Michael-J-Ward]: https://github.com/Michael-J-Ward
+[@datapythonista]: https://github.com/datapythonista
+[@austin362667]: https://github.com/austin362667
+[@kylebarron]: https://github.com/kylebarron
+[@simicd]: https://github.com/simicd
+[@3ok]: https://github.com/3ok
+[primary PR (#750)]: https://github.com/apache/datafusion-python/pull/750
diff --git a/img/python-datafusion-40.0.0/pylance_error_checking.png 
b/img/python-datafusion-40.0.0/pylance_error_checking.png
new file mode 100644
index 0000000..2664bf3
Binary files /dev/null and 
b/img/python-datafusion-40.0.0/pylance_error_checking.png differ
diff --git a/img/python-datafusion-40.0.0/vscode_hover_tooltip.png 
b/img/python-datafusion-40.0.0/vscode_hover_tooltip.png
new file mode 100644
index 0000000..c1b49d7
Binary files /dev/null and 
b/img/python-datafusion-40.0.0/vscode_hover_tooltip.png differ


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to