Re: [PR] Blog post on using UDFs in python [datafusion-site]

via GitHub Mon, 12 Aug 2024 11:43:41 -0700


kylebarron commented on code in PR #17:
URL: https://github.com/apache/datafusion-site/pull/17#discussion_r1714213073



##########
_posts/2024-08-06-datafusion-python-udf-comparisons.md:
##########
@@ -0,0 +1,595 @@
+---
+layout: post
+title: "Comparing approaches to User Defined Functions in Apache Datafusion 
using Python"
+date: "2024-08-06 00:00:00"
+author: timsaucer
+categories: [tutorial]
+---
+
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+# Writing User Defined Functions in Apache Datafusion using Python
+
+## Personal Context
+
+For a few months now I’ve been working with Apache DataFusion, a fast query 
engine written in rust.
+From my experience the language that nearly all data scientists are working in 
is Python. In
+general, for in memory work people often stick to pandas and pyspark for 
larger tasks that cannot
+fit into memory. Polars is also growing extremely fast.
+
+Personally, I would love a single query approach that is fast for both in 
memory usage and can
+extend to large batch processing to exploit parallelization. I think 
DataFusion, coupled with
+Ballista, may provide this solution.
+
+As I’m testing, I’m primarily limiting my work to the datafusion-python 
project, a wrapper around
+the rust library. This wrapper gives you the speed advantages of keeping all 
of the data in the
+rust implementation and the ergonomics of working in python. Personally, I 
would prefer to work
+purely in rust, but I also recognize that since the industry works in python 
we should meet the
+people where they are.
+
+## User Defined Functions
+
+The focus of this post is User Defined Functions. The DataFusion library gives 
a lot of useful

Review Comment:
   Nit: there's usually a hyphen for user-defined



##########
_posts/2024-08-06-datafusion-python-udf-comparisons.md:
##########
@@ -0,0 +1,595 @@
+---
+layout: post
+title: "Comparing approaches to User Defined Functions in Apache Datafusion 
using Python"
+date: "2024-08-06 00:00:00"
+author: timsaucer
+categories: [tutorial]
+---
+
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+# Writing User Defined Functions in Apache Datafusion using Python
+
+## Personal Context
+
+For a few months now I’ve been working with Apache DataFusion, a fast query 
engine written in rust.
+From my experience the language that nearly all data scientists are working in 
is Python. In
+general, for in memory work people often stick to pandas and pyspark for 
larger tasks that cannot
+fit into memory. Polars is also growing extremely fast.
+
+Personally, I would love a single query approach that is fast for both in 
memory usage and can
+extend to large batch processing to exploit parallelization. I think 
DataFusion, coupled with
+Ballista, may provide this solution.
+
+As I’m testing, I’m primarily limiting my work to the datafusion-python 
project, a wrapper around
+the rust library. This wrapper gives you the speed advantages of keeping all 
of the data in the
+rust implementation and the ergonomics of working in python. Personally, I 
would prefer to work
+purely in rust, but I also recognize that since the industry works in python 
we should meet the
+people where they are.
+
+## User Defined Functions
+
+The focus of this post is User Defined Functions. The DataFusion library gives 
a lot of useful
+functions already for doing dataframe manipulation. These are going to be 
similar to those you
+find in other dataframe libraries. You’ll be able to do simple arithmetic, 
create substrings of
+columns, or find the average value across a group of rows. These cover most of 
the use cases
+you’ll need in a DataFrame.
+
+However, there will always arise times when you want a custom function. By 
using user defined

Review Comment:
   ```suggestion
   However, there will always arise times when you want a custom function. With 
user defined
   ```



##########
_posts/2024-08-06-datafusion-python-udf-comparisons.md:
##########
@@ -0,0 +1,595 @@
+---
+layout: post
+title: "Comparing approaches to User Defined Functions in Apache Datafusion 
using Python"
+date: "2024-08-06 00:00:00"
+author: timsaucer
+categories: [tutorial]
+---
+
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+# Writing User Defined Functions in Apache Datafusion using Python
+
+## Personal Context
+
+For a few months now I’ve been working with Apache DataFusion, a fast query 
engine written in rust.
+From my experience the language that nearly all data scientists are working in 
is Python. In
+general, for in memory work people often stick to pandas and pyspark for 
larger tasks that cannot
+fit into memory. Polars is also growing extremely fast.
+
+Personally, I would love a single query approach that is fast for both in 
memory usage and can
+extend to large batch processing to exploit parallelization. I think 
DataFusion, coupled with
+Ballista, may provide this solution.
+
+As I’m testing, I’m primarily limiting my work to the datafusion-python 
project, a wrapper around
+the rust library. This wrapper gives you the speed advantages of keeping all 
of the data in the
+rust implementation and the ergonomics of working in python. Personally, I 
would prefer to work
+purely in rust, but I also recognize that since the industry works in python 
we should meet the
+people where they are.
+
+## User Defined Functions
+
+The focus of this post is User Defined Functions. The DataFusion library gives 
a lot of useful
+functions already for doing dataframe manipulation. These are going to be 
similar to those you

Review Comment:
   Nit: should there be consistent casing for `DataFrame`?



##########
_posts/2024-08-06-datafusion-python-udf-comparisons.md:
##########
@@ -0,0 +1,595 @@
+---
+layout: post
+title: "Comparing approaches to User Defined Functions in Apache Datafusion 
using Python"
+date: "2024-08-06 00:00:00"
+author: timsaucer
+categories: [tutorial]
+---
+
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+# Writing User Defined Functions in Apache Datafusion using Python
+
+## Personal Context
+
+For a few months now I’ve been working with Apache DataFusion, a fast query 
engine written in rust.
+From my experience the language that nearly all data scientists are working in 
is Python. In
+general, for in memory work people often stick to pandas and pyspark for 
larger tasks that cannot
+fit into memory. Polars is also growing extremely fast.
+
+Personally, I would love a single query approach that is fast for both in 
memory usage and can
+extend to large batch processing to exploit parallelization. I think 
DataFusion, coupled with
+Ballista, may provide this solution.
+
+As I’m testing, I’m primarily limiting my work to the datafusion-python 
project, a wrapper around
+the rust library. This wrapper gives you the speed advantages of keeping all 
of the data in the
+rust implementation and the ergonomics of working in python. Personally, I 
would prefer to work
+purely in rust, but I also recognize that since the industry works in python 
we should meet the
+people where they are.
+
+## User Defined Functions
+
+The focus of this post is User Defined Functions. The DataFusion library gives 
a lot of useful
+functions already for doing dataframe manipulation. These are going to be 
similar to those you
+find in other dataframe libraries. You’ll be able to do simple arithmetic, 
create substrings of
+columns, or find the average value across a group of rows. These cover most of 
the use cases
+you’ll need in a DataFrame.
+
+However, there will always arise times when you want a custom function. By 
using user defined
+functions (UDFs) you open the world of possibilities of your code. Sometimes 
there simply isn’t an
+easy way to use built in functions to achieve your goals.
+
+In the following, I’m going to demonstrate two example use cases. These are 
based on real world
+problems I’ve encountered. Also I want to demonstrate the approach of “make it 
work, make it work
+well, make it fast” that is a motto I’ve seen thrown around in data science.
+
+I will demonstrate three approaches to writing UDFs. In order of increasing 
performance they are
+
+- Writing a pure python function to do your computation
+- Using the pyarrow libraries in python to accelerate your processing
+- Writing a UDF in rust and exposing it to python
+
+Additionally I will demonstrate two variants of this. The first will be nearly 
identical to the
+pyarrow library approach to simplicity of understanding how to connect the 
rust code to python. The
+second version we will do the iteration through the input arrays ourselves to 
give even greater
+flexibility to the user.
+
+Here are the two example use cases, taken from my own work but generalized.
+
+### Use Case 1: Scalar Function
+
+I have a DataFrame and a list of tuples that I’m interested in. I want to 
filter out the dataframe
+to only have values that match those tuples from certain columns in the 
dataframe. For example,
+suppose I have a table of sales line items. There are many columns, but I am 
interested in three: a
+part key, supplier key, and return status. I want only to return a dataframe 
with a specific
+combination of these three values.
+
+Probably the most ergonomic way to do this without UDF is to turn that list of 
tuples into a
+dataframe itself, perform a join, and select the columns from the original 
dataframe. If we were
+working in pyspark we would probably broadcast join the dataframe created from 
the tuple list since
+it is tiny. In practice, I have found that with some dataframe libraries 
performing a filter rather
+than a join can be significantly faster. This is worth profiling for your 
specific use case.
+
+### Use Case 2: Aggregate or Window Function
+
+I have a dataframe with many values that I want to aggregate. I have already 
analyzed it and
+determined there is a noise level below which I do not want to include in my 
analysis. I want to
+compute a sum of only values that are above my noise threshold.
+
+This can be done fairly easy without leaning on a User Defined Aggegate 
Function (UDAF). You can
+simply filter the dataframe and then aggregate using the built in `sum` 
function. Here, we
+demonstrate doing this as a UDF primarily as an example of how to write UDAFs. 
We will use the
+pyarrow compute approach.
+
+## Pure Python approach
+
+The fastest way (developer time, not code time) for me to implement the scalar 
problem solution
+was to do something along the lines of “for each row, check the values of 
interest contains that
+tuple”. I’ve published this as [an 
example](https://github.com/apache/datafusion-python/blob/main/examples/python-udf-comparisons.py)
+in the [datafusion-python 
repository](https://github.com/apache/datafusion-python). Here is an
+example of how this can be done:
+
+```python
+values_of_interest = [
+    (1530, 4031, "N"),
+    (6530, 1531, "N"),
+    (5618, 619, "N"),
+    (8118, 8119, "N"),
+]
+
+def is_of_interest_impl(
+    partkey_arr: pa.Array,
+    suppkey_arr: pa.Array,
+    returnflag_arr: pa.Array,
+) -> pa.Array:
+    result = []
+    for idx, partkey in enumerate(partkey_arr):
+        partkey = partkey.as_py()
+        suppkey = suppkey_arr[idx].as_py()
+        returnflag = returnflag_arr[idx].as_py()
+        value = (partkey, suppkey, returnflag)
+        result.append(value in values_of_interest)
+
+    return pa.array(result)
+
+
+is_of_interest = udf(
+    is_of_interest_impl,
+    [pa.int64(), pa.int64(), pa.utf8()],
+    pa.bool_(),
+    "stable",
+)
+
+df_udf_filter = df_lineitem.filter(
+    is_of_interest(col("l_partkey"), col("l_suppkey"), col("l_returnflag"))
+)
+```
+
+When working with a DataFusion UDF in python, you define your function to take 
in some number of
+expressions. During the evaluation, these will get computed into their 
corresponding values and
+passed to your UDF as a pyarrow Array. We must return an Array also with the 
same number of
+elements (rows). So the UDF example just iterates through all of the arrays 
and checks to see if
+the tuple created from these columns matches any of those that we’re looking 
for.
+
+I’ll repeat because this is something that tripped me up the first time I 
wrote a UDF for
+datafusion: **Datafusion UDFs, even scalar UDFs, process an array of values at 
a time not a single
+row.** This is different from some other dataframe libraries and you may need 
to recognize a slight
+change in mentality.
+
+Some important lines here are the lines like `partkey = partkey.as_py()`. When 
we do this, we pay a
+heavy cost. Now instead of keeping the analysis in the rust code, we have to 
take the values in the
+array and convert them over to python objects. In this case we end up getting 
two numbers and a
+string as real python objects, complete with reference counting and all. Also 
we are iterating
+through the array in python rather than rust native. These will 
**significantly** slow down your
+code. Any time you have to cross the barrier where you change values inside 
the rust arrays into
+python objects or vice versa you will pay **heavy** cost in that 
transformation. You will want to
+design your UDFs to avoid this as much as possible.
+
+## Python approach using pyarrow compute
+
+DataFusion uses [Apache Arrow](https://arrow.apache.org/) as it’s in memory 
data format. This can
+be seen in the way that Arrow Arrays are passed into the UDFs. We can take 
advantage of the fact
+that [pyarrow](https://github.com/apache/arrow-rs), the python arrow wrapper, 
provides a variety of

Review Comment:
   ```suggestion
   that [pyarrow](https://arrow.apache.org/docs/python/), the canonical Python 
Arrow implementation, provides a variety of
   ```



##########
_posts/2024-08-06-datafusion-python-udf-comparisons.md:
##########
@@ -0,0 +1,595 @@
+---
+layout: post
+title: "Comparing approaches to User Defined Functions in Apache Datafusion 
using Python"
+date: "2024-08-06 00:00:00"
+author: timsaucer
+categories: [tutorial]
+---
+
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+# Writing User Defined Functions in Apache Datafusion using Python
+
+## Personal Context
+
+For a few months now I’ve been working with Apache DataFusion, a fast query 
engine written in rust.
+From my experience the language that nearly all data scientists are working in 
is Python. In
+general, for in memory work people often stick to pandas and pyspark for 
larger tasks that cannot
+fit into memory. Polars is also growing extremely fast.
+
+Personally, I would love a single query approach that is fast for both in 
memory usage and can
+extend to large batch processing to exploit parallelization. I think 
DataFusion, coupled with
+Ballista, may provide this solution.
+
+As I’m testing, I’m primarily limiting my work to the datafusion-python 
project, a wrapper around
+the rust library. This wrapper gives you the speed advantages of keeping all 
of the data in the
+rust implementation and the ergonomics of working in python. Personally, I 
would prefer to work
+purely in rust, but I also recognize that since the industry works in python 
we should meet the
+people where they are.
+
+## User Defined Functions
+
+The focus of this post is User Defined Functions. The DataFusion library gives 
a lot of useful
+functions already for doing dataframe manipulation. These are going to be 
similar to those you
+find in other dataframe libraries. You’ll be able to do simple arithmetic, 
create substrings of
+columns, or find the average value across a group of rows. These cover most of 
the use cases
+you’ll need in a DataFrame.
+
+However, there will always arise times when you want a custom function. By 
using user defined
+functions (UDFs) you open the world of possibilities of your code. Sometimes 
there simply isn’t an
+easy way to use built in functions to achieve your goals.
+
+In the following, I’m going to demonstrate two example use cases. These are 
based on real world
+problems I’ve encountered. Also I want to demonstrate the approach of “make it 
work, make it work
+well, make it fast” that is a motto I’ve seen thrown around in data science.
+
+I will demonstrate three approaches to writing UDFs. In order of increasing 
performance they are
+
+- Writing a pure python function to do your computation
+- Using the pyarrow libraries in python to accelerate your processing
+- Writing a UDF in rust and exposing it to python
+
+Additionally I will demonstrate two variants of this. The first will be nearly 
identical to the
+pyarrow library approach to simplicity of understanding how to connect the 
rust code to python. The
+second version we will do the iteration through the input arrays ourselves to 
give even greater
+flexibility to the user.
+
+Here are the two example use cases, taken from my own work but generalized.
+
+### Use Case 1: Scalar Function
+
+I have a DataFrame and a list of tuples that I’m interested in. I want to 
filter out the dataframe
+to only have values that match those tuples from certain columns in the 
dataframe. For example,
+suppose I have a table of sales line items. There are many columns, but I am 
interested in three: a
+part key, supplier key, and return status. I want only to return a dataframe 
with a specific
+combination of these three values.
+
+Probably the most ergonomic way to do this without UDF is to turn that list of 
tuples into a
+dataframe itself, perform a join, and select the columns from the original 
dataframe. If we were
+working in pyspark we would probably broadcast join the dataframe created from 
the tuple list since
+it is tiny. In practice, I have found that with some dataframe libraries 
performing a filter rather
+than a join can be significantly faster. This is worth profiling for your 
specific use case.
+
+### Use Case 2: Aggregate or Window Function
+
+I have a dataframe with many values that I want to aggregate. I have already 
analyzed it and
+determined there is a noise level below which I do not want to include in my 
analysis. I want to
+compute a sum of only values that are above my noise threshold.
+
+This can be done fairly easy without leaning on a User Defined Aggegate 
Function (UDAF). You can
+simply filter the dataframe and then aggregate using the built in `sum` 
function. Here, we
+demonstrate doing this as a UDF primarily as an example of how to write UDAFs. 
We will use the
+pyarrow compute approach.
+
+## Pure Python approach
+
+The fastest way (developer time, not code time) for me to implement the scalar 
problem solution
+was to do something along the lines of “for each row, check the values of 
interest contains that
+tuple”. I’ve published this as [an 
example](https://github.com/apache/datafusion-python/blob/main/examples/python-udf-comparisons.py)
+in the [datafusion-python 
repository](https://github.com/apache/datafusion-python). Here is an
+example of how this can be done:
+
+```python
+values_of_interest = [
+    (1530, 4031, "N"),
+    (6530, 1531, "N"),
+    (5618, 619, "N"),
+    (8118, 8119, "N"),
+]
+
+def is_of_interest_impl(
+    partkey_arr: pa.Array,
+    suppkey_arr: pa.Array,
+    returnflag_arr: pa.Array,
+) -> pa.Array:
+    result = []
+    for idx, partkey in enumerate(partkey_arr):
+        partkey = partkey.as_py()
+        suppkey = suppkey_arr[idx].as_py()
+        returnflag = returnflag_arr[idx].as_py()
+        value = (partkey, suppkey, returnflag)
+        result.append(value in values_of_interest)
+
+    return pa.array(result)
+
+
+is_of_interest = udf(

Review Comment:
   ```suggestion
   # Wrap our custom function with `datafusion.udf`, annotating expected 
   # parameter and return types
   is_of_interest = udf(
   ```
   
   As a separate note, it wouldn't be hard to convert this `udf` function 
wrapper into a Python decorator, so we could do
   
   ```py
   @udf(args=(pa.int64(), pa.int64(), pa.utf8()), returns=pa.bool_(), "stable")
   def is_of_interest(
       partkey_arr: pa.Array,
       suppkey_arr: pa.Array,
       returnflag_arr: pa.Array,
   ) -> pa.Array: ...
   ```



##########
_posts/2024-08-06-datafusion-python-udf-comparisons.md:
##########
@@ -0,0 +1,595 @@
+---
+layout: post
+title: "Comparing approaches to User Defined Functions in Apache Datafusion 
using Python"
+date: "2024-08-06 00:00:00"
+author: timsaucer
+categories: [tutorial]
+---
+
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+# Writing User Defined Functions in Apache Datafusion using Python
+
+## Personal Context
+
+For a few months now I’ve been working with Apache DataFusion, a fast query 
engine written in rust.
+From my experience the language that nearly all data scientists are working in 
is Python. In
+general, for in memory work people often stick to pandas and pyspark for 
larger tasks that cannot
+fit into memory. Polars is also growing extremely fast.
+
+Personally, I would love a single query approach that is fast for both in 
memory usage and can
+extend to large batch processing to exploit parallelization. I think 
DataFusion, coupled with
+Ballista, may provide this solution.
+
+As I’m testing, I’m primarily limiting my work to the datafusion-python 
project, a wrapper around
+the rust library. This wrapper gives you the speed advantages of keeping all 
of the data in the
+rust implementation and the ergonomics of working in python. Personally, I 
would prefer to work
+purely in rust, but I also recognize that since the industry works in python 
we should meet the
+people where they are.
+
+## User Defined Functions
+
+The focus of this post is User Defined Functions. The DataFusion library gives 
a lot of useful
+functions already for doing dataframe manipulation. These are going to be 
similar to those you
+find in other dataframe libraries. You’ll be able to do simple arithmetic, 
create substrings of
+columns, or find the average value across a group of rows. These cover most of 
the use cases
+you’ll need in a DataFrame.
+
+However, there will always arise times when you want a custom function. By 
using user defined
+functions (UDFs) you open the world of possibilities of your code. Sometimes 
there simply isn’t an
+easy way to use built in functions to achieve your goals.
+
+In the following, I’m going to demonstrate two example use cases. These are 
based on real world
+problems I’ve encountered. Also I want to demonstrate the approach of “make it 
work, make it work
+well, make it fast” that is a motto I’ve seen thrown around in data science.
+
+I will demonstrate three approaches to writing UDFs. In order of increasing 
performance they are
+
+- Writing a pure python function to do your computation
+- Using the pyarrow libraries in python to accelerate your processing
+- Writing a UDF in rust and exposing it to python
+
+Additionally I will demonstrate two variants of this. The first will be nearly 
identical to the
+pyarrow library approach to simplicity of understanding how to connect the 
rust code to python. The
+second version we will do the iteration through the input arrays ourselves to 
give even greater
+flexibility to the user.
+
+Here are the two example use cases, taken from my own work but generalized.
+
+### Use Case 1: Scalar Function
+
+I have a DataFrame and a list of tuples that I’m interested in. I want to 
filter out the dataframe
+to only have values that match those tuples from certain columns in the 
dataframe. For example,
+suppose I have a table of sales line items. There are many columns, but I am 
interested in three: a
+part key, supplier key, and return status. I want only to return a dataframe 
with a specific
+combination of these three values.
+
+Probably the most ergonomic way to do this without UDF is to turn that list of 
tuples into a
+dataframe itself, perform a join, and select the columns from the original 
dataframe. If we were
+working in pyspark we would probably broadcast join the dataframe created from 
the tuple list since
+it is tiny. In practice, I have found that with some dataframe libraries 
performing a filter rather
+than a join can be significantly faster. This is worth profiling for your 
specific use case.
+
+### Use Case 2: Aggregate or Window Function
+
+I have a dataframe with many values that I want to aggregate. I have already 
analyzed it and
+determined there is a noise level below which I do not want to include in my 
analysis. I want to
+compute a sum of only values that are above my noise threshold.
+
+This can be done fairly easy without leaning on a User Defined Aggegate 
Function (UDAF). You can
+simply filter the dataframe and then aggregate using the built in `sum` 
function. Here, we
+demonstrate doing this as a UDF primarily as an example of how to write UDAFs. 
We will use the
+pyarrow compute approach.
+
+## Pure Python approach
+
+The fastest way (developer time, not code time) for me to implement the scalar 
problem solution
+was to do something along the lines of “for each row, check the values of 
interest contains that
+tuple”. I’ve published this as [an 
example](https://github.com/apache/datafusion-python/blob/main/examples/python-udf-comparisons.py)
+in the [datafusion-python 
repository](https://github.com/apache/datafusion-python). Here is an
+example of how this can be done:
+
+```python
+values_of_interest = [
+    (1530, 4031, "N"),
+    (6530, 1531, "N"),
+    (5618, 619, "N"),
+    (8118, 8119, "N"),
+]
+
+def is_of_interest_impl(
+    partkey_arr: pa.Array,
+    suppkey_arr: pa.Array,
+    returnflag_arr: pa.Array,
+) -> pa.Array:
+    result = []
+    for idx, partkey in enumerate(partkey_arr):
+        partkey = partkey.as_py()
+        suppkey = suppkey_arr[idx].as_py()
+        returnflag = returnflag_arr[idx].as_py()
+        value = (partkey, suppkey, returnflag)
+        result.append(value in values_of_interest)
+
+    return pa.array(result)
+
+
+is_of_interest = udf(
+    is_of_interest_impl,
+    [pa.int64(), pa.int64(), pa.utf8()],
+    pa.bool_(),
+    "stable",
+)
+
+df_udf_filter = df_lineitem.filter(
+    is_of_interest(col("l_partkey"), col("l_suppkey"), col("l_returnflag"))
+)
+```
+
+When working with a DataFusion UDF in python, you define your function to take 
in some number of
+expressions. During the evaluation, these will get computed into their 
corresponding values and
+passed to your UDF as a pyarrow Array. We must return an Array also with the 
same number of
+elements (rows). So the UDF example just iterates through all of the arrays 
and checks to see if
+the tuple created from these columns matches any of those that we’re looking 
for.
+
+I’ll repeat because this is something that tripped me up the first time I 
wrote a UDF for
+datafusion: **Datafusion UDFs, even scalar UDFs, process an array of values at 
a time not a single
+row.** This is different from some other dataframe libraries and you may need 
to recognize a slight
+change in mentality.
+
+Some important lines here are the lines like `partkey = partkey.as_py()`. When 
we do this, we pay a
+heavy cost. Now instead of keeping the analysis in the rust code, we have to 
take the values in the
+array and convert them over to python objects. In this case we end up getting 
two numbers and a
+string as real python objects, complete with reference counting and all. Also 
we are iterating
+through the array in python rather than rust native. These will 
**significantly** slow down your
+code. Any time you have to cross the barrier where you change values inside 
the rust arrays into
+python objects or vice versa you will pay **heavy** cost in that 
transformation. You will want to
+design your UDFs to avoid this as much as possible.
+
+## Python approach using pyarrow compute
+
+DataFusion uses [Apache Arrow](https://arrow.apache.org/) as it’s in memory 
data format. This can
+be seen in the way that Arrow Arrays are passed into the UDFs. We can take 
advantage of the fact
+that [pyarrow](https://github.com/apache/arrow-rs), the python arrow wrapper, 
provides a variety of
+useful functions. In the example below, we are only using a few of the boolean 
functions and the
+equality function. Each of these functions takes two arrays and analyzes them 
row by row. In the
+below example, we shift the logic around a little since we are now operating 
on an entire array of
+values instead of checking a single row ourselves.
+
+```python
+import pyarrow.compute as pc
+
+def udf_using_pyarrow_compute_impl(
+    partkey_arr: pa.Array,
+    suppkey_arr: pa.Array,
+    returnflag_arr: pa.Array,
+) -> pa.Array:
+    results = None
+    for partkey, suppkey, returnflag in values_of_interest:

Review Comment:
   I think you can use `pyarrow.is_in` to speed this up, instead of doing an 
equality check multiple times: 
https://arrow.apache.org/docs/python/generated/pyarrow.compute.is_in.html



##########
_posts/2024-08-06-datafusion-python-udf-comparisons.md:
##########
@@ -0,0 +1,595 @@
+---
+layout: post
+title: "Comparing approaches to User Defined Functions in Apache Datafusion 
using Python"
+date: "2024-08-06 00:00:00"
+author: timsaucer
+categories: [tutorial]
+---
+
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+# Writing User Defined Functions in Apache Datafusion using Python
+
+## Personal Context
+
+For a few months now I’ve been working with Apache DataFusion, a fast query 
engine written in rust.
+From my experience the language that nearly all data scientists are working in 
is Python. In
+general, for in memory work people often stick to pandas and pyspark for 
larger tasks that cannot
+fit into memory. Polars is also growing extremely fast.
+
+Personally, I would love a single query approach that is fast for both in 
memory usage and can
+extend to large batch processing to exploit parallelization. I think 
DataFusion, coupled with
+Ballista, may provide this solution.

Review Comment:
   You mention Polars but then in the next line come back to DataFusion. It 
might be worth a line to say why you decided on DataFusion over Polars, since 
they seem to have relatively similar goals here. Or maybe take out the polars 
reference if this blog post isn't really meant as a comparison to polars?



##########
_posts/2024-08-06-datafusion-python-udf-comparisons.md:
##########
@@ -0,0 +1,595 @@
+---
+layout: post
+title: "Comparing approaches to User Defined Functions in Apache Datafusion 
using Python"
+date: "2024-08-06 00:00:00"
+author: timsaucer
+categories: [tutorial]
+---
+
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+# Writing User Defined Functions in Apache Datafusion using Python
+
+## Personal Context
+
+For a few months now I’ve been working with Apache DataFusion, a fast query 
engine written in rust.
+From my experience the language that nearly all data scientists are working in 
is Python. In
+general, for in memory work people often stick to pandas and pyspark for 
larger tasks that cannot
+fit into memory. Polars is also growing extremely fast.
+
+Personally, I would love a single query approach that is fast for both in 
memory usage and can
+extend to large batch processing to exploit parallelization. I think 
DataFusion, coupled with
+Ballista, may provide this solution.
+
+As I’m testing, I’m primarily limiting my work to the datafusion-python 
project, a wrapper around
+the rust library. This wrapper gives you the speed advantages of keeping all 
of the data in the
+rust implementation and the ergonomics of working in python. Personally, I 
would prefer to work
+purely in rust, but I also recognize that since the industry works in python 
we should meet the
+people where they are.
+
+## User Defined Functions
+
+The focus of this post is User Defined Functions. The DataFusion library gives 
a lot of useful
+functions already for doing dataframe manipulation. These are going to be 
similar to those you
+find in other dataframe libraries. You’ll be able to do simple arithmetic, 
create substrings of
+columns, or find the average value across a group of rows. These cover most of 
the use cases
+you’ll need in a DataFrame.
+
+However, there will always arise times when you want a custom function. By 
using user defined
+functions (UDFs) you open the world of possibilities of your code. Sometimes 
there simply isn’t an
+easy way to use built in functions to achieve your goals.
+
+In the following, I’m going to demonstrate two example use cases. These are 
based on real world
+problems I’ve encountered. Also I want to demonstrate the approach of “make it 
work, make it work
+well, make it fast” that is a motto I’ve seen thrown around in data science.
+
+I will demonstrate three approaches to writing UDFs. In order of increasing 
performance they are
+
+- Writing a pure python function to do your computation
+- Using the pyarrow libraries in python to accelerate your processing
+- Writing a UDF in rust and exposing it to python
+
+Additionally I will demonstrate two variants of this. The first will be nearly 
identical to the
+pyarrow library approach to simplicity of understanding how to connect the 
rust code to python. The
+second version we will do the iteration through the input arrays ourselves to 
give even greater
+flexibility to the user.
+
+Here are the two example use cases, taken from my own work but generalized.
+
+### Use Case 1: Scalar Function
+
+I have a DataFrame and a list of tuples that I’m interested in. I want to 
filter out the dataframe
+to only have values that match those tuples from certain columns in the 
dataframe. For example,
+suppose I have a table of sales line items. There are many columns, but I am 
interested in three: a
+part key, supplier key, and return status. I want only to return a dataframe 
with a specific
+combination of these three values.
+
+Probably the most ergonomic way to do this without UDF is to turn that list of 
tuples into a
+dataframe itself, perform a join, and select the columns from the original 
dataframe. If we were
+working in pyspark we would probably broadcast join the dataframe created from 
the tuple list since
+it is tiny. In practice, I have found that with some dataframe libraries 
performing a filter rather
+than a join can be significantly faster. This is worth profiling for your 
specific use case.
+
+### Use Case 2: Aggregate or Window Function
+
+I have a dataframe with many values that I want to aggregate. I have already 
analyzed it and
+determined there is a noise level below which I do not want to include in my 
analysis. I want to
+compute a sum of only values that are above my noise threshold.
+
+This can be done fairly easy without leaning on a User Defined Aggegate 
Function (UDAF). You can
+simply filter the dataframe and then aggregate using the built in `sum` 
function. Here, we
+demonstrate doing this as a UDF primarily as an example of how to write UDAFs. 
We will use the
+pyarrow compute approach.
+
+## Pure Python approach
+
+The fastest way (developer time, not code time) for me to implement the scalar 
problem solution
+was to do something along the lines of “for each row, check the values of 
interest contains that
+tuple”. I’ve published this as [an 
example](https://github.com/apache/datafusion-python/blob/main/examples/python-udf-comparisons.py)
+in the [datafusion-python 
repository](https://github.com/apache/datafusion-python). Here is an
+example of how this can be done:
+
+```python
+values_of_interest = [
+    (1530, 4031, "N"),
+    (6530, 1531, "N"),
+    (5618, 619, "N"),
+    (8118, 8119, "N"),
+]
+
+def is_of_interest_impl(
+    partkey_arr: pa.Array,
+    suppkey_arr: pa.Array,
+    returnflag_arr: pa.Array,
+) -> pa.Array:
+    result = []
+    for idx, partkey in enumerate(partkey_arr):
+        partkey = partkey.as_py()
+        suppkey = suppkey_arr[idx].as_py()
+        returnflag = returnflag_arr[idx].as_py()
+        value = (partkey, suppkey, returnflag)
+        result.append(value in values_of_interest)
+
+    return pa.array(result)
+
+
+is_of_interest = udf(
+    is_of_interest_impl,
+    [pa.int64(), pa.int64(), pa.utf8()],
+    pa.bool_(),
+    "stable",
+)
+
+df_udf_filter = df_lineitem.filter(
+    is_of_interest(col("l_partkey"), col("l_suppkey"), col("l_returnflag"))
+)
+```
+
+When working with a DataFusion UDF in python, you define your function to take 
in some number of
+expressions. During the evaluation, these will get computed into their 
corresponding values and
+passed to your UDF as a pyarrow Array. We must return an Array also with the 
same number of
+elements (rows). So the UDF example just iterates through all of the arrays 
and checks to see if
+the tuple created from these columns matches any of those that we’re looking 
for.
+
+I’ll repeat because this is something that tripped me up the first time I 
wrote a UDF for
+datafusion: **Datafusion UDFs, even scalar UDFs, process an array of values at 
a time not a single
+row.** This is different from some other dataframe libraries and you may need 
to recognize a slight
+change in mentality.
+
+Some important lines here are the lines like `partkey = partkey.as_py()`. When 
we do this, we pay a
+heavy cost. Now instead of keeping the analysis in the rust code, we have to 
take the values in the
+array and convert them over to python objects. In this case we end up getting 
two numbers and a
+string as real python objects, complete with reference counting and all. Also 
we are iterating
+through the array in python rather than rust native. These will 
**significantly** slow down your
+code. Any time you have to cross the barrier where you change values inside 
the rust arrays into
+python objects or vice versa you will pay **heavy** cost in that 
transformation. You will want to
+design your UDFs to avoid this as much as possible.
+
+## Python approach using pyarrow compute
+
+DataFusion uses [Apache Arrow](https://arrow.apache.org/) as it’s in memory 
data format. This can

Review Comment:
   ```suggestion
   DataFusion uses [Apache Arrow](https://arrow.apache.org/) as its in-memory 
data format. This can
   ```



##########
_posts/2024-08-06-datafusion-python-udf-comparisons.md:
##########
@@ -0,0 +1,595 @@
+---
+layout: post
+title: "Comparing approaches to User Defined Functions in Apache Datafusion 
using Python"
+date: "2024-08-06 00:00:00"
+author: timsaucer
+categories: [tutorial]
+---
+
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+# Writing User Defined Functions in Apache Datafusion using Python
+
+## Personal Context
+
+For a few months now I’ve been working with Apache DataFusion, a fast query 
engine written in rust.
+From my experience the language that nearly all data scientists are working in 
is Python. In
+general, for in memory work people often stick to pandas and pyspark for 
larger tasks that cannot
+fit into memory. Polars is also growing extremely fast.
+
+Personally, I would love a single query approach that is fast for both in 
memory usage and can
+extend to large batch processing to exploit parallelization. I think 
DataFusion, coupled with
+Ballista, may provide this solution.
+
+As I’m testing, I’m primarily limiting my work to the datafusion-python 
project, a wrapper around
+the rust library. This wrapper gives you the speed advantages of keeping all 
of the data in the
+rust implementation and the ergonomics of working in python. Personally, I 
would prefer to work
+purely in rust, but I also recognize that since the industry works in python 
we should meet the
+people where they are.
+
+## User Defined Functions
+
+The focus of this post is User Defined Functions. The DataFusion library gives 
a lot of useful
+functions already for doing dataframe manipulation. These are going to be 
similar to those you
+find in other dataframe libraries. You’ll be able to do simple arithmetic, 
create substrings of
+columns, or find the average value across a group of rows. These cover most of 
the use cases
+you’ll need in a DataFrame.
+
+However, there will always arise times when you want a custom function. By 
using user defined
+functions (UDFs) you open the world of possibilities of your code. Sometimes 
there simply isn’t an

Review Comment:
   ```suggestion
   functions (UDFs) you open a world of possibilities in your code. Sometimes 
there simply isn’t an
   ```



##########
_posts/2024-08-06-datafusion-python-udf-comparisons.md:
##########
@@ -0,0 +1,595 @@
+---
+layout: post
+title: "Comparing approaches to User Defined Functions in Apache Datafusion 
using Python"
+date: "2024-08-06 00:00:00"
+author: timsaucer
+categories: [tutorial]
+---
+
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+# Writing User Defined Functions in Apache Datafusion using Python
+
+## Personal Context
+
+For a few months now I’ve been working with Apache DataFusion, a fast query 
engine written in rust.
+From my experience the language that nearly all data scientists are working in 
is Python. In
+general, for in memory work people often stick to pandas and pyspark for 
larger tasks that cannot
+fit into memory. Polars is also growing extremely fast.
+
+Personally, I would love a single query approach that is fast for both in 
memory usage and can
+extend to large batch processing to exploit parallelization. I think 
DataFusion, coupled with
+Ballista, may provide this solution.
+
+As I’m testing, I’m primarily limiting my work to the datafusion-python 
project, a wrapper around
+the rust library. This wrapper gives you the speed advantages of keeping all 
of the data in the
+rust implementation and the ergonomics of working in python. Personally, I 
would prefer to work
+purely in rust, but I also recognize that since the industry works in python 
we should meet the
+people where they are.
+
+## User Defined Functions
+
+The focus of this post is User Defined Functions. The DataFusion library gives 
a lot of useful
+functions already for doing dataframe manipulation. These are going to be 
similar to those you
+find in other dataframe libraries. You’ll be able to do simple arithmetic, 
create substrings of
+columns, or find the average value across a group of rows. These cover most of 
the use cases
+you’ll need in a DataFrame.
+
+However, there will always arise times when you want a custom function. By 
using user defined
+functions (UDFs) you open the world of possibilities of your code. Sometimes 
there simply isn’t an
+easy way to use built in functions to achieve your goals.

Review Comment:
   ```suggestion
   easy way to use built-in functions to achieve your goals.
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] Blog post on using UDFs in python [datafusion-site]

Reply via email to