paleolimbot commented on code in PR #73:
URL: https://github.com/apache/datafusion-site/pull/73#discussion_r2135925472
##########
content/blog/2025-06-09-metadata-handling.md:
##########
@@ -0,0 +1,98 @@
+---
+layout: post
+title: Metadata handling in user defined functions
+date: 2025-06-09
+author: Tim Saucer
+categories: [core]
+---
+
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+http://www.apache.org/licenses/LICENSE-2.0
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+[DataFusion 48.0.0] introduced a change in the interface for writing custom functions
+which enables a variety of interesting improvements. Now users can access additional
+data about the input columns to functions, such as their nullability and metadata. This
+enables processing of extension types as well as a wide variety of other use cases.
+
+TODO: UPDATE LINKS
+
+[DataFusion 48.0.0]: https://github.com/apache/datafusion/tree/48.0.0-rc3
+
+# Why metadata handling is important
+
+Data in Arrow record batches carry a `Schema` in addition to the Arrow arrays. Each
+[Field] in this `Schema` contains a name, data type, nullability, and metadata. The
+metadata is specified as a map of key-value pairs of strings. In the new
+implementation, during processing of all user defined functions we pass the input
+field information.
Review Comment:
```suggestion
metadata is specified as a map of key-value pairs of strings. This extra metadata is used
by Arrow implementations to implement [extension types] and can also be used to add
use case-specific context to a column of values where the formality of an extension type
is not required. In previous versions of DataFusion, field metadata was propagated through
certain operations (e.g., renaming or selecting a column) but was not accessible to others
(e.g., scalar, window, or aggregate function calls). In the new implementation, during
processing of all user defined functions we pass the input field information and allow
user defined function implementations to return field information to the caller.

[extension types]: https://arrow.apache.org/docs/format/Columnar.html#format-metadata-extension-types
```
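For readers who haven't seen how extension types ride on field metadata, a minimal sketch (plain Python, with a dict standing in for a Field's metadata map; the two key names are the ones the Arrow columnar spec reserves, the rest is illustrative):

```python
# The Arrow columnar format reserves these Field metadata keys for extension
# types; everything else in the metadata map is free-form key/value strings.
metadata = {
    "ARROW:extension:name": "arrow.uuid",  # which extension type this column carries
    "ARROW:extension:metadata": "",        # optional type-level parameters (none for uuid)
}

def is_uuid_column(field_metadata: dict) -> bool:
    # A consumer recognizes the extension type purely from the metadata map;
    # the storage type underneath stays an ordinary Arrow type.
    return field_metadata.get("ARROW:extension:name") == "arrow.uuid"

print(is_uuid_column(metadata))         # True
print(is_uuid_column({"units": "kg"}))  # False
```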
##########
content/blog/2025-06-09-metadata-handling.md:
##########
@@ -0,0 +1,98 @@
+---
+layout: post
+title: Metadata handling in user defined functions
+date: 2025-06-09
+author: Tim Saucer
+categories: [core]
+---
+
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+http://www.apache.org/licenses/LICENSE-2.0
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+[DataFusion 48.0.0] introduced a change in the interface for writing custom functions
+which enables a variety of interesting improvements. Now users can access additional
+data about the input columns to functions, such as their nullability and metadata. This
+enables processing of extension types as well as a wide variety of other use cases.
+
+TODO: UPDATE LINKS
+
+[DataFusion 48.0.0]: https://github.com/apache/datafusion/tree/48.0.0-rc3
+
+# Why metadata handling is important
+
+Data in Arrow record batches carry a `Schema` in addition to the Arrow arrays. Each
+[Field] in this `Schema` contains a name, data type, nullability, and metadata. The
+metadata is specified as a map of key-value pairs of strings. In the new
+implementation, during processing of all user defined functions we pass the input
+field information.
+
+It is often desirable to write a generic function for reuse. With the prior version of
+user defined functions, we only had access to the `DataType` of the input columns. This
+works well for some features that only rely on the types of data. Other use cases may
+need additional information that describes the data.
+
+For example, suppose I write a function that computes the force of gravity on an object
+based on its mass. The general equation is `F = m * g` where `g = 9.8 m/s²`. Suppose
+our documentation for the function specifies the output will be in Newtons. This is only
+valid if the input unit is in kilograms. With our metadata enhancement, we could update
+this function to now evaluate the input units, perform any kind of required
+transformation, and give consistent output every time. We could also have the function
+return an error if an invalid input was given, such as providing an input where the
+metadata says the units are in `meters` instead of a unit of mass.
Review Comment:
I wonder if we could turn this into a code example with DataFusion(Python?)
UDFs to make it more concrete (I can help). Maybe a UDF called `uuid_version`
or `uuid_timestamp` that extracts the embedded version or timestamp off of a
UUID type (and a `uuid()` generating function)? (pyarrow and DuckDB both
understand the arrow.uuid extension type out of the box which facilitates a
nice interchange example where the uuid-ness isn't lost at the edges).
The arbitrary key/value metadata use case is cool too (and I get that it's
the use case that motivated this whole thing from your end!) but it's harder to
find an in-the-wild example where a user can leverage this out of the box. The
places I have run into this in the wild are basically data sources that write
things there (like perhaps rerun) whose provider didn't know about extension
types (e.g., the API Snowflake uses to get data from the server to its Python
connector uses field metadata to communicate the Snowflake type information,
whereas BigQuery's Arrow API uses extension types to communicate its type
information).
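To sketch what that could look like, here is the core of a hypothetical `uuid_version` in plain Python with the standard `uuid` module; a dict stands in for the input Field's metadata map, and the DataFusion UDF wiring is omitted (names are illustrative, not the actual API):

```python
import uuid

# Metadata key the Arrow columnar format reserves for extension type names.
EXTENSION_NAME_KEY = "ARROW:extension:name"

def uuid_version(field_metadata: dict, values: list) -> list:
    # `field_metadata` stands in for the input Field's metadata map that the
    # new interface hands to the UDF; reject non-uuid input up front.
    if field_metadata.get(EXTENSION_NAME_KEY) != "arrow.uuid":
        raise ValueError("uuid_version expects an arrow.uuid column")
    # Each value is the 16-byte binary representation of a UUID; the version
    # is embedded in the value itself.
    return [uuid.UUID(bytes=v).version for v in values]

raw = uuid.uuid4().bytes  # a version-4 (random) UUID as 16 raw bytes
print(uuid_version({EXTENSION_NAME_KEY: "arrow.uuid"}, [raw]))  # [4]
```

The metadata check is exactly the piece the old `DataType`-only interface could not express: the storage type is just fixed-size binary, so without the field metadata the function cannot tell a UUID column from any other 16-byte blob.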
##########
content/blog/2025-06-09-metadata-handling.md:
##########
@@ -0,0 +1,98 @@
+---
+layout: post
+title: Metadata handling in user defined functions
+date: 2025-06-09
+author: Tim Saucer
+categories: [core]
+---
+
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+http://www.apache.org/licenses/LICENSE-2.0
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+[DataFusion 48.0.0] introduced a change in the interface for writing custom functions
+which enables a variety of interesting improvements. Now users can access additional
+data about the input columns to functions, such as their nullability and metadata. This
+enables processing of extension types as well as a wide variety of other use cases.
+
+TODO: UPDATE LINKS
+
+[DataFusion 48.0.0]: https://github.com/apache/datafusion/tree/48.0.0-rc3
+
+# Why metadata handling is important
+
+Data in Arrow record batches carry a `Schema` in addition to the Arrow arrays. Each
+[Field] in this `Schema` contains a name, data type, nullability, and metadata. The
+metadata is specified as a map of key-value pairs of strings. In the new
+implementation, during processing of all user defined functions we pass the input
+field information.
+
+It is often desirable to write a generic function for reuse. With the prior version of
+user defined functions, we only had access to the `DataType` of the input columns. This
+works well for some features that only rely on the types of data. Other use cases may
+need additional information that describes the data.
+
+For example, suppose I write a function that computes the force of gravity on an object
+based on its mass. The general equation is `F = m * g` where `g = 9.8 m/s²`. Suppose
+our documentation for the function specifies the output will be in Newtons. This is only
+valid if the input unit is in kilograms. With our metadata enhancement, we could update
+this function to now evaluate the input units, perform any kind of required
+transformation, and give consistent output every time. We could also have the function
+return an error if an invalid input was given, such as providing an input where the
+metadata says the units are in `meters` instead of a unit of mass.
+
+One common application of metadata handling is understanding the encoding of a blob of data.
+Suppose you have a column that contains image data. You could use metadata to specify
+the encoding of the image data so you could use the appropriate decoder.
+
+[field]: https://arrow.apache.org/docs/format/Glossary.html#term-field
+
+# How to use metadata in user defined functions
+
+Using input metadata occurs in two different phases of a user defined function. During both
+planning and execution, we have access to this field information. This allows
+the user to determine the appropriate output fields during planning and to validate the
+input. For other use cases, it may only be necessary to access these fields during
+execution. We leave this open to the user.
+
+For all types of user defined functions we now evaluate the output [Field] as well. You can
+specify this to create your own metadata from your functions or to pass through metadata
+from one or more of your inputs.
+
+In addition to metadata, the input field information carries nullability. With this you can
+compute more expressive nullability for your output data instead of a single fixed value.
+For example, you could write a function to convert a string to uppercase. If we know the
+input field is non-nullable, then we can set the output field to non-nullable as well.
+
+# Extension types
+
+TODO
+
+# Working with literals
+
+TODO
Review Comment:
The place where I use this is finding my values in optimizer rules (for
example, a `cast(uuid_val, String)` could be replaced with a function that
prettifies the UUID in the way UUIDs are usually prettified). That's perhaps too
complex for this post (perhaps this section doesn't need an example).
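Agreed it may be too much for the post, but for the record the substituted function's body is tiny; a sketch using the standard `uuid` module (`prettify_uuid` is a hypothetical name, not anything in DataFusion):

```python
import uuid

def prettify_uuid(raw: bytes) -> str:
    # The usual prettification: canonical 8-4-4-4-12 hyphenated lowercase form
    # of the 16 raw bytes.
    return str(uuid.UUID(bytes=raw))

print(prettify_uuid(bytes(range(16))))  # 00010203-0405-0607-0809-0a0b0c0d0e0f
```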
##########
content/blog/2025-06-09-metadata-handling.md:
##########
@@ -0,0 +1,98 @@
+---
+layout: post
+title: Metadata handling in user defined functions
Review Comment:
Maybe: "Field metadata and extension type support in user defined functions"?
##########
content/blog/2025-06-09-metadata-handling.md:
##########
@@ -0,0 +1,98 @@
+---
+layout: post
+title: Metadata handling in user defined functions
+date: 2025-06-09
+author: Tim Saucer
+categories: [core]
+---
+
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+http://www.apache.org/licenses/LICENSE-2.0
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+[DataFusion 48.0.0] introduced a change in the interface for writing custom functions
+which enables a variety of interesting improvements. Now users can access additional
+data about the input columns to functions, such as their nullability and metadata. This
+enables processing of extension types as well as a wide variety of other use cases.
+
+TODO: UPDATE LINKS
+
+[DataFusion 48.0.0]: https://github.com/apache/datafusion/tree/48.0.0-rc3
+
+# Why metadata handling is important
+
+Data in Arrow record batches carry a `Schema` in addition to the Arrow arrays. Each
+[Field] in this `Schema` contains a name, data type, nullability, and metadata. The
+metadata is specified as a map of key-value pairs of strings. In the new
+implementation, during processing of all user defined functions we pass the input
+field information.
+
+It is often desirable to write a generic function for reuse. With the prior version of
+user defined functions, we only had access to the `DataType` of the input columns. This
+works well for some features that only rely on the types of data. Other use cases may
+need additional information that describes the data.
+
+For example, suppose I write a function that computes the force of gravity on an object
+based on its mass. The general equation is `F = m * g` where `g = 9.8 m/s²`. Suppose
+our documentation for the function specifies the output will be in Newtons. This is only
+valid if the input unit is in kilograms. With our metadata enhancement, we could update
+this function to now evaluate the input units, perform any kind of required
+transformation, and give consistent output every time. We could also have the function
+return an error if an invalid input was given, such as providing an input where the
+metadata says the units are in `meters` instead of a unit of mass.
+
+One common application of metadata handling is understanding the encoding of a blob of data.
+Suppose you have a column that contains image data. You could use metadata to specify
+the encoding of the image data so you could use the appropriate decoder.
+
+[field]: https://arrow.apache.org/docs/format/Glossary.html#term-field
+
+# How to use metadata in user defined functions
+
+Using input metadata occurs in two different phases of a user defined function. During both
+planning and execution, we have access to this field information. This allows
+the user to determine the appropriate output fields during planning and to validate the
+input. For other use cases, it may only be necessary to access these fields during
+execution. We leave this open to the user.
+
+For all types of user defined functions we now evaluate the output [Field] as well. You can
+specify this to create your own metadata from your functions or to pass through metadata
+from one or more of your inputs.
+
+In addition to metadata, the input field information carries nullability. With this you can
+compute more expressive nullability for your output data instead of a single fixed value.
+For example, you could write a function to convert a string to uppercase. If we know the
+input field is non-nullable, then we can set the output field to non-nullable as well.
+
Review Comment:
Perhaps the first example could be high-level Python (where pyarrow takes
care of the field metadata automagically), and this example could be Rust
(where we'd have to check the content of the fields and/or assign them).
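Whatever language we pick, the logic is "derive the output Field from the input Fields"; a language-neutral sketch in plain Python (the dataclass stands in for an Arrow `Field`; the hook name and shape are illustrative, not the actual DataFusion API):

```python
from dataclasses import dataclass, field

@dataclass
class Field:
    # Plain stand-in for an Arrow Field: name, data type, nullability, metadata.
    name: str
    data_type: str
    nullable: bool
    metadata: dict = field(default_factory=dict)

def upper_output_field(inputs: list) -> Field:
    # Output-field hook for an `upper(str)`-style UDF: the result is null
    # exactly when the input is null, so propagate the input's nullability
    # instead of declaring the output unconditionally nullable.
    (arg,) = inputs
    return Field(name="upper", data_type="Utf8", nullable=arg.nullable)

out = upper_output_field([Field("s", "Utf8", nullable=False)])
print(out.nullable)  # False
```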
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]