paleolimbot commented on code in PR #73:
URL: https://github.com/apache/datafusion-site/pull/73#discussion_r2135925472
##########
content/blog/2025-06-09-metadata-handling.md:
##########
@@ -0,0 +1,98 @@
+---
+layout: post
+title: Metadata handling in user defined functions
+date: 2025-06-09
+author: Tim Saucer
+categories: [core]
+---
+
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+http://www.apache.org/licenses/LICENSE-2.0
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+[DataFusion 48.0.0] introduced a change in the interface for writing custom functions
+which enables a variety of interesting improvements. Now users can access additional
+data about the input columns to functions, such as their nullability and metadata. This
+enables processing of extension types as well as a wide variety of other use cases.
+
+TODO: UPDATE LINKS
+
+[DataFusion 48.0.0]: https://github.com/apache/datafusion/tree/48.0.0-rc3
+
+# Why metadata handling is important
+
+Data in Arrow record batches carry a `Schema` in addition to the Arrow arrays. Each
+[Field] in this `Schema` contains a name, data type, nullability, and metadata. The
+metadata is specified as a map of key-value pairs of strings. In the new
+implementation, during processing of all user defined functions we pass the input
+field information.
Review Comment:
```suggestion
metadata is specified as a map of key-value pairs of strings. This extra metadata is used
by Arrow implementations to implement [extension types] and can also be used to add
use case-specific context to a column of values where the formality of an extension type
is not required. In previous versions of DataFusion, field metadata was propagated through
certain operations (e.g., renaming or selecting a column) but was not accessible to others
(e.g., scalar, window, or aggregate function calls). In the new implementation, during
processing of all user defined functions we pass the input field information and allow
user defined function implementations to return field information to the caller.

[extension types]: https://arrow.apache.org/docs/format/Columnar.html#format-metadata-extension-types
```
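For readers who haven't seen how extension types ride on field metadata, a minimal sketch (plain Python, with a dict standing in for a Field's metadata map; the two key names are the ones the Arrow columnar spec reserves, the rest is illustrative):

```python
# The Arrow columnar format reserves these Field metadata keys for extension
# types; everything else in the metadata map is free-form key/value strings.
metadata = {
    "ARROW:extension:name": "arrow.uuid",  # which extension type this column carries
    "ARROW:extension:metadata": "",        # optional type-level parameters (none for uuid)
}

def is_uuid_column(field_metadata: dict) -> bool:
    # A consumer recognizes the extension type purely from the metadata map;
    # the storage type underneath stays an ordinary Arrow type.
    return field_metadata.get("ARROW:extension:name") == "arrow.uuid"

print(is_uuid_column(metadata))         # True
print(is_uuid_column({"units": "kg"}))  # False
```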
##########
content/blog/2025-06-09-metadata-handling.md:
##########
@@ -0,0 +1,98 @@
+---
+layout: post
+title: Metadata handling in user defined functions
+date: 2025-06-09
+author: Tim Saucer
+categories: [core]
+---
+
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+http://www.apache.org/licenses/LICENSE-2.0
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+[DataFusion 48.0.0] introduced a change in the interface for writing custom functions
+which enables a variety of interesting improvements. Now users can access additional
+data about the input columns to functions, such as their nullability and metadata. This
+enables processing of extension types as well as a wide variety of other use cases.
+
+TODO: UPDATE LINKS
+
+[DataFusion 48.0.0]: https://github.com/apache/datafusion/tree/48.0.0-rc3
+
+# Why metadata handling is important
+
+Data in Arrow record batches carry a `Schema` in addition to the Arrow arrays. Each
+[Field] in this `Schema` contains a name, data type, nullability, and metadata. The
+metadata is specified as a map of key-value pairs of strings. In the new
+implementation, during processing of all user defined functions we pass the input
+field information.
+
+It is often desirable to write a generic function for reuse. With the prior version of
+user defined functions, we only had access to the `DataType` of the input columns. This
+works well for some features that only rely on the types of data. Other use cases may
+need additional information that describes the data.
+
+For example, suppose I write a function that computes the force of gravity on an object
+based on its mass. The general equation is `F = m * g` where `g = 9.8 m/s²`. Suppose
+our documentation for the function specifies the output will be in Newtons. This is only
+valid if the input unit is in kilograms. With our metadata enhancement, we could update
+this function to now evaluate the input units, perform any kind of required
+transformation, and give consistent output every time. We could also have the function
+return an error if an invalid input was given, such as providing an input where the
+metadata says the units are in `meters` instead of a unit of mass.
Review Comment:
I wonder if we could turn this into a code example with DataFusion(Python?)
UDFs to make it more concrete (I can help). Maybe a UDF called `uuid_version`
or `uuid_timestamp` that extracts the embedded version or timestamp off of a
UUID type (and a `uuid()` generating function)? (pyarrow and DuckDB both
understand the arrow.uuid extension type out of the box which facilitates a
nice interchange example where the uuid-ness isn't lost at the edges).
The arbitrary key/value metadata use case is cool too (and I get that it's
the use case that motivated this whole thing from your end!) but it's harder to
find an in-the-wild example where a user can leverage this out of the box. The
places I have run into this in the wild are basically data sources that write
things there (like perhaps rerun) whose provider didn't know about extension
types (e.g., the API Snowflake uses to get data from the server to its Python
connector uses field metadata to communicate the Snowflake type information,
whereas BigQuery's Arrow API uses extension types to communicate its type
information).
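To sketch what that could look like, here is the core of a hypothetical `uuid_version` in plain Python with the standard `uuid` module; a dict stands in for the input Field's metadata map, and the DataFusion UDF wiring is omitted (names are illustrative, not the actual API):

```python
import uuid

# Metadata key the Arrow columnar format reserves for extension type names.
EXTENSION_NAME_KEY = "ARROW:extension:name"

def uuid_version(field_metadata: dict, values: list) -> list:
    # `field_metadata` stands in for the input Field's metadata map that the
    # new interface hands to the UDF; reject non-uuid input up front.
    if field_metadata.get(EXTENSION_NAME_KEY) != "arrow.uuid":
        raise ValueError("uuid_version expects an arrow.uuid column")
    # Each value is the 16-byte binary representation of a UUID; the version
    # is embedded in the value itself.
    return [uuid.UUID(bytes=v).version for v in values]

raw = uuid.uuid4().bytes  # a version-4 (random) UUID as 16 raw bytes
print(uuid_version({EXTENSION_NAME_KEY: "arrow.uuid"}, [raw]))  # [4]
```

The metadata check is exactly the piece the old `DataType`-only interface could not express: the storage type is just fixed-size binary, so without the field metadata the function cannot tell a UUID column from any other 16-byte blob.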
##########
content/blog/2025-06-09-metadata-handling.md:
##########
@@ -0,0 +1,98 @@
+---
+layout: post
+title: Metadata handling in user defined functions
+date: 2025-06-09
+author: Tim Saucer
+categories: [core]
+---
+
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+http://www.apache.org/licenses/LICENSE-2.0
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+[DataFusion 48.0.0] introduced a change in the interface for writing custom functions
+which enables a variety of interesting improvements. Now users can access additional
+data about the input columns to functions, such as their nullability and metadata. This
+enables processing of extension types as well as a wide variety of other use cases.
+
+TODO: UPDATE LINKS
+
+[DataFusion 48.0.0]: https://github.com/apache/datafusion/tree/48.0.0-rc3
+
+# Why metadata handling is important
+
+Data in Arrow record batches carry a `Schema` in addition to the Arrow arrays. Each
+[Field] in this `Schema` contains a name, data type, nullability, and metadata. The
+metadata is specified as a map of key-value pairs of strings. In the new
+implementation, during processing of all user defined functions we pass the input
+field information.
+
+It is often desirable to write a generic function for reuse. With the prior version of
+user defined functions, we only had access to the `DataType` of the input columns. This
+works well for some features that only rely on the types of data. Other use cases may
+need additional information that describes the data.
+
+For example, suppose I write a function that computes the force of gravity on an object
+based on its mass. The general equation is `F = m * g` where `g = 9.8 m/s²`. Suppose
+our documentation for the function specifies the output will be in Newtons. This is only
+valid if the input unit is in kilograms. With our metadata enhancement, we could update
+this function to now evaluate the input units, perform any kind of required
+transformation, and give consistent output every time. We could also have the function
+return an error if an invalid input was given, such as providing an input where the
+metadata says the units are in `meters` instead of a unit of mass.
+
+One common application of metadata handling is understanding the encoding of a blob of data.
+Suppose you have a column that contains image data. You could use metadata to specify
+the encoding of the image data so you could use the appropriate decoder.
+
+[field]: https://arrow.apache.org/docs/format/Glossary.html#term-field
+
+# How to use metadata in user defined functions
+
+Using input metadata occurs in two different phases of a user defined function. During both
+planning and execution, we have access to this field information. This allows
+the user to determine the appropriate output fields during planning and to validate the
+input. For other use cases, it may only be necessary to access these fields during
+execution. We leave this open to the user.
+
+For all types of user defined functions we now evaluate the output [Field] as well. You can
+specify this to create your own metadata from your functions or to pass through metadata
+from one or more of your inputs.
+
+In addition to metadata, the input field information carries nullability. With this you can
+compute more expressive nullability for your output data instead of a single fixed value.
+For example, you could write a function to convert a string to uppercase. If we know the
+input field is non-nullable, then we can set the output field to non-nullable as well.
+
+# Extension types
+
+TODO
+
+# Working with literals
+
+TODO
Review Comment:
The place where I use this is finding my values in optimizer rules (for
example, a `cast(uuid_val, String)` could be replaced with a function that
prettifies the UUID in the way UUIDs are usually prettified). That's perhaps too
complex for this post (perhaps this section doesn't need an example).
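Agreed it may be too much for the post, but for the record the substituted function's body is tiny; a sketch using the standard `uuid` module (`prettify_uuid` is a hypothetical name, not anything in DataFusion):

```python
import uuid

def prettify_uuid(raw: bytes) -> str:
    # The usual prettification: canonical 8-4-4-4-12 hyphenated lowercase form
    # of the 16 raw bytes.
    return str(uuid.UUID(bytes=raw))

print(prettify_uuid(bytes(range(16))))  # 00010203-0405-0607-0809-0a0b0c0d0e0f
```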
##########
content/blog/2025-06-09-metadata-handling.md:
##########
@@ -0,0 +1,98 @@
+---
+layout: post
+title: Metadata handling in user defined functions
Review Comment:
Maybe: "Field metadata and extension type support in user defined functions"?
##########
content/blog/2025-06-09-metadata-handling.md:
##########
@@ -0,0 +1,98 @@
+---
+layout: post
+title: Metadata handling in user defined functions
+date: 2025-06-09
+author: Tim Saucer
+categories: [core]
+---
+
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+http://www.apache.org/licenses/LICENSE-2.0
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+[DataFusion 48.0.0] introduced a change in the interface for writing custom functions
+which enables a variety of interesting improvements. Now users can access additional
+data about the input columns to functions, such as their nullability and metadata. This
+enables processing of extension types as well as a wide variety of other use cases.
+
+TODO: UPDATE LINKS
+
+[DataFusion 48.0.0]: https://github.com/apache/datafusion/tree/48.0.0-rc3
+
+# Why metadata handling is important
+
+Data in Arrow record batches carry a `Schema` in addition to the Arrow arrays. Each
+[Field] in this `Schema` contains a name, data type, nullability, and metadata. The
+metadata is specified as a map of key-value pairs of strings. In the new
+implementation, during processing of all user defined functions we pass the input
+field information.
+
+It is often desirable to write a generic function for reuse. With the prior version of
+user defined functions, we only had access to the `DataType` of the input columns. This
+works well for some features that only rely on the types of data. Other use cases may
+need additional information that describes the data.
+
+For example, suppose I write a function that computes the force of gravity on an object
+based on its mass. The general equation is `F = m * g` where `g = 9.8 m/s²`. Suppose
+our documentation for the function specifies the output will be in Newtons. This is only
+valid if the input unit is in kilograms. With our metadata enhancement, we could update
+this function to now evaluate the input units, perform any kind of required
+transformation, and give consistent output every time. We could also have the function
+return an error if an invalid input was given, such as providing an input where the
+metadata says the units are in `meters` instead of a unit of mass.
+
+One common application of metadata handling is understanding the encoding of a blob of data.
+Suppose you have a column that contains image data. You could use metadata to specify
+the encoding of the image data so you could use the appropriate decoder.
+
+[field]: https://arrow.apache.org/docs/format/Glossary.html#term-field
+
+# How to use metadata in user defined functions
+
+Using input metadata occurs in two different phases of a user defined function. During both
+planning and execution, we have access to this field information. This allows
+the user to determine the appropriate output fields during planning and to validate the
+input. For other use cases, it may only be necessary to access these fields during
+execution. We leave this open to the user.
+
+For all types of user defined functions we now evaluate the output [Field] as well. You can
+specify this to create your own metadata from your functions or to pass through metadata
+from one or more of your inputs.
+
+In addition to metadata, the input field information carries nullability. With this you can
+compute more expressive nullability for your output data instead of a single fixed value.
+For example, you could write a function to convert a string to uppercase. If we know the
+input field is non-nullable, then we can set the output field to non-nullable as well.
+
Review Comment:
Perhaps the first example could be high-level Python (where pyarrow takes
care of the field metadata automagically), and this example could be Rust
(where we'd have to check the content of the fields and/or assign them).
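Whatever language we pick, the logic is "derive the output Field from the input Fields"; a language-neutral sketch in plain Python (the dataclass stands in for an Arrow `Field`; the hook name and shape are illustrative, not the actual DataFusion API):

```python
from dataclasses import dataclass, field

@dataclass
class Field:
    # Plain stand-in for an Arrow Field: name, data type, nullability, metadata.
    name: str
    data_type: str
    nullable: bool
    metadata: dict = field(default_factory=dict)

def upper_output_field(inputs: list) -> Field:
    # Output-field hook for an `upper(str)`-style UDF: the result is null
    # exactly when the input is null, so propagate the input's nullability
    # instead of declaring the output unconditionally nullable.
    (arg,) = inputs
    return Field(name="upper", data_type="Utf8", nullable=arg.nullable)

out = upper_output_field([Field("s", "Utf8", nullable=False)])
print(out.nullable)  # False
```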
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]