(datafusion-site) 01/01: Starting draft for blog post about metadata handling

timsaucer Sun, 08 Jun 2025 13:48:40 -0700

This is an automated email from the ASF dual-hosted git repository.

timsaucer pushed a commit to branch site/metadata-handling
in repository https://gitbox.apache.org/repos/asf/datafusion-site.git


commit b07bfc1b1146015041c2a8d589fcbf558d5ee415
Author: Tim Saucer <timsau...@gmail.com>
AuthorDate: Sun Jun 8 16:44:54 2025 -0400

    Starting draft for blog post about metadata handling
---
 content/blog/2025-06-09-metadata-handling.md | 98 ++++++++++++++++++++++++++++
 1 file changed, 98 insertions(+)

diff --git a/content/blog/2025-06-09-metadata-handling.md 
b/content/blog/2025-06-09-metadata-handling.md
new file mode 100644
index 0000000..97e3f32
--- /dev/null
+++ b/content/blog/2025-06-09-metadata-handling.md
@@ -0,0 +1,98 @@
+---
+layout: post
+title: Metadata handling in user defined functions
+date: 2025-06-09
+author: Tim Saucer
+categories: [core]
+---
+
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+http://www.apache.org/licenses/LICENSE-2.0
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}x
+-->
+
+[DataFusion 48.0.0] introduced a change in the interface for writing custom 
functions
+which enables a variety of interesting improvements. Now users can access 
additional
+data about the input columns to functions, such as their nullability and 
metadata. This
+enables processing of extension types as well as a wide variety of other use 
cases.
+
+TODO: UPDATE LINKS
+
+[DataFusion 48.0.0]: https://github.com/apache/datafusion/tree/48.0.0-rc3
+
+# Why metadata handling is important
+
+Data in Arrow record batches carry a `Schema` in addition to the Arrow arrays. 
Each
+[Field] in this `Schema` contains a name, data type, nullability, and 
metadata. The
+metadata is specified as a map of key-value pairs of strings.  In the new
+implementation, during processing of all user defined functions we pass the 
input
+field information.
+
+It is often desirable to write a generic function for reuse. With the prior 
version of
+user defined functions, we only had access to the `DataType` of the input 
columns. This
+works well for some features that only rely on the types of data. Other use 
cases may
+need additional information that describes the data.
+
+For example, suppose I write a function that computes the force of gravity on 
an object
+based on it's mass. The general equation is `F = m * g` where `g = 9.8 m/s`. 
Suppose
+our documentation for the function specifies the output will be in Newtons. 
This is only
+valid if the input unit is in kilograms. With our metadata enhancement, we 
could update
+this function to now evaluate the input units, perform any kind of required
+transformation, and give consistent output every time. We could also have the 
function
+return an error if an invalid input was given, such as providing an input 
where the
+metadata says the units are in `meters` instead of a unit of mass.
+
+One common application of metadata handling is understanding encoding of a 
blob of data.
+Suppose you have a column that contains image data. You could use metadata to 
specify
+the encoding of the image data so you could use the appropriate decoder.
+
+[field]: https://arrow.apache.org/docs/format/Glossary.html#term-field
+
+# How to use metadata in user defined functions
+
+Using input metadata occurs in two different phases of a user defined 
function. Both during
+the planning phase and execution, we have access to these field information. 
This allows
+the user to determine the appropriate output fields during planning and to 
validate the
+input. For other use cases, it may only be necessary to access these fields 
during execution.
+We leave this open to the user.
+
+For all types of user defined functions we now evaluate the output [Field] as 
well. You can
+specify this to create your own metadata from your functions or to pass 
through metadata from
+one or more of your inputs.
+
+In addition to metadata the input field information carries nullability. With 
these you can
+create more expressive nullability of your output data instead of having a 
single output.
+For example, you could write a function to convert a string to uppercase. If 
we know the
+input field is non-nullable, then we can set the output field to non-nullable 
as well.
+
+# Extension types
+
+TODO
+
+# Working with literals
+
+TODO
+
+# Thanks to our sponsor
+
+We would like to thank [Rerun.io] for sponsoring the development of this work. 
[Rerun.io]
+is building a data visualization system for Physical AI and uses metadata to 
specify 
+context about columns in Arrow record batches.
+
+[Rerun.io]: https://rerun.io
+
+# Conclusion
+
+TODO


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@datafusion.apache.org
For additional commands, e-mail: commits-h...@datafusion.apache.org

(datafusion-site) 01/01: Starting draft for blog post about metadata handling

Reply via email to