This is an automated email from the ASF dual-hosted git repository.
github-bot pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/datafusion-site.git
The following commit(s) were added to refs/heads/asf-site by this push:
new cf25209 Commit build products
cf25209 is described below
commit cf252093d678a2c8956efbaff7e0699879684252
Author: Build Pelican (action) <[email protected]>
AuthorDate: Tue Sep 23 17:11:48 2025 +0000
Commit build products
---
.../09/21/custom-types-using-metadata/index.html | 370 +++++++++++++++++++++
...-dunningtonwherobots-andrew-lambinfluxdata.html | 64 ++++
output/category/blog.html | 32 ++
output/feed.xml | 24 +-
output/feeds/all-en.atom.xml | 272 ++++++++++++++-
output/feeds/blog.atom.xml | 272 ++++++++++++++-
...ningtonwherobots-andrew-lambinfluxdata.atom.xml | 272 +++++++++++++++
...nningtonwherobots-andrew-lambinfluxdata.rss.xml | 24 ++
.../metadata-handling/arrow_record_batch.png | Bin 0 -> 224968 bytes
output/index.html | 41 +++
10 files changed, 1368 insertions(+), 3 deletions(-)
diff --git a/output/2025/09/21/custom-types-using-metadata/index.html
b/output/2025/09/21/custom-types-using-metadata/index.html
new file mode 100644
index 0000000..816136c
--- /dev/null
+++ b/output/2025/09/21/custom-types-using-metadata/index.html
@@ -0,0 +1,370 @@
+<!doctype html>
+<html class="no-js" lang="en" dir="ltr">
+ <head>
+ <meta charset="utf-8">
+ <meta http-equiv="x-ua-compatible" content="ie=edge">
+ <meta name="viewport" content="width=device-width, initial-scale=1.0">
+ <title>Implementing User Defined Types and Custom Metadata in DataFusion -
Apache DataFusion Blog</title>
+<link href="/blog/css/bootstrap.min.css" rel="stylesheet">
+<link href="/blog/css/fontawesome.all.min.css" rel="stylesheet">
+<link href="/blog/css/headerlink.css" rel="stylesheet">
+<link href="/blog/highlight/default.min.css" rel="stylesheet">
+<link href="/blog/css/app.css" rel="stylesheet">
+<script src="/blog/highlight/highlight.js"></script>
+<script>hljs.highlightAll();</script> </head>
+ <body class="d-flex flex-column h-100">
+ <main class="flex-shrink-0">
+<!-- nav bar -->
+<nav class="navbar navbar-expand-lg navbar-dark bg-dark" aria-label="Fifth
navbar example">
+ <div class="container-fluid">
+ <a class="navbar-brand" href="/blog"><img
src="/blog/images/logo_original4x.png" style="height: 32px;"/> Apache
DataFusion Blog</a>
+ <button class="navbar-toggler" type="button" data-bs-toggle="collapse"
data-bs-target="#navbarADP" aria-controls="navbarADP" aria-expanded="false"
aria-label="Toggle navigation">
+ <span class="navbar-toggler-icon"></span>
+ </button>
+
+ <div class="collapse navbar-collapse" id="navbarADP">
+ <ul class="navbar-nav me-auto mb-2 mb-lg-0">
+ <li class="nav-item">
+ <a class="nav-link" href="/blog/about.html">About</a>
+ </li>
+ <li class="nav-item">
+ <a class="nav-link" href="/blog/feed.xml">RSS</a>
+ </li>
+ </ul>
+ </div>
+ </div>
+</nav>
+<!-- article contents -->
+<div id="contents">
+ <div class="bg-white p-4 p-md-5 rounded">
+ <div class="row justify-content-center">
+ <div class="col-12 col-md-8 main-content">
+ <h1>
+ Implementing User Defined Types and Custom Metadata in DataFusion
+ </h1>
+ <p>Posted on: Sun 21 September 2025 by Tim Saucer(rerun.io), Dewey
Dunnington(Wherobots), Andrew Lamb(InfluxData)</p>
+
+ <aside class="toc-container d-md-none mb-2">
+ <div class="toc"><span class="toctitle">Contents</span><ul>
+<li><a href="#user-defined-types-extension-types">User defined types ==
extension types</a></li>
+<li><a href="#metadata-in-apache-arrow-fields">Metadata in Apache Arrow
Fields</a></li>
+<li><a href="#metadata-handling">Metadata handling</a></li>
+<li><a href="#how-to-use-metadata-in-user-defined-functions">How to use
metadata in user defined functions</a></li>
+<li><a href="#extension-types">Extension types</a></li>
+<li><a href="#other-use-cases">Other use cases</a></li>
+<li><a href="#acknowledgements">Acknowledgements</a></li>
+<li><a href="#conclusion">Conclusion</a></li>
+<li><a href="#get-involved">Get Involved</a></li>
+</ul>
+</div>
+ </aside>
+
+ <!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+http://www.apache.org/licenses/LICENSE-2.0
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+<p><a
href="https://datafusion.apache.org/blog/2025/07/16/datafusion-48.0.0/">Apache
DataFusion</a> significantly improves support for user
+defined types and metadata. The user defined function APIs let users access
+metadata on the input columns to functions and produce metadata in the
output.</p>
+<h2 id="user-defined-types-extension-types">User defined types == extension
types<a class="headerlink" href="#user-defined-types-extension-types"
title="Permanent link">¶</a></h2>
+<p>DataFusion directly uses <a href="https://arrow.apache.org">Apache Arrow</a>'s <a href="https://docs.rs/arrow/latest/arrow/datatypes/enum.DataType.html">DataTypes</a> as its type system. This
+has several benefits: it is simple to explain, supports a rich set of
+both scalar and nested types, offers true zero copy interoperability with other Arrow
+implementations, and has world-class library support (via <a href="https://github.com/apache/arrow-rs">arrow-rs</a>). However, one
+challenge of directly using the Arrow type system is that there is no distinction
+between logical types and physical types. For example, the Arrow type system
+contains multiple types which can store "String"s (sequences of UTF8 encoded
+bytes) such as <code>Utf8</code>, <code>LargeUtf8</code>, <code>Dictionary(Utf8)</code>, and <code>Utf8View</code>.</p>
+<p>However, Apache Arrow does provide <a
href="https://arrow.apache.org/docs/format/Columnar.html#format-metadata-extension-types">extension
types</a>, a version of logical type
+information, which describe how to interpret data stored in one of the existing
+physical types. With the improved support for metadata in DataFusion 48.0.0, it
+is now easier to implement user defined types using Arrow extension types.</p>
+<h2 id="metadata-in-apache-arrow-fields">Metadata in Apache Arrow
<code>Field</code>s<a class="headerlink"
href="#metadata-in-apache-arrow-fields" title="Permanent link">¶</a></h2>
+<p>The <a href="https://arrow.apache.org/docs/format/Columnar.html">Arrow
specification</a> defines Metadata as a map of key-value pairs of
+strings. This metadata is used to attach extension types and use case-specific
+context to a column of values. The Rust implementation of Apache Arrow,
+<a href="https://github.com/apache/arrow-rs">arrow-rs</a>, stores metadata on
<a
href="https://arrow.apache.org/docs/format/Glossary.html#term-field">Field</a>s,
but prior to DataFusion 48.0.0, many of
+DataFusion's internal APIs used <a
href="https://docs.rs/arrow/latest/arrow/datatypes/enum.DataType.html">DataTypes</a>
directly, and thus did not propagate
+metadata through all operations.</p>
+<p>In previous versions of DataFusion, <code>Field</code> metadata was propagated through certain
+operations (e.g., renaming or selecting a column) but not through
+others (e.g., scalar, window, or aggregate function calls). In DataFusion 48.0.0
+and later, all user defined functions are passed the full
+input <code>Field</code> information and can return <code>Field</code> information to the caller.</p>
+<p>Supporting extension types was a key motivation for adding metadata to
+function processing, but the same mechanism can store arbitrary metadata on the
+input and output fields, which supports other interesting use cases that we describe
+later in this post.</p>
+<h2 id="metadata-handling">Metadata handling<a class="headerlink"
href="#metadata-handling" title="Permanent link">¶</a></h2>
+<p>Arrow record batches carry a <a href="https://docs.rs/arrow/latest/arrow/datatypes/struct.Schema.html">Schema</a> in addition to the Arrow arrays. Each
+<a
href="https://arrow.apache.org/docs/format/Glossary.html#term-field">Field</a>
in this <code>Schema</code> contains a name, data type, nullability, and
metadata. The
+metadata is specified as a map of key-value pairs of strings. In the new
+implementation, during processing of all user defined functions we pass the
input
+field information.</p>
+<figure>
+<img alt="Relationship between a Record Batch, its schema, and the underlying arrays. There is a one to one relationship between each Field in the Schema and Array entry in the Columns." class="img-responsive" src="/blog/images/metadata-handling/arrow_record_batch.png" width="100%"/>
+<figcaption>
+<b>Figure 1:</b> Relationship between a Record Batch, its schema, and the underlying arrays. There is a one to one relationship between each Field in the Schema and Array entry in the Columns.
+ </figcaption>
+</figure>
+<p>It is often desirable to write a generic function for reuse. Prior versions
of
+user defined functions only had access to the <code>DataType</code> of the
input columns.
+This works well for some features that only rely on the types of data, but
other
+use cases may need additional information that describes the data.</p>
+<p>For example, suppose we wish to write a function that takes in a UUID and returns a string
+of the <a href="https://www.ietf.org/rfc/rfc9562.html#section-4.1">variant</a> of the input field. We would want this function to be able to handle
+all of the string types and also a binary encoded UUID. Because the Arrow specification does not
+contain an unsigned 128 bit integer type, it is common to encode a UUID as a fixed sized binary
+array where each element is 16 bytes long. With the metadata handling in <a href="https://datafusion.apache.org/blog/2025/07/16/datafusion-48.0.0/">DataFusion 48.0.0</a>
+we can validate during planning that the input data not only has the correct underlying
+data type, but that it also represents the right <em>kind</em> of data. The UUID example is a
+common one, and it is included in the <a href="https://arrow.apache.org/docs/format/CanonicalExtensions.html">canonical extension types</a> that are now
+supported in DataFusion.</p>
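+<p>To make the UUID example concrete, here is a minimal sketch of the variant lookup itself, working on the raw 16 bytes of a single value rather than on an Arrow array. The function name <code>uuid_variant</code> and the returned labels are hypothetical and are not part of DataFusion or arrow-rs; the bit patterns follow the variant table in RFC 9562 section 4.1.</p>

```rust
/// Returns a label for the variant of a 16-byte UUID.
/// Hypothetical helper for illustration; not a DataFusion or arrow-rs API.
fn uuid_variant(bytes: &[u8; 16]) -> &'static str {
    // The variant is encoded in the most significant bits of octet 8.
    let octet = bytes[8];
    if octet & 0b1000_0000 == 0 {
        // Bit pattern 0xxx
        "NCS (reserved)"
    } else if octet & 0b1100_0000 == 0b1000_0000 {
        // Bit pattern 10xx
        "RFC 9562"
    } else if octet & 0b1110_0000 == 0b1100_0000 {
        // Bit pattern 110x
        "Microsoft (reserved)"
    } else {
        // Bit pattern 111x
        "Future (reserved)"
    }
}
```

+<p>A real user defined function would apply this to every element of the <code>FixedSizeBinary(16)</code> input array and build a string array from the results.</p>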
+<p>Another common application of metadata handling is understanding the encoding of a blob of data.
+Suppose you have a column that contains image data. Most likely this data is
stored as
+an array of <code>u8</code> data. Without knowing a priori what the encoding
of that blob of data is,
+you cannot ensure you are using the correct methods for decoding it. You may
work around
+this by adding another column to your data source indicating the encoding, but
this can be
+wasteful for systems where the encoding never changes. Instead, you could use
metadata to
+specify the encoding for the entire column.</p>
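+<p>As a rough sketch of this idea, the snippet below reads a column-level encoding from a metadata map. It assumes a plain <code>HashMap&lt;String, String&gt;</code> standing in for the map that an Arrow <code>Field</code> exposes. The key <code>encoding_type</code>, the values <code>png</code> and <code>jpeg</code>, and the names <code>ImageEncoding</code> and <code>encoding_from_metadata</code> are illustrative assumptions, not an established convention.</p>

```rust
use std::collections::HashMap;

/// Encodings this hypothetical function knows how to decode.
#[derive(Debug, PartialEq)]
enum ImageEncoding {
    Png,
    Jpeg,
}

/// Looks up the encoding for an entire column from its field metadata.
/// The key and values are assumptions made for this sketch.
fn encoding_from_metadata(
    metadata: &HashMap<String, String>,
) -> Result<ImageEncoding, String> {
    match metadata.get("encoding_type").map(String::as_str) {
        Some("png") => Ok(ImageEncoding::Png),
        Some("jpeg") => Ok(ImageEncoding::Jpeg),
        Some(other) => Err(format!("unsupported encoding: {other}")),
        None => Err("column metadata is missing `encoding_type`".to_string()),
    }
}
```

+<p>Because the encoding is stored once per column, the lookup happens once per function call rather than once per row.</p>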
+<h2 id="how-to-use-metadata-in-user-defined-functions">How to use metadata in
user defined functions<a class="headerlink"
href="#how-to-use-metadata-in-user-defined-functions" title="Permanent
link">¶</a></h2>
+<p>When working with metadata for <a
href="https://docs.rs/datafusion/latest/datafusion/logical_expr/trait.ScalarUDFImpl.html">user
defined scalar functions</a>, there are typically two
+places in the function definition that require implementation.</p>
+<ul>
+<li>Computing the return field from the arguments</li>
+<li>Invocation</li>
+</ul>
+<p>During planning, we will attempt to call the function <a
href="https://docs.rs/datafusion/latest/datafusion/logical_expr/trait.ScalarUDFImpl.html#method.return_field_from_args">return_field_from_args()</a>.
This will
+provide a list of input fields to the function and return the output field. To
evaluate
+metadata on the input side, you can write a function similar to this example:</p>
+<pre><code class="language-rust">fn return_field_from_args(
+    &self,
+    args: ReturnFieldArgs,
+) -> datafusion::common::Result<FieldRef> {
+    if args.arg_fields.len() != 1 {
+        return exec_err!("Incorrect number of arguments for uuid_version");
+    }
+
+    let input_field = &args.arg_fields[0];
+    if &DataType::FixedSizeBinary(16) == input_field.data_type() {
+        let Ok(CanonicalExtensionType::Uuid(_)) = input_field.try_canonical_extension_type()
+        else {
+            return exec_err!("Input field must contain the UUID canonical extension type");
+        };
+    }
+
+    let is_nullable = args.arg_fields[0].is_nullable();
+
+    Ok(Arc::new(Field::new(self.name(), DataType::UInt32, is_nullable)))
+}
+</code></pre>
+<p>In this example, we take advantage of the existing support for extension
+types, which evaluates the metadata for us. If you instead need to check metadata that the
+extension type helpers do not cover, you could read the metadata map directly with a snippet such as:</p>
+<pre><code class="language-rust">    if &DataType::FixedSizeBinary(16) == input_field.data_type() {
+        // Extension type names are stored under the "ARROW:extension:name" key
+        let _ = input_field
+            .metadata()
+            .get("ARROW:extension:name")
+            .ok_or_else(|| exec_datafusion_err!("Input field must contain the UUID canonical extension type"))?;
+    }
+</code></pre>
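+<p>Checking that a key is present does not confirm which extension type the field carries. The sketch below also compares the value, assuming a plain <code>HashMap&lt;String, String&gt;</code> in place of the map returned by <code>Field::metadata()</code>. The key <code>ARROW:extension:name</code> and the name <code>arrow.uuid</code> come from the Arrow format documentation; the constant and function names are hypothetical.</p>

```rust
use std::collections::HashMap;

/// Metadata key under which Arrow stores the extension type name.
const EXTENSION_NAME_KEY: &str = "ARROW:extension:name";
/// Name of the canonical UUID extension type.
const UUID_EXTENSION_NAME: &str = "arrow.uuid";

/// Returns true only when the metadata names the canonical UUID extension type.
/// Hypothetical helper for illustration; not a DataFusion or arrow-rs API.
fn is_uuid_extension(metadata: &HashMap<String, String>) -> bool {
    metadata.get(EXTENSION_NAME_KEY).map(String::as_str) == Some(UUID_EXTENSION_NAME)
}
```

+<p>In practice the <code>try_canonical_extension_type()</code> helper shown earlier is the simpler choice for canonical types; a manual check like this is mainly useful for custom keys.</p>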
+<p>If you are writing a user defined function that will instead return
metadata on output
+you can add this directly into the <code>Field</code> that is the output of
the <code>return_field_from_args</code>
+call. In our above example, we could change the return line to:</p>
+<pre><code class="language-rust">    Ok(Arc::new(
+        Field::new(self.name(), DataType::UInt32, is_nullable).with_metadata(
+            [("my_key".to_string(), "my_value".to_string())]
+                .into_iter()
+                .collect(),
+        ),
+    ))
+</code></pre>
+<p>By checking the metadata during the planning process, we can identify
errors early in
+the query process. There are cases where we wish to have access to this metadata during
+execution as well. The function <a
href="https://docs.rs/datafusion/latest/datafusion/logical_expr/trait.ScalarUDFImpl.html#tymethod.invoke_with_args">invoke_with_args</a>
in the user defined function takes
+the updated struct <a
href="https://docs.rs/datafusion/latest/datafusion/logical_expr/struct.ScalarFunctionArgs.html">ScalarFunctionArgs</a>.
This now contains the input fields, which can
+be used to check for metadata. For example, you can do the following:</p>
+<pre><code class="language-rust">fn invoke_with_args(&self, args: ScalarFunctionArgs) -> Result<ColumnarValue> {
+    assert_eq!(args.arg_fields.len(), 1);
+    let my_value = args.arg_fields[0]
+        .metadata()
+        .get("encoding_type");
+    ...
+</code></pre>
+<p>In this snippet we have extracted an <code>Option<&String></code> from the input field metadata,
+which we can then use to determine which functions we might want to call. We could
+then parse the returned value to determine what type of encoding to use when
+evaluating the array in the arguments. Because <code>return_field_from_args</code> takes <code>&self</code> rather than <code>&mut self</code>,
+the function cannot store this value during planning, so the lookup is repeated at execution time.</p>
+<p>The description in this section applies to scalar user defined functions,
but equivalent
+support exists for aggregate and window functions.</p>
+<h2 id="extension-types">Extension types<a class="headerlink"
href="#extension-types" title="Permanent link">¶</a></h2>
+<p>Extension types are one of the primary motivations for this enhancement in
+<a href="https://datafusion.apache.org/blog/2025/07/16/datafusion-48.0.0/">DataFusion 48.0.0</a>. The official Rust implementation of Apache Arrow, <a href="https://github.com/apache/arrow-rs">arrow-rs</a>,
+already contains support for the <a
href="https://arrow.apache.org/docs/format/CanonicalExtensions.html">canonical
extension types</a>. This support includes
+helper functions such as <code>try_canonical_extension_type()</code> in the
earlier example.</p>
+<p>For a concrete example of how extension types can be used in DataFusion
functions,
+there is an <a
href="https://github.com/timsaucer/datafusion_extension_type_examples">example
repository</a> that demonstrates using UUIDs. The UUID extension
+type specifies that the data are stored as a Fixed Size Binary of length 16.
In the
+DataFusion core functions, we have the ability to generate string
representations of
+UUIDs that match the version 4 specification. These are helpful, but a user may
+wish to do additional work with UUIDs where having them in the dense
representation
+is preferable. Alternatively, the user may already have data with the binary encoding
+and want to extract values such as the version, timestamp, or string
+representation.</p>
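+<p>For a sense of what extracting these values from the dense representation involves, here is a minimal sketch that operates on the raw 16 bytes of one UUID. The function names are hypothetical and this is not the code from the example repository. The field layout follows RFC 9562: the version is the high nibble of octet 6, and version 7 UUIDs store a Unix timestamp in milliseconds in the first 48 bits.</p>

```rust
/// Returns the version number stored in the high nibble of octet 6.
/// Hypothetical helper for illustration only.
fn uuid_version(bytes: &[u8; 16]) -> u8 {
    bytes[6] >> 4
}

/// For version 7 UUIDs, returns the Unix timestamp in milliseconds held in
/// the first 48 bits (big-endian). Returns None for any other version.
fn uuid_v7_unix_millis(bytes: &[u8; 16]) -> Option<u64> {
    if uuid_version(bytes) != 7 {
        return None;
    }
    // Left-pad the 6 timestamp bytes to 8 bytes so they parse as a u64.
    let mut buf = [0u8; 8];
    buf[2..].copy_from_slice(&bytes[..6]);
    Some(u64::from_be_bytes(buf))
}
```

+<p>Working on the binary form avoids parsing a 36 character string for every row, which is one reason the dense representation can be preferable.</p>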
+<p>In the example repository we have created three user defined functions:
<code>UuidVersion</code>,
+<code>StringToUuid</code>, and <code>UuidToString</code>. Each of these implements <code>ScalarUDFImpl</code> and can
+be used as follows:</p>
+<pre><code class="language-rust">#[tokio::main]
+async fn main() -> Result<()> {
+    let ctx = create_context()?;
+
+    // get a DataFrame from the context
+    let mut df = ctx.table("t").await?;
+
+    // Create the string UUIDs
+    df = df.select(vec![uuid().alias("string_uuid")])?;
+
+    // Convert string UUIDs to canonical extension UUIDs
+    let string_to_uuid = ScalarUDF::new_from_impl(StringToUuid::default());
+    df = df.with_column("uuid", string_to_uuid.call(vec![col("string_uuid")]))?;
+
+    // Extract version number from canonical extension UUIDs
+    let version = ScalarUDF::new_from_impl(UuidVersion::default());
+    df = df.with_column("version", version.call(vec![col("uuid")]))?;
+
+    // Convert back to a string
+    let uuid_to_string = ScalarUDF::new_from_impl(UuidToString::default());
+    df = df.with_column("string_round_trip", uuid_to_string.call(vec![col("uuid")]))?;
+
+    df.show().await?;
+
+    Ok(())
+}
+</code></pre>
+<p>The <a
href="https://github.com/timsaucer/datafusion_extension_type_examples">example
repository</a> also contains a crate that demonstrates how to expose these
+UDFs to <a href="https://datafusion.apache.org/python/">datafusion-python</a>.
This requires version 48.0.0 or later.</p>
+<h2 id="other-use-cases">Other use cases<a class="headerlink"
href="#other-use-cases" title="Permanent link">¶</a></h2>
+<p>The metadata attached to the fields can be used to store <em>any</em> user
data in key/value
+pairs. Some of the other use cases that have been identified include:</p>
+<ul>
+<li>Creating output for downstream systems. One user of DataFusion produces
+ <a href="https://rerun.io/blog/column-chunks">data visualizations</a> that are dependent upon metadata in record batch fields. By
+ enabling metadata on output of user defined functions, we can now produce
batches
+ that are directly consumable by these systems.</li>
+<li>Describing the relationships between columns of data. You can store data about how
+ one column of data relates to another and use these during function
evaluation. For
+ example, in robotics it is common to use <a
href="https://wiki.ros.org/tf2">transforms</a> to describe how to convert
+ from one coordinate system to another. It can be convenient to send the
function
+ all the columns that contain transform information and then allow the
function
+ to determine which columns to use based on the metadata. This allows for
+ encapsulation of the transform logic within the user function.</li>
+<li>Storing logical types of the data model. <a
href="https://docs.influxdata.com/influxdb/v1/concepts/schema_and_data_layout/">InfluxDB</a>
uses field metadata to specify
+ which columns are used for tags, times, and fields.</li>
+</ul>
+<p>Based on the experience of the authors, we recommend caution when using
metadata
+for use cases other than type extension. One issue that can arise is that as columns
+are used to compute new fields, some functions may pass through the metadata
and the
+semantic meaning may change. For example, suppose you decided to use metadata
to
+store some kind of statistics for the entire stream of record batches. Then
you pass
+that column through a filter that removes many rows of data. Your statistics
+metadata may now be invalid, even though it was passed through the filter.</p>
+<p>Similarly, if you use metadata to form relations between one column and
another and
+the naming of the columns has changed at some point in your workflow, then the metadata
+may refer to the wrong column of data. This can be mitigated by
+not relying on column naming but rather adding additional metadata to all
columns of
+interest.</p>
+<h2 id="acknowledgements">Acknowledgements<a class="headerlink"
href="#acknowledgements" title="Permanent link">¶</a></h2>
+<p>We would like to thank <a href="https://rerun.io">Rerun.io</a> for
sponsoring the development of this work. <a href="https://rerun.io">Rerun.io</a>
+is building a data visualization system for Physical AI and uses metadata to
specify
+context about columns in Arrow record batches.</p>
+<h2 id="conclusion">Conclusion<a class="headerlink" href="#conclusion"
title="Permanent link">¶</a></h2>
+<p>The enhanced metadata handling in <a href="https://datafusion.apache.org/blog/2025/07/16/datafusion-48.0.0/">DataFusion 48.0.0</a> is a significant step
+forward in the ability to handle more interesting types of data. Users can
+validate the input data matches the intent of the data to be processed, enable
+complex operations on binary data because we understand the encoding used, and
+use metadata to create new and interesting user defined data types.
+We can't wait to see what you build with it!</p>
+<h2 id="get-involved">Get Involved<a class="headerlink" href="#get-involved"
title="Permanent link">¶</a></h2>
+<p>The DataFusion team is an active and engaging community and we would love
to have you join
+us and help the project.</p>
+<p>Here are some ways to get involved:</p>
+<ul>
+<li>Learn more by visiting the <a
href="https://datafusion.apache.org/index.html">DataFusion</a> project
page.</li>
+<li>Try out the project and provide feedback, file issues, and contribute
code.</li>
+<li>Work on a <a
href="https://github.com/apache/datafusion/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22">good
first issue</a>.</li>
+<li>Reach out to us via the <a
href="https://datafusion.apache.org/contributor-guide/communication.html">communication
doc</a>.</li>
+</ul>
+
+<!--
+ Comments Section
+ Loaded only after explicit visitor consent to comply with ASF policy.
+-->
+
+<div id="comments">
+ <hr>
+ <h3>Comments</h3>
+
+ <!-- Local loader script -->
+ <script src="/content/js/giscus-consent.js" defer></script>
+
+ <!-- Consent UI -->
+ <div id="giscus-consent">
+ <p>
+ We use <a href="https://giscus.app/">Giscus</a> for comments, powered
by GitHub Discussions.
+ To respect your privacy, Giscus and comments will load only if you
click "Show Comments"
+ </p>
+
+ <div class="consent-actions">
+ <button id="giscus-load" type="button">Show Comments</button>
+ <button id="giscus-revoke" type="button" hidden>Hide Comments</button>
+ </div>
+
+ <noscript>JavaScript is required to load comments from Giscus.</noscript>
+ </div>
+
+ <!-- Container where Giscus will render -->
+ <div id="comment-thread"></div>
+</div> </div>
+ <aside class="toc-container d-none d-md-block col-md-4 col-xl-3 ms-xl-2">
+ <div class="toc"><span class="toctitle">Contents</span><ul>
+<li><a href="#user-defined-types-extension-types">User defined types ==
extension types</a></li>
+<li><a href="#metadata-in-apache-arrow-fields">Metadata in Apache Arrow
Fields</a></li>
+<li><a href="#metadata-handling">Metadata handling</a></li>
+<li><a href="#how-to-use-metadata-in-user-defined-functions">How to use
metadata in user defined functions</a></li>
+<li><a href="#extension-types">Extension types</a></li>
+<li><a href="#other-use-cases">Other use cases</a></li>
+<li><a href="#acknowledgements">Acknowledgements</a></li>
+<li><a href="#conclusion">Conclusion</a></li>
+<li><a href="#get-involved">Get Involved</a></li>
+</ul>
+</div>
+ </aside>
+ </div>
+ </div>
+</div>
+ <!-- footer -->
+ <div class="row g-0">
+ <div class="col-12">
+ <p style="font-style: italic; font-size: 0.8rem; text-align: center;">
+ Copyright 2025, <a href="https://www.apache.org/">The Apache
Software Foundation</a>, Licensed under the <a
href="https://www.apache.org/licenses/LICENSE-2.0">Apache License, Version
2.0</a>.<br/>
+ Apache® and the Apache feather logo are trademarks of The Apache
Software Foundation.
+ </p>
+ </div>
+ </div>
+ <script src="/blog/js/bootstrap.bundle.min.js"></script> </main>
+ </body>
+</html>
diff --git
a/output/author/tim-saucerrerunio-dewey-dunningtonwherobots-andrew-lambinfluxdata.html
b/output/author/tim-saucerrerunio-dewey-dunningtonwherobots-andrew-lambinfluxdata.html
new file mode 100644
index 0000000..e03bb2b
--- /dev/null
+++
b/output/author/tim-saucerrerunio-dewey-dunningtonwherobots-andrew-lambinfluxdata.html
@@ -0,0 +1,64 @@
+<!DOCTYPE html>
+<html lang="en">
+<head>
+ <title>Apache DataFusion Blog - Articles by Tim Saucer(rerun.io),
Dewey Dunnington(Wherobots), Andrew Lamb(InfluxData)</title>
+ <meta charset="utf-8" />
+ <meta name="generator" content="Pelican" />
+ <link href="https://datafusion.apache.org/blog/feed.xml"
type="application/rss+xml" rel="alternate" title="Apache DataFusion Blog RSS
Feed" />
+</head>
+
+<body id="index" class="home">
+ <header id="banner" class="body">
+ <h1><a href="https://datafusion.apache.org/blog/">Apache
DataFusion Blog</a></h1>
+ </header><!-- /#banner -->
+ <nav id="menu"><ul>
+ <li><a
href="https://datafusion.apache.org/blog/pages/about.html">About</a></li>
+ <li><a
href="https://datafusion.apache.org/blog/pages/index.html">index</a></li>
+ <li><a
href="https://datafusion.apache.org/blog/category/blog.html">blog</a></li>
+ </ul></nav><!-- /#menu -->
+<section id="content">
+<h2>Articles by Tim Saucer(rerun.io), Dewey Dunnington(Wherobots), Andrew
Lamb(InfluxData)</h2>
+
+<ol id="post-list">
+ <li><article class="hentry">
+ <header> <h2 class="entry-title"><a
href="https://datafusion.apache.org/blog/2025/09/21/custom-types-using-metadata"
rel="bookmark" title="Permalink to Implementing User Defined Types and Custom
Metadata in DataFusion">Implementing User Defined Types and Custom Metadata in
DataFusion</a></h2> </header>
+ <footer class="post-info">
+ <time class="published"
datetime="2025-09-21T00:00:00+00:00"> Sun 21 September 2025 </time>
+ <address class="vcard author">By
+ <a class="url fn"
href="https://datafusion.apache.org/blog/author/tim-saucerrerunio-dewey-dunningtonwherobots-andrew-lambinfluxdata.html">Tim
Saucer(rerun.io), Dewey Dunnington(Wherobots), Andrew Lamb(InfluxData)</a>
+ </address>
+ </footer><!-- /.post-info -->
+ <div class="entry-content"> <!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+http://www.apache.org/licenses/LICENSE-2.0
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+<p><a
href="https://datafusion.apache.org/blog/2025/07/16/datafusion-48.0.0/">Apache
DataFusion</a> significantly improves support for user
+defined types and metadata. The user defined function APIs let users access
+metadata on the input columns to functions and produce metadata in the
output.</p>
+<h2 id="user-defined-types-extension-types">User defined types == extension
types<a class="headerlink" href="#user-defined-types-extension-types"
title="Permanent link">¶</a></h2>
+<p>DataFusion directly uses <a href="https://arrow.apache.org">Apache
Arrow</a>'s <a
href="https://docs.rs/arrow/latest/arrow/datatypes/enum.DataType.html">DataTypes</a>
as its type system. This
+has …</p> </div><!-- /.entry-content -->
+ </article></li>
+</ol><!-- /#posts-list -->
+</section><!-- /#content -->
+ <footer id="contentinfo" class="body">
+ <address id="about" class="vcard body">
+ Proudly powered by <a
href="https://getpelican.com/">Pelican</a>,
+ which takes great advantage of <a
href="https://www.python.org/">Python</a>.
+ </address><!-- /#about -->
+ </footer><!-- /#contentinfo -->
+</body>
+</html>
\ No newline at end of file
diff --git a/output/category/blog.html b/output/category/blog.html
index 1b7eb1a..eb412a0 100644
--- a/output/category/blog.html
+++ b/output/category/blog.html
@@ -21,6 +21,38 @@
<h2>Articles in the blog category</h2>
<ol id="post-list">
+ <li><article class="hentry">
+ <header> <h2 class="entry-title"><a
href="https://datafusion.apache.org/blog/2025/09/21/custom-types-using-metadata"
rel="bookmark" title="Permalink to Implementing User Defined Types and Custom
Metadata in DataFusion">Implementing User Defined Types and Custom Metadata in
DataFusion</a></h2> </header>
+ <footer class="post-info">
+ <time class="published"
datetime="2025-09-21T00:00:00+00:00"> Sun 21 September 2025 </time>
+ <address class="vcard author">By
+ <a class="url fn"
href="https://datafusion.apache.org/blog/author/tim-saucerrerunio-dewey-dunningtonwherobots-andrew-lambinfluxdata.html">Tim
Saucer(rerun.io), Dewey Dunnington(Wherobots), Andrew Lamb(InfluxData)</a>
+ </address>
+ </footer><!-- /.post-info -->
+ <div class="entry-content"> <!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+http://www.apache.org/licenses/LICENSE-2.0
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+<p><a
href="https://datafusion.apache.org/blog/2025/07/16/datafusion-48.0.0/">Apache
DataFusion</a> significantly improves support for user
+defined types and metadata. The user defined function APIs let users access
+metadata on the input columns to functions and produce metadata in the
output.</p>
+<h2 id="user-defined-types-extension-types">User defined types == extension
types<a class="headerlink" href="#user-defined-types-extension-types"
title="Permanent link">¶</a></h2>
+<p>DataFusion directly uses <a href="https://arrow.apache.org">Apache
Arrow</a>'s <a
href="https://docs.rs/arrow/latest/arrow/datatypes/enum.DataType.html">DataTypes</a>
as its type system. This
+has …</p> </div><!-- /.entry-content -->
+ </article></li>
<li><article class="hentry">
<header> <h2 class="entry-title"><a
href="https://datafusion.apache.org/blog/2025/09/16/datafusion-comet-0.10.0"
rel="bookmark" title="Permalink to Apache DataFusion Comet 0.10.0
Release">Apache DataFusion Comet 0.10.0 Release</a></h2> </header>
<footer class="post-info">
diff --git a/output/feed.xml b/output/feed.xml
index 9e545dd..5e9f341 100644
--- a/output/feed.xml
+++ b/output/feed.xml
@@ -1,5 +1,27 @@
<?xml version="1.0" encoding="utf-8"?>
-<rss version="2.0"><channel><title>Apache DataFusion
Blog</title><link>https://datafusion.apache.org/blog/</link><description></description><lastBuildDate>Tue,
16 Sep 2025 00:00:00 +0000</lastBuildDate><item><title>Apache DataFusion Comet
0.10.0
Release</title><link>https://datafusion.apache.org/blog/2025/09/16/datafusion-comet-0.10.0</link><description><!--
+<rss version="2.0"><channel><title>Apache DataFusion
Blog</title><link>https://datafusion.apache.org/blog/</link><description></description><lastBuildDate>Sun,
21 Sep 2025 00:00:00 +0000</lastBuildDate><item><title>Implementing User
Defined Types and Custom Metadata in
DataFusion</title><link>https://datafusion.apache.org/blog/2025/09/21/custom-types-using-metadata</link><description><!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+http://www.apache.org/licenses/LICENSE-2.0
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}x
+-->
+
+<p><a
href="https://datafusion.apache.org/blog/2025/07/16/datafusion-48.0.0/">Apache
DataFusion</a> significantly improves support for user
+defined types and metadata. The user defined function APIs let users access
+metadata on the input columns to functions and produce metadata in the
output.</p>
+<h2 id="user-defined-types-extension-types">User defined types ==
extension types<a class="headerlink"
href="#user-defined-types-extension-types" title="Permanent
link">¶</a></h2>
+<p>DataFusion directly uses <a
href="https://arrow.apache.org">Apache Arrow</a>'s <a
href="https://docs.rs/arrow/latest/arrow/datatypes/enum.DataType.html">DataTypes</a>
as its type system. This
+has …</p></description><dc:creator
xmlns:dc="http://purl.org/dc/elements/1.1/">Tim Saucer(rerun.io), Dewey
Dunnington(Wherobots), Andrew Lamb(InfluxData)</dc:creator><pubDate>Sun, 21 Sep
2025 00:00:00 +0000</pubDate><guid
isPermaLink="false">tag:datafusion.apache.org,2025-09-21:/blog/2025/09/21/custom-types-using-metadata</guid><category>blog</category></item><item><title>Apache
DataFusion Comet 0.10.0
Release</title><link>https://datafusion.apache.org/blog/2025/09/16/datafusion-co
[...]
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
diff --git a/output/feeds/all-en.atom.xml b/output/feeds/all-en.atom.xml
index c27f401..e216160 100644
--- a/output/feeds/all-en.atom.xml
+++ b/output/feeds/all-en.atom.xml
@@ -1,5 +1,275 @@
<?xml version="1.0" encoding="utf-8"?>
-<feed xmlns="http://www.w3.org/2005/Atom"><title>Apache DataFusion
Blog</title><link href="https://datafusion.apache.org/blog/"
rel="alternate"></link><link
href="https://datafusion.apache.org/blog/feeds/all-en.atom.xml"
rel="self"></link><id>https://datafusion.apache.org/blog/</id><updated>2025-09-16T00:00:00+00:00</updated><subtitle></subtitle><entry><title>Apache
DataFusion Comet 0.10.0 Release</title><link
href="https://datafusion.apache.org/blog/2025/09/16/datafusion-comet-0.10.0" r
[...]
+<feed xmlns="http://www.w3.org/2005/Atom"><title>Apache DataFusion
Blog</title><link href="https://datafusion.apache.org/blog/"
rel="alternate"></link><link
href="https://datafusion.apache.org/blog/feeds/all-en.atom.xml"
rel="self"></link><id>https://datafusion.apache.org/blog/</id><updated>2025-09-21T00:00:00+00:00</updated><subtitle></subtitle><entry><title>Implementing
User Defined Types and Custom Metadata in DataFusion</title><link
href="https://datafusion.apache.org/blog/2025/09/21 [...]
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+http://www.apache.org/licenses/LICENSE-2.0
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}x
+-->
+
+<p><a
href="https://datafusion.apache.org/blog/2025/07/16/datafusion-48.0.0/">Apache
DataFusion</a> significantly improves support for user
+defined types and metadata. The user defined function APIs let users access
+metadata on the input columns to functions and produce metadata in the
output.</p>
+<h2 id="user-defined-types-extension-types">User defined types ==
extension types<a class="headerlink"
href="#user-defined-types-extension-types" title="Permanent
link">¶</a></h2>
+<p>DataFusion directly uses <a
href="https://arrow.apache.org">Apache Arrow</a>'s <a
href="https://docs.rs/arrow/latest/arrow/datatypes/enum.DataType.html">DataTypes</a>
as its type system. This
+has …</p></summary><content type="html"><!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+http://www.apache.org/licenses/LICENSE-2.0
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}x
+-->
+
+<p><a
href="https://datafusion.apache.org/blog/2025/07/16/datafusion-48.0.0/">Apache
DataFusion</a> significantly improves support for user
+defined types and metadata. The user defined function APIs let users access
+metadata on the input columns to functions and produce metadata in the
output.</p>
+<h2 id="user-defined-types-extension-types">User defined types ==
extension types<a class="headerlink"
href="#user-defined-types-extension-types" title="Permanent
link">¶</a></h2>
+<p>DataFusion directly uses <a
href="https://arrow.apache.org">Apache Arrow</a>'s <a
href="https://docs.rs/arrow/latest/arrow/datatypes/enum.DataType.html">DataTypes</a>
as its type system. This
+has several benefits: it is simple to explain, supports a rich set of
+both scalar and nested types, offers true zero-copy interoperability with other Arrow
+implementations, and has world-class library support (via <a
href="https://github.com/apache/arrow-rs">arrow-rs</a>). However, one
+challenge of directly using the Arrow type system is that there is no distinction
+between logical types and physical types. For example, the Arrow type system
+contains multiple types which can store "String"s (sequences of UTF-8 encoded
+bytes) such as <code>Utf8</code>,
<code>LargeUtf8</code>, <code>Dictionary(Utf8)</code>,
and <code>Utf8View</code>.</p>
+<p>However, Apache Arrow does provide <a
href="https://arrow.apache.org/docs/format/Columnar.html#format-metadata-extension-types">extension
types</a>, a version of logical type
+information, which describe how to interpret data stored in one of the existing
+physical types. With the improved support for metadata in DataFusion 48.0.0, it
+is now easier to implement user defined types using Arrow extension
types.</p>
+<h2 id="metadata-in-apache-arrow-fields">Metadata in Apache Arrow
<code>Field</code>s<a class="headerlink"
href="#metadata-in-apache-arrow-fields" title="Permanent
link">¶</a></h2>
+<p>The <a
href="https://arrow.apache.org/docs/format/Columnar.html">Arrow
specification</a> defines Metadata as a map of key-value pairs of
+strings. This metadata is used to attach extension types and use case-specific
+context to a column of values. The Rust implementation of Apache Arrow,
+<a href="https://github.com/apache/arrow-rs">arrow-rs</a>, stores
metadata on <a
href="https://arrow.apache.org/docs/format/Glossary.html#term-field">Field</a>s,
but prior to DataFusion 48.0.0, many of
+DataFusion's internal APIs used <a
href="https://docs.rs/arrow/latest/arrow/datatypes/enum.DataType.html">DataTypes</a>
directly, and thus did not propagate
+metadata through all operations.</p>
+<p>In previous versions of DataFusion, <code>Field</code>
metadata was propagated through certain
+operations (e.g., renaming or selecting a column) but not through
+others (e.g., scalar, window, or aggregate function calls). In DataFusion
48.0.0
+and later, all user defined functions are passed the full
+input <code>Field</code> information and can return
<code>Field</code> information to the caller.</p>
+<p>Supporting extension types was a key motivation for adding metadata
to
+function processing, but the same mechanism can store arbitrary metadata on the
+input and output fields, which supports other interesting use cases that we
describe
+later in this post.</p>
+<h2 id="metadata-handling">Metadata handling<a class="headerlink"
href="#metadata-handling" title="Permanent link">¶</a></h2>
+<p>Arrow record batches carry a <a
href="https://docs.rs/arrow/latest/arrow/datatypes/struct.Schema.html">Schema</a>
in addition to the Arrow arrays. Each
+<a
href="https://arrow.apache.org/docs/format/Glossary.html#term-field">Field</a>
in this <code>Schema</code> contains a name, data type,
nullability, and metadata. The
+metadata is specified as a map of key-value pairs of strings. In the new
+implementation, during processing of all user defined functions we pass the
input
+field information.</p>
+<figure>
+<img alt="Relationship between a Record Batch, its schema, and the
underlying arrays. There is a one-to-one relationship between each Field in the
Schema and Array entry in the Columns." class="img-responsive"
src="/blog/images/metadata-handling/arrow_record_batch.png" width="100%"/>
+<figcaption>
+<b>Figure 1:</b> Relationship between a Record Batch, its schema,
and the underlying arrays. There is a one-to-one relationship between each
Field in the Schema and Array entry in the Columns.
+ </figcaption>
+</figure>
+<p>It is often desirable to write a generic function for reuse. Prior
versions of
+user defined functions only had access to the
<code>DataType</code> of the input columns.
+This works well for some features that only rely on the types of data, but
other
+use cases may need additional information that describes the data.</p>
+<p>For example, suppose we wish to write a function that takes in a UUID
and returns a string
+of the <a
href="https://www.ietf.org/rfc/rfc9562.html#section-4.1">variant</a>
of the input field. We would want this function to be able to handle
+all of the string types and also a binary encoded UUID. Because the Arrow
specification does not
+contain an unsigned 128-bit type, it is common to encode a UUID as a fixed
size binary
+array where each element is 16 bytes long. With the metadata handling in
DataFusion 48.0.0
+we can validate during planning that the input data not only has the correct
underlying
+data type, but that it also represents the right <em>kind</em> of
data. The UUID example is a
+common one, and it is included in the <a
href="https://arrow.apache.org/docs/format/CanonicalExtensions.html">canonical
extension types</a> that are now
+supported in DataFusion.</p>
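+<p>As a rough illustration of what such a function computes once it holds the
16 raw bytes, here is a minimal sketch that uses only the Rust standard
library. The helper names <code>uuid_version</code> and
<code>uuid_variant</code> are hypothetical and are not part of DataFusion or
arrow-rs; the bit positions follow RFC 9562.</p>

```rust
/// Returns the UUID version: the high nibble of octet 6 (RFC 9562).
/// Hypothetical helper, not a DataFusion or arrow-rs API.
fn uuid_version(bytes: &[u8; 16]) -> u8 {
    bytes[6] >> 4
}

/// Returns a label for the UUID variant, which is encoded in the most
/// significant bits of octet 8 (RFC 9562, section 4.1).
/// Hypothetical helper, not a DataFusion or arrow-rs API.
fn uuid_variant(bytes: &[u8; 16]) -> &'static str {
    match bytes[8] >> 4 {
        0x0..=0x7 => "NCS (reserved)",
        0x8..=0xB => "RFC 9562",
        0xC..=0xD => "Microsoft (reserved)",
        _ => "future (reserved)",
    }
}

fn main() {
    // 550e8400-e29b-41d4-a716-446655440000 stored as 16 raw bytes,
    // as it would appear in one element of a FixedSizeBinary(16) array.
    let bytes: [u8; 16] = [
        0x55, 0x0e, 0x84, 0x00, 0xe2, 0x9b, 0x41, 0xd4,
        0xa7, 0x16, 0x44, 0x66, 0x55, 0x44, 0x00, 0x00,
    ];
    println!(
        "version = {}, variant = {}",
        uuid_version(&bytes),
        uuid_variant(&bytes)
    );
}
```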
+<p>Another common application of metadata handling is understanding
encoding of a blob of data.
+Suppose you have a column that contains image data. Most likely this data is
stored as
+an array of <code>u8</code> data. Without knowing a priori what
the encoding of that blob of data is,
+you cannot ensure you are using the correct methods for decoding it. You may
work around
+this by adding another column to your data source indicating the encoding, but
this can be
+wasteful for systems where the encoding never changes. Instead, you could use
metadata to
+specify the encoding for the entire column.</p>
+<h2 id="how-to-use-metadata-in-user-defined-functions">How to use
metadata in user defined functions<a class="headerlink"
href="#how-to-use-metadata-in-user-defined-functions" title="Permanent
link">¶</a></h2>
+<p>When working with metadata for <a
href="https://docs.rs/datafusion/latest/datafusion/logical_expr/trait.ScalarUDFImpl.html">user
defined scalar functions</a>, there are typically two
+places in the function definition that require implementation.</p>
+<ul>
+<li>Computing the return field from the arguments</li>
+<li>Invocation</li>
+</ul>
+<p>During planning, DataFusion will call the function <a
href="https://docs.rs/datafusion/latest/datafusion/logical_expr/trait.ScalarUDFImpl.html#method.return_field_from_args">return_field_from_args()</a>.
This
+provides a list of input fields to the function and returns the output field. To
evaluate
+metadata on the input side, you can write a function similar to this
example:</p>
+<pre><code class="language-rust">fn return_field_from_args(
+ &amp;self,
+ args: ReturnFieldArgs,
+) -&gt; datafusion::common::Result&lt;FieldRef&gt; {
+ if args.arg_fields.len() != 1 {
+ return exec_err!("Incorrect number of arguments for uuid_version");
+ }
+
+ let input_field = &amp;args.arg_fields[0];
+ if &amp;DataType::FixedSizeBinary(16) == input_field.data_type() {
+ let Ok(CanonicalExtensionType::Uuid(_)) =
input_field.try_canonical_extension_type()
+ else {
+ return exec_err!("Input field must contain the UUID canonical
extension type");
+ };
+ }
+
+ let is_nullable = args.arg_fields[0].is_nullable();
+
+ Ok(Arc::new(Field::new(self.name(), DataType::UInt32, is_nullable)))
+}
+</code></pre>
+<p>In this example, we take advantage of the fact that we already have
support for extension
+types that evaluate metadata. If you wanted to check for metadata
other than
+extension type support, you could instead write a snippet such
as:</p>
+<pre><code class="language-rust"> if
&amp;DataType::FixedSizeBinary(16) == input_field.data_type() {
+ let _ = input_field
+ .metadata()
+ .get("ARROW:extension:metadata")
+ .ok_or(exec_datafusion_err!("Input field must contain the UUID
canonical extension type"))?;
+ }
+</code></pre>
+<p>If you are writing a user defined function that will instead return
metadata on output
+you can add this directly into the <code>Field</code> that is the
output of the <code>return_field_from_args</code>
+call. In our above example, we could change the return line to:</p>
+<pre><code class="language-rust"> Ok(Arc::new(
+ Field::new(self.name(), DataType::UInt32, is_nullable).with_metadata(
+ [("my_key".to_string(), "my_value".to_string())]
+ .into_iter()
+ .collect(),
+ ),
+ ))
+</code></pre>
+<p>By checking the metadata during the planning process, we can identify
errors early in
+the query process. There are cases where we wish to have access to this
metadata during
+execution as well. The function <a
href="https://docs.rs/datafusion/latest/datafusion/logical_expr/trait.ScalarUDFImpl.html#tymethod.invoke_with_args">invoke_with_args</a>
in the user defined function takes
+the updated struct <a
href="https://docs.rs/datafusion/latest/datafusion/logical_expr/struct.ScalarFunctionArgs.html">ScalarFunctionArgs</a>.
This now contains the input fields, which can
+be used to check for metadata. For example, you can do the following:</p>
+<pre><code class="language-rust">fn
invoke_with_args(&amp;self, args: ScalarFunctionArgs) -&gt;
Result&lt;ColumnarValue&gt; {
+ assert_eq!(args.arg_fields.len(), 1);
+ let my_value = args.arg_fields[0]
+ .metadata()
+ .get("encoding_type");
+ ...
+</code></pre>
+<p>In this snippet we have extracted an
<code>Option&lt;&amp;String&gt;</code> from the input field
metadata
+which we can then use to determine which functions we might want to call. We
could
+then parse the returned value to determine what type of encoding to use when
+evaluating the array in the arguments. Since
<code>return_field_from_args</code> is not <code>&amp;mut
self</code>
+this check could not be performed during the planning stage.</p>
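+<p>To make the dispatch step concrete, here is a minimal sketch using only the
Rust standard library. A plain <code>HashMap&lt;String, String&gt;</code>
stands in for the field metadata map, and the <code>ImageEncoding</code> enum,
the <code>encoding_from_metadata</code> helper, and the <code>"png"</code> and
<code>"jpeg"</code> values are assumptions made for illustration, not part of
DataFusion.</p>

```rust
use std::collections::HashMap;

/// Encodings this hypothetical function knows how to decode.
#[derive(Debug, PartialEq)]
enum ImageEncoding {
    Png,
    Jpeg,
}

/// Reads the "encoding_type" key from a metadata map and picks a decoder.
/// The map stands in for the metadata carried by an input field.
fn encoding_from_metadata(
    metadata: &HashMap<String, String>,
) -> Result<ImageEncoding, String> {
    match metadata.get("encoding_type").map(String::as_str) {
        Some("png") => Ok(ImageEncoding::Png),
        Some("jpeg") => Ok(ImageEncoding::Jpeg),
        Some(other) => Err(format!("unsupported encoding_type: {other}")),
        None => Err("missing encoding_type metadata".to_string()),
    }
}

fn main() {
    let metadata =
        HashMap::from([("encoding_type".to_string(), "png".to_string())]);
    println!("{:?}", encoding_from_metadata(&metadata));
}
```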
+<p>The description in this section applies to scalar user defined
functions, but equivalent
+support exists for aggregate and window functions.</p>
+<h2 id="extension-types">Extension types<a class="headerlink"
href="#extension-types" title="Permanent link">¶</a></h2>
+<p>Extension types are one of the primary motivations for this
enhancement in
+DataFusion 48.0.0. The official Rust implementation of Apache Arrow, <a
href="https://github.com/apache/arrow-rs">arrow-rs</a>,
+already contains support for the <a
href="https://arrow.apache.org/docs/format/CanonicalExtensions.html">canonical
extension types</a>. This support includes
+helper functions such as
<code>try_canonical_extension_type()</code> in the earlier
example.</p>
+<p>For a concrete example of how extension types can be used in
DataFusion functions,
+there is an <a
href="https://github.com/timsaucer/datafusion_extension_type_examples">example
repository</a> that demonstrates using UUIDs. The UUID extension
+type specifies that the data are stored as a Fixed Size Binary of length 16.
In the
+DataFusion core functions, we have the ability to generate string
representations of
+UUIDs that match the version 4 specification. These are helpful, but a user may
+wish to do additional work with UUIDs where having them in the dense
representation
+is preferable. Alternatively, the user may already have data with the binary
encoding
+and we want to extract values such as the version, timestamp, or string
+representation.</p>
+<p>In the example repository we have created three user defined
functions: <code>UuidVersion</code>,
+<code>StringToUuid</code>, and
<code>UuidToString</code>. Each of these implements
<code>ScalarUDFImpl</code> and can
+be used thusly:</p>
+<pre><code class="language-rust">#[tokio::main]
+async fn main() -&gt;
Result&lt;()&gt; {
+ let ctx = create_context()?;
+
+ // get a DataFrame from the context
+ let mut df = ctx.table("t").await?;
+
+ // Create the string UUIDs
+ df = df.select(vec![uuid().alias("string_uuid")])?;
+
+ // Convert string UUIDs to canonical extension UUIDs
+ let string_to_uuid = ScalarUDF::new_from_impl(StringToUuid::default());
+ df = df.with_column("uuid",
string_to_uuid.call(vec![col("string_uuid")]))?;
+
+ // Extract version number from canonical extension UUIDs
+ let version = ScalarUDF::new_from_impl(UuidVersion::default());
+ df = df.with_column("version", version.call(vec![col("uuid")]))?;
+
+ // Convert back to a string
+ let uuid_to_string = ScalarUDF::new_from_impl(UuidToString::default());
+ df = df.with_column("string_round_trip",
uuid_to_string.call(vec![col("uuid")]))?;
+
+ df.show().await?;
+
+ Ok(())
+}
+</code></pre>
+<p>The <a
href="https://github.com/timsaucer/datafusion_extension_type_examples">example
repository</a> also contains a crate that demonstrates how to expose
these
+UDFs to <a
href="https://datafusion.apache.org/python/">datafusion-python</a>.
This requires version 48.0.0 or later.</p>
+<h2 id="other-use-cases">Other use cases<a class="headerlink"
href="#other-use-cases" title="Permanent link">¶</a></h2>
+<p>The metadata attached to the fields can be used to store
<em>any</em> user data in key/value
+pairs. Some of the other use cases that have been identified include:</p>
+<ul>
+<li>Creating output for downstream systems. One user of DataFusion
produces
+ <a href="https://rerun.io/blog/column-chunks">data
visualizations</a> that are dependent upon metadata in record batch
fields. By
+ enabling metadata on output of user defined functions, we can now produce
batches
+ that are directly consumable by these systems.</li>
+<li>Describing the relationships between columns of data. You can store
data about how
+ one column of data relates to another and use these during function
evaluation. For
+ example, in robotics it is common to use <a
href="https://wiki.ros.org/tf2">transforms</a> to describe how to
convert
+ from one coordinate system to another. It can be convenient to send the
function
+ all the columns that contain transform information and then allow the
function
+ to determine which columns to use based on the metadata. This allows for
+ encapsulation of the transform logic within the user function.</li>
+<li>Storing logical types of the data model. <a
href="https://docs.influxdata.com/influxdb/v1/concepts/schema_and_data_layout/">InfluxDB</a>
uses field metadata to specify
+ which columns are used for tags, times, and fields.</li>
+</ul>
+<p>Based on the experience of the authors, we recommend caution when
using metadata
+for use cases other than extension types. One issue that can arise is that as
columns
+are used to compute new fields, some functions may pass through the metadata
and the
+semantic meaning may change. For example, suppose you decided to use metadata
to
+store some kind of statistics for the entire stream of record batches. Then
you pass
+that column through a filter that removes many rows of data. Your statistics
+metadata may now be invalid, even though it was passed through the
filter.</p>
+<p>Similarly, if you use metadata to form relations between one column
and another and
+the naming of the columns has changed at some point in your workflow, then the
metadata
+may refer to the wrong column of data. This can be
mitigated by
+not relying on column naming but rather adding additional metadata to all
columns of
+interest.</p>
+<h2 id="acknowledgements">Acknowledgements<a class="headerlink"
href="#acknowledgements" title="Permanent link">¶</a></h2>
+<p>We would like to thank <a
href="https://rerun.io">Rerun.io</a> for sponsoring the development of
this work. <a href="https://rerun.io">Rerun.io</a>
+is building a data visualization system for Physical AI and uses metadata to
specify
+context about columns in Arrow record batches.</p>
+<h2 id="conclusion">Conclusion<a class="headerlink"
href="#conclusion" title="Permanent link">¶</a></h2>
+<p>The enhanced metadata handling in DataFusion 48.0.0 is a
significant step
+forward in the ability to handle more interesting types of data. Users can
+validate the input data matches the intent of the data to be processed, enable
+complex operations on binary data because we understand the encoding used, and
+use metadata to create new and interesting user defined data types.
+We can't wait to see what you build with it!</p>
+<h2 id="get-involved">Get Involved<a class="headerlink"
href="#get-involved" title="Permanent link">¶</a></h2>
+<p>The DataFusion team is an active and engaging community and we would
love to have you join
+us and help the project.</p>
+<p>Here are some ways to get involved:</p>
+<ul>
+<li>Learn more by visiting the <a
href="https://datafusion.apache.org/index.html">DataFusion</a> project
page.</li>
+<li>Try out the project and provide feedback, file issues, and
contribute code.</li>
+<li>Work on a <a
href="https://github.com/apache/datafusion/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22">good
first issue</a>.</li>
+<li>Reach out to us via the <a
href="https://datafusion.apache.org/contributor-guide/communication.html">communication
doc</a>.</li>
+</ul></content><category
term="blog"></category></entry><entry><title>Apache DataFusion Comet 0.10.0
Release</title><link
href="https://datafusion.apache.org/blog/2025/09/16/datafusion-comet-0.10.0"
rel="alternate"></link><published>2025-09-16T00:00:00+00:00</published><updated>2025-09-16T00:00:00+00:00</updated><author><name>pmc</name></author><id>tag:datafusion.apache.org,2025-09-16:/blog/2025/09/16/datafusion-comet-0.10.0</id><summary
type="html"><!--
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
diff --git a/output/feeds/blog.atom.xml b/output/feeds/blog.atom.xml
index 424ab3f..1f83133 100644
--- a/output/feeds/blog.atom.xml
+++ b/output/feeds/blog.atom.xml
@@ -1,5 +1,275 @@
<?xml version="1.0" encoding="utf-8"?>
-<feed xmlns="http://www.w3.org/2005/Atom"><title>Apache DataFusion Blog -
blog</title><link href="https://datafusion.apache.org/blog/"
rel="alternate"></link><link
href="https://datafusion.apache.org/blog/feeds/blog.atom.xml"
rel="self"></link><id>https://datafusion.apache.org/blog/</id><updated>2025-09-16T00:00:00+00:00</updated><subtitle></subtitle><entry><title>Apache
DataFusion Comet 0.10.0 Release</title><link
href="https://datafusion.apache.org/blog/2025/09/16/datafusion-comet-0.10 [...]
+<feed xmlns="http://www.w3.org/2005/Atom"><title>Apache DataFusion Blog -
blog</title><link href="https://datafusion.apache.org/blog/"
rel="alternate"></link><link
href="https://datafusion.apache.org/blog/feeds/blog.atom.xml"
rel="self"></link><id>https://datafusion.apache.org/blog/</id><updated>2025-09-21T00:00:00+00:00</updated><subtitle></subtitle><entry><title>Implementing
User Defined Types and Custom Metadata in DataFusion</title><link
href="https://datafusion.apache.org/blog/2025/ [...]
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+http://www.apache.org/licenses/LICENSE-2.0
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}x
+-->
+
+<p><a
href="https://datafusion.apache.org/blog/2025/07/16/datafusion-48.0.0/">Apache
DataFusion</a> significantly improves support for user
+defined types and metadata. The user defined function APIs let users access
+metadata on the input columns to functions and produce metadata in the
output.</p>
+<h2 id="user-defined-types-extension-types">User defined types ==
extension types<a class="headerlink"
href="#user-defined-types-extension-types" title="Permanent
link">¶</a></h2>
+<p>DataFusion directly uses <a
href="https://arrow.apache.org">Apache Arrow</a>'s <a
href="https://docs.rs/arrow/latest/arrow/datatypes/enum.DataType.html">DataTypes</a>
as its type system. This
+has …</p></summary><content type="html"><!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+http://www.apache.org/licenses/LICENSE-2.0
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}x
+-->
+
+<p><a
href="https://datafusion.apache.org/blog/2025/07/16/datafusion-48.0.0/">Apache
DataFusion</a> significantly improves support for user
+defined types and metadata. The user defined function APIs let users access
+metadata on the input columns to functions and produce metadata in the
output.</p>
+<h2 id="user-defined-types-extension-types">User defined types ==
extension types<a class="headerlink"
href="#user-defined-types-extension-types" title="Permanent
link">¶</a></h2>
+<p>DataFusion directly uses <a
href="https://arrow.apache.org">Apache Arrow</a>'s <a
href="https://docs.rs/arrow/latest/arrow/datatypes/enum.DataType.html">DataTypes</a>
as its type system. This
+has several benefits: it is simple to explain, supports a rich set of
+both scalar and nested types, offers true zero-copy interoperability with other Arrow
+implementations, and has world-class library support (via <a
href="https://github.com/apache/arrow-rs">arrow-rs</a>). However, one
+challenge of directly using the Arrow type system is that there is no distinction
+between logical types and physical types. For example, the Arrow type system
+contains multiple types which can store "String"s (sequences of UTF-8 encoded
+bytes) such as <code>Utf8</code>,
<code>LargeUtf8</code>, <code>Dictionary(Utf8)</code>,
and <code>Utf8View</code>.</p>
+<p>However, Apache Arrow does provide <a
href="https://arrow.apache.org/docs/format/Columnar.html#format-metadata-extension-types">extension
types</a>, a version of logical type
+information, which describe how to interpret data stored in one of the existing
+physical types. With the improved support for metadata in DataFusion 48.0.0, it
+is now easier to implement user defined types using Arrow extension
types.</p>
+<h2 id="metadata-in-apache-arrow-fields">Metadata in Apache Arrow
<code>Field</code>s<a class="headerlink"
href="#metadata-in-apache-arrow-fields" title="Permanent
link">¶</a></h2>
+<p>The <a
href="https://arrow.apache.org/docs/format/Columnar.html">Arrow
specification</a> defines Metadata as a map of key-value pairs of
+strings. This metadata is used to attach extension types and use case-specific
+context to a column of values. The Rust implementation of Apache Arrow,
+<a href="https://github.com/apache/arrow-rs">arrow-rs</a>, stores
metadata on <a
href="https://arrow.apache.org/docs/format/Glossary.html#term-field">Field</a>s,
but prior to DataFusion 48.0.0, many of
+DataFusion's internal APIs used <a
href="https://docs.rs/arrow/latest/arrow/datatypes/enum.DataType.html">DataTypes</a>
directly, and thus did not propagate
+metadata through all operations.</p>
+<p>In previous versions of DataFusion, <code>Field</code>
metadata was propagated through certain
+operations (e.g., renaming or selecting a column) but not through
+others (e.g., scalar, window, or aggregate function calls). In DataFusion
48.0.0
+and later, all user defined functions are passed the full
+input <code>Field</code> information and can return
<code>Field</code> information to the caller.</p>
+<p>Supporting extension types was a key motivation for adding metadata
to
+function processing, but the same mechanism can store arbitrary metadata on the
+input and output fields, which supports other interesting use cases that we
describe
+later in this post.</p>
+<h2 id="metadata-handling">Metadata handling<a class="headerlink"
href="#metadata-handling" title="Permanent link">¶</a></h2>
+<p>Arrow record batches carry a <a
href="https://docs.rs/arrow/latest/arrow/datatypes/struct.Schema.html">Schema</a>
in addition to the Arrow arrays. Each
+<a
href="https://arrow.apache.org/docs/format/Glossary.html#term-field">Field</a>
in this <code>Schema</code> contains a name, data type,
nullability, and metadata. The
+metadata is specified as a map of key-value pairs of strings. In the new
+implementation, during processing of all user defined functions we pass the
input
+field information.</p>
+<figure>
+<img alt="Relationship between a Record Batch, its schema, and the
underlying arrays. There is a one-to-one relationship between each Field in the
Schema and Array entry in the Columns." class="img-responsive"
src="/blog/images/metadata-handling/arrow_record_batch.png" width="100%"/>
+<figcaption>
+<b>Figure 1:</b> Relationship between a Record Batch, its schema,
and the underlying arrays. There is a one-to-one relationship between each
Field in the Schema and Array entry in the Columns.
+ </figcaption>
+</figure>
+<p>It is often desirable to write a generic function for reuse. Prior
versions of
+user defined functions only had access to the
<code>DataType</code> of the input columns.
+This works well for some features that only rely on the types of data, but
other
+use cases may need additional information that describes the data.</p>
+<p>For example, suppose we wish to write a function that takes in a UUID
and returns a string
+of the <a
href="https://www.ietf.org/rfc/rfc9562.html#section-4.1">variant</a>
of the input field. We would want this function to be able to handle
+all of the string types and also a binary encoded UUID. Because the Arrow
specification does not
+contain an unsigned 128 bit type, it is common to encode a UUID as a fixed
size binary
+array where each element is 16 bytes long. With the metadata handling in
DataFusion 48.0.0,
+we can validate during planning that the input data not only has the correct
underlying
+data type, but that it also represents the right <em>kind</em> of
data. The UUID example is a
+common one, and it is included in the <a
href="https://arrow.apache.org/docs/format/CanonicalExtensions.html">canonical
extension types</a> that are now
+supported in DataFusion.</p>
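+<p>As a rough sketch of what such a function might compute once it has the
16 raw bytes of a UUID, the version and variant can be read from octets 6 and 8
as laid out in RFC 9562. The names <code>UuidVariant</code>,
<code>uuid_version</code>, and <code>uuid_variant</code> are hypothetical and
are not part of DataFusion or arrow-rs; this uses only the standard
library.</p>

```rust
/// Variant families described in RFC 9562 section 4.1.
/// The enum and its variant names are this sketch's own (hypothetical).
#[derive(Debug, PartialEq)]
enum UuidVariant {
    Ncs,
    Rfc9562,
    Microsoft,
    Reserved,
}

/// The version is stored in the high four bits of octet 6.
fn uuid_version(bytes: [u8; 16]) -> u8 {
    bytes[6] >> 4
}

/// The variant is stored in the most significant bits of octet 8.
fn uuid_variant(bytes: [u8; 16]) -> UuidVariant {
    match bytes[8] >> 4 {
        0x0..=0x7 => UuidVariant::Ncs,       // 0xxx
        0x8..=0xB => UuidVariant::Rfc9562,   // 10xx
        0xC..=0xD => UuidVariant::Microsoft, // 110x
        _ => UuidVariant::Reserved,          // 111x
    }
}
```

+<p>A real user defined function would apply logic like this to each element
of the fixed size binary array, after the planning-time check has confirmed
that the column really holds UUIDs.</p>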
+<p>Another common application of metadata handling is understanding
the encoding of a blob of data.
+Suppose you have a column that contains image data. Most likely this data is
stored as
+an array of <code>u8</code> data. Without knowing a priori what
the encoding of that blob of data is,
+you cannot ensure you are using the correct methods for decoding it. You may
work around
+this by adding another column to your data source indicating the encoding, but
this can be
+wasteful for systems where the encoding never changes. Instead, you could use
metadata to
+specify the encoding for the entire column.</p>
+<h2 id="how-to-use-metadata-in-user-defined-functions">How to use
metadata in user defined functions<a class="headerlink"
href="#how-to-use-metadata-in-user-defined-functions" title="Permanent
link">¶</a></h2>
+<p>When working with metadata for <a
href="https://docs.rs/datafusion/latest/datafusion/logical_expr/trait.ScalarUDFImpl.html">user
defined scalar functions</a>, there are typically two
+places in the function definition that require implementation.</p>
+<ul>
+<li>Computing the return field from the arguments</li>
+<li>Invocation</li>
+</ul>
+<p>During planning, we will attempt to call the function <a
href="https://docs.rs/datafusion/latest/datafusion/logical_expr/trait.ScalarUDFImpl.html#method.return_field_from_args">return_field_from_args()</a>.
This will
+provide a list of input fields to the function and return the output field. To
evaluate
+metadata on the input side, you can write a function similar to this
example:</p>
+<pre><code class="language-rust">fn return_field_from_args(
+ &amp;self,
+ args: ReturnFieldArgs,
+) -&gt; datafusion::common::Result&lt;FieldRef&gt; {
+ if args.arg_fields.len() != 1 {
+ return exec_err!("Incorrect number of arguments for uuid_version");
+ }
+
+ let input_field = &amp;args.arg_fields[0];
+ if &amp;DataType::FixedSizeBinary(16) == input_field.data_type() {
+ let Ok(CanonicalExtensionType::Uuid(_)) =
input_field.try_canonical_extension_type()
+ else {
+ return exec_err!("Input field must contain the UUID canonical
extension type");
+ };
+ }
+
+ let is_nullable = args.arg_fields[0].is_nullable();
+
+ Ok(Arc::new(Field::new(self.name(), DataType::UInt32, is_nullable)))
+}
+</code></pre>
+<p>In this example, we take advantage of the fact that we already have
support for extension
+types that evaluate metadata. If you wanted to check for metadata
other than
+extension type support, you could instead write a snippet such
as:</p>
+<pre><code class="language-rust"> if
&amp;DataType::FixedSizeBinary(16) == input_field.data_type() {
+ // Fail during planning if the expected metadata key is absent
+ let _ = input_field
+ .metadata()
+ .get("ARROW:extension:metadata")
+ .ok_or_else(|| exec_datafusion_err!("Input field must contain the UUID
canonical extension type"))?;
+ }
+</code></pre>
+<p>If you are writing a user defined function that will instead return
metadata on output,
+you can add this directly into the <code>Field</code> that is the
output of the <code>return_field_from_args</code>
+call. In our above example, we could change the return line to:</p>
+<pre><code class="language-rust"> Ok(Arc::new(
+ Field::new(self.name(), DataType::UInt32, is_nullable).with_metadata(
+ [("my_key".to_string(), "my_value".to_string())]
+ .into_iter()
+ .collect(),
+ ),
+ ))
+</code></pre>
+<p>By checking the metadata during the planning process, we can identify
errors early in
+the query process. There are cases where we wish to have access to this
metadata during
+execution as well. The function <a
href="https://docs.rs/datafusion/latest/datafusion/logical_expr/trait.ScalarUDFImpl.html#tymethod.invoke_with_args">invoke_with_args</a>
in the user defined function takes
+the updated struct <a
href="https://docs.rs/datafusion/latest/datafusion/logical_expr/struct.ScalarFunctionArgs.html">ScalarFunctionArgs</a>.
This now contains the input fields, which can
+be used to check for metadata. For example, you can do the following:</p>
+<pre><code class="language-rust">fn
invoke_with_args(&amp;self, args: ScalarFunctionArgs) -&gt;
Result&lt;ColumnarValue&gt; {
+ assert_eq!(args.arg_fields.len(), 1);
+ let my_value = args.arg_fields[0]
+ .metadata()
+ .get("encoding_type");
+ ...
+</code></pre>
+<p>In this snippet we have extracted an
<code>Option&lt;&amp;String&gt;</code> from the input field
metadata
+which we can then use to determine which functions we might want to call. We
could
+then parse the returned value to determine what type of encoding to use when
+evaluating the array in the arguments. Because
<code>return_field_from_args</code> takes <code>&amp;self</code> rather than
<code>&amp;mut self</code>,
+the result of such a lookup cannot be stored on the function during planning,
so it is performed here during execution.</p>
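+<p>One way the parsing step might look is sketched below. The
<code>BlobEncoding</code> enum, the <code>parse_encoding</code> function, and
the values "png" and "jpeg" are assumptions made for illustration; the
original example does not say which values <code>encoding_type</code> may
hold.</p>

```rust
/// Encodings this sketch knows how to decode (hypothetical set).
#[derive(Debug, PartialEq)]
enum BlobEncoding {
    Png,
    Jpeg,
    Unknown,
}

/// Map the metadata value to an encoding, ignoring ASCII case.
/// Unrecognized values are reported as `Unknown` rather than guessed.
fn parse_encoding(value: String) -> BlobEncoding {
    match value.to_ascii_lowercase().as_str() {
        "png" => BlobEncoding::Png,
        "jpeg" | "jpg" => BlobEncoding::Jpeg,
        _ => BlobEncoding::Unknown,
    }
}
```

+<p>Inside <code>invoke_with_args</code> the string would come from the
metadata lookup shown above, and the function body could then branch on the
resulting enum to pick a decoder or return an error for
<code>Unknown</code>.</p>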
+<p>The description in this section applies to scalar user defined
functions, but equivalent
+support exists for aggregate and window functions.</p>
+<h2 id="extension-types">Extension types<a class="headerlink"
href="#extension-types" title="Permanent link">¶</a></h2>
+<p>Extension types are one of the primary motivations for this
enhancement in
+DataFusion 48.0.0. The official Rust implementation of Apache Arrow, <a
href="https://github.com/apache/arrow-rs">arrow-rs</a>,
+already contains support for the <a
href="https://arrow.apache.org/docs/format/CanonicalExtensions.html">canonical
extension types</a>. This support includes
+helper functions such as
<code>try_canonical_extension_type()</code> in the earlier
example.</p>
+<p>For a concrete example of how extension types can be used in
DataFusion functions,
+there is an <a
href="https://github.com/timsaucer/datafusion_extension_type_examples">example
repository</a> that demonstrates using UUIDs. The UUID extension
+type specifies that the data are stored as a Fixed Size Binary of length 16.
In the
+DataFusion core functions, we have the ability to generate string
representations of
+UUIDs that match the version 4 specification. These are helpful, but a user may
+wish to do additional work with UUIDs where having them in the dense
representation
+is preferable. Alternatively, the user may already have data with the binary
encoding
+and want to extract values such as the version, timestamp, or string
+representation.</p>
+<p>In the example repository we have created three user defined
functions: <code>UuidVersion</code>,
+<code>StringToUuid</code>, and
<code>UuidToString</code>. Each of these implements
<code>ScalarUDFImpl</code> and can
+be used as follows:</p>
+<pre><code class="language-rust">// An async main needs a runtime; tokio is assumed here
+#[tokio::main]
+async fn main() -&gt;
Result&lt;()&gt; {
+ let ctx = create_context()?;
+
+ // get a DataFrame from the context
+ let mut df = ctx.table("t").await?;
+
+ // Create the string UUIDs
+ df = df.select(vec![uuid().alias("string_uuid")])?;
+
+ // Convert string UUIDs to canonical extension UUIDs
+ let string_to_uuid = ScalarUDF::new_from_impl(StringToUuid::default());
+ df = df.with_column("uuid",
string_to_uuid.call(vec![col("string_uuid")]))?;
+
+ // Extract version number from canonical extension UUIDs
+ let version = ScalarUDF::new_from_impl(UuidVersion::default());
+ df = df.with_column("version", version.call(vec![col("uuid")]))?;
+
+ // Convert back to a string
+ let uuid_to_string = ScalarUDF::new_from_impl(UuidToString::default());
+ df = df.with_column("string_round_trip",
uuid_to_string.call(vec![col("uuid")]))?;
+
+ df.show().await?;
+
+ Ok(())
+}
+</code></pre>
+<p>The <a
href="https://github.com/timsaucer/datafusion_extension_type_examples">example
repository</a> also contains a crate that demonstrates how to expose
these
+UDFs to <a
href="https://datafusion.apache.org/python/">datafusion-python</a>.
This requires version 48.0.0 or later.</p>
+<h2 id="other-use-cases">Other use cases<a class="headerlink"
href="#other-use-cases" title="Permanent link">¶</a></h2>
+<p>The metadata attached to the fields can be used to store
<em>any</em> user data in key/value
+pairs. Some of the other use cases that have been identified include:</p>
+<ul>
+<li>Creating output for downstream systems. One user of DataFusion
produces
+ <a href="https://rerun.io/blog/column-chunks">data
visualizations</a> that are dependent upon metadata in record batch
fields. By
+ enabling metadata on output of user defined functions, we can now produce
batches
+ that are directly consumable by these systems.</li>
+<li>Describing the relationships between columns of data. You can store
data about how
+ one column of data relates to another and use these during function
evaluation. For
+ example, in robotics it is common to use <a
href="https://wiki.ros.org/tf2">transforms</a> to describe how to
convert
+ from one coordinate system to another. It can be convenient to send the
function
+ all the columns that contain transform information and then allow the
function
+ to determine which columns to use based on the metadata. This allows for
+ encapsulation of the transform logic within the user function.</li>
+<li>Storing logical types of the data model. <a
href="https://docs.influxdata.com/influxdb/v1/concepts/schema_and_data_layout/">InfluxDB</a>
uses field metadata to specify
+ which columns are used for tags, times, and fields.</li>
+</ul>
+<p>Based on the experience of the authors, we recommend caution when
using metadata
+for use cases other than type extension. One issue that can arise is that as
columns
+are used to compute new fields, some functions may pass through the metadata
and the
+semantic meaning may change. For example, suppose you decided to use metadata
to
+store some kind of statistics for the entire stream of record batches. Then
you pass
+that column through a filter that removes many rows of data. Your statistics
+metadata may now be invalid, even though it was passed through the
filter.</p>
+<p>Similarly, if you use metadata to form relations between one column
and another and
+the naming of the columns has changed at some point in your workflow, then the
metadata
+may refer to the wrong column of data. This can be
mitigated by
+not relying on column naming but rather adding additional metadata to all
columns of
+interest.</p>
+<h2 id="acknowledgements">Acknowledgements<a class="headerlink"
href="#acknowledgements" title="Permanent link">¶</a></h2>
+<p>We would like to thank <a
href="https://rerun.io">Rerun.io</a> for sponsoring the development of
this work. <a href="https://rerun.io">Rerun.io</a>
+is building a data visualization system for Physical AI and uses metadata to
specify
+context about columns in Arrow record batches.</p>
+<h2 id="conclusion">Conclusion<a class="headerlink"
href="#conclusion" title="Permanent link">¶</a></h2>
+<p>The enhanced metadata handling in DataFusion 48.0.0 is a
significant step
+forward in the ability to handle more interesting types of data. Users can
+validate that the input data matches the intent of the data to be processed, enable
+complex operations on binary data because we understand the encoding used, and
+use metadata to create new and interesting user defined data types.
+We can't wait to see what you build with it!</p>
+<h2 id="get-involved">Get Involved<a class="headerlink"
href="#get-involved" title="Permanent link">¶</a></h2>
+<p>The DataFusion team is an active and engaging community and we would
love to have you join
+us and help the project.</p>
+<p>Here are some ways to get involved:</p>
+<ul>
+<li>Learn more by visiting the <a
href="https://datafusion.apache.org/index.html">DataFusion</a> project
page.</li>
+<li>Try out the project and provide feedback, file issues, and
contribute code.</li>
+<li>Work on a <a
href="https://github.com/apache/datafusion/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22">good
first issue</a>.</li>
+<li>Reach out to us via the <a
href="https://datafusion.apache.org/contributor-guide/communication.html">communication
doc</a>.</li>
+</ul></content><category
term="blog"></category></entry><entry><title>Apache DataFusion Comet 0.10.0
Release</title><link
href="https://datafusion.apache.org/blog/2025/09/16/datafusion-comet-0.10.0"
rel="alternate"></link><published>2025-09-16T00:00:00+00:00</published><updated>2025-09-16T00:00:00+00:00</updated><author><name>pmc</name></author><id>tag:datafusion.apache.org,2025-09-16:/blog/2025/09/16/datafusion-comet-0.10.0</id><summary
type="html"><!--
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
diff --git
a/output/feeds/tim-saucerrerunio-dewey-dunningtonwherobots-andrew-lambinfluxdata.atom.xml
b/output/feeds/tim-saucerrerunio-dewey-dunningtonwherobots-andrew-lambinfluxdata.atom.xml
new file mode 100644
index 0000000..1243d75
--- /dev/null
+++
b/output/feeds/tim-saucerrerunio-dewey-dunningtonwherobots-andrew-lambinfluxdata.atom.xml
@@ -0,0 +1,272 @@
+<?xml version="1.0" encoding="utf-8"?>
+<feed xmlns="http://www.w3.org/2005/Atom"><title>Apache DataFusion Blog - Tim
Saucer(rerun.io), Dewey Dunnington(Wherobots), Andrew
Lamb(InfluxData)</title><link href="https://datafusion.apache.org/blog/"
rel="alternate"></link><link
href="https://datafusion.apache.org/blog/feeds/tim-saucerrerunio-dewey-dunningtonwherobots-andrew-lambinfluxdata.atom.xml"
rel="self"></link><id>https://datafusion.apache.org/blog/</id><updated>2025-09-21T00:00:00+00:00</updated><subtitle></subtitle><entry><
[...]
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+http://www.apache.org/licenses/LICENSE-2.0
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}x
+-->
+
+<p><a
href="https://datafusion.apache.org/blog/2025/07/16/datafusion-48.0.0/">Apache
DataFusion</a> significantly improves support for user
+defined types and metadata. The user defined function APIs let users access
+metadata on the input columns to functions and produce metadata in the
output.</p>
+<h2 id="user-defined-types-extension-types">User defined types ==
extension types<a class="headerlink"
href="#user-defined-types-extension-types" title="Permanent
link">¶</a></h2>
+<p>DataFusion directly uses <a
href="https://arrow.apache.org">Apache Arrow</a>'s <a
href="https://docs.rs/arrow/latest/arrow/datatypes/enum.DataType.html">DataTypes</a>
as its type system. This
+has …</p></summary><content type="html"><!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+http://www.apache.org/licenses/LICENSE-2.0
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}x
+-->
+
+<p><a
href="https://datafusion.apache.org/blog/2025/07/16/datafusion-48.0.0/">Apache
DataFusion</a> significantly improves support for user
+defined types and metadata. The user defined function APIs let users access
+metadata on the input columns to functions and produce metadata in the
output.</p>
+<h2 id="user-defined-types-extension-types">User defined types ==
extension types<a class="headerlink"
href="#user-defined-types-extension-types" title="Permanent
link">¶</a></h2>
+<p>DataFusion directly uses <a
href="https://arrow.apache.org">Apache Arrow</a>'s <a
href="https://docs.rs/arrow/latest/arrow/datatypes/enum.DataType.html">DataTypes</a>
as its type system. This
+has several benefits, including being simple to explain, supporting a rich set of
+both scalar and nested types, true zero copy interoperability with other Arrow
+implementations, and world-class library support (via <a
href="https://github.com/apache/arrow-rs">arrow-rs</a>). However, one
+challenge of directly using the Arrow type system is there is no distinction
+between logical types and physical types. For example, the Arrow type system
+contains multiple types which can store "String"s (sequences of UTF8 encoded
+bytes) such as <code>Utf8</code>,
<code>LargeUtf8</code>, <code>Dictionary(Utf8)</code>,
and <code>Utf8View</code>. </p>
+<p>However, Apache Arrow does provide <a
href="https://arrow.apache.org/docs/format/Columnar.html#format-metadata-extension-types">extension
types</a>, a version of logical type
+information, which describe how to interpret data stored in one of the existing
+physical types. With the improved support for metadata in DataFusion 48.0.0, it
+is now easier to implement user defined types using Arrow extension
types.</p>
+<h2 id="metadata-in-apache-arrow-fields">Metadata in Apache Arrow
<code>Field</code>s<a class="headerlink"
href="#metadata-in-apache-arrow-fields" title="Permanent
link">¶</a></h2>
+<p>The <a
href="https://arrow.apache.org/docs/format/Columnar.html">Arrow
specification</a> defines Metadata as a map of key-value pairs of
+strings. This metadata is used to attach extension types and use case-specific
+context to a column of values. The Rust implementation of Apache Arrow,
+<a href="https://github.com/apache/arrow-rs">arrow-rs</a>, stores
metadata on <a
href="https://arrow.apache.org/docs/format/Glossary.html#term-field">Field</a>s,
but prior to DataFusion 48.0.0, many of
+DataFusion's internal APIs used <a
href="https://docs.rs/arrow/latest/arrow/datatypes/enum.DataType.html">DataTypes</a>
directly, and thus did not propagate
+metadata through all operations.</p>
+<p>In previous versions of DataFusion, <code>Field</code>
metadata was propagated through certain
+operations (e.g., renaming or selecting a column) but not through
+others (e.g., scalar, window, or aggregate function calls). In DataFusion
48.0.0
+and later, all user defined functions are passed the full
+input <code>Field</code> information and can return
<code>Field</code> information to the caller.</p>
+<p>Supporting extension types was a key motivation for adding metadata
to the
+function processing, but the same mechanism can store arbitrary metadata on the
+input and output fields, which supports other interesting use cases that we
describe
+later in this post.</p>
+<h2 id="metadata-handling">Metadata handling<a class="headerlink"
href="#metadata-handling" title="Permanent link">¶</a></h2>
+<p>Arrow record batches carry a <a
href="https://docs.rs/arrow/latest/arrow/datatypes/struct.Schema.html">Schema</a>
in addition to the Arrow arrays. Each
+<a
href="https://arrow.apache.org/docs/format/Glossary.html#term-field">Field</a>
in this <code>Schema</code> contains a name, data type,
nullability, and metadata. The
+metadata is specified as a map of key-value pairs of strings. In the new
+implementation, during processing of all user defined functions we pass the
input
+field information.</p>
+<figure>
+<img alt="Relationship between a Record Batch, its schema, and the
underlying arrays. There is a one-to-one relationship between each Field in the
Schema and Array entry in the Columns." class="img-responsive"
src="/blog/images/metadata-handling/arrow_record_batch.png" width="100%"/>
+<figcaption>
+<b>Figure 1:</b> Relationship between a Record Batch, its schema,
and the underlying arrays. There is a one-to-one relationship between each
Field in the Schema and Array entry in the Columns.
+ </figcaption>
+</figure>
+<p>It is often desirable to write a generic function for reuse. Prior
versions of
+user defined functions only had access to the
<code>DataType</code> of the input columns.
+This works well for some features that only rely on the types of data, but
other
+use cases may need additional information that describes the data.</p>
+<p>For example, suppose we wish to write a function that takes in a UUID
and returns a string
+of the <a
href="https://www.ietf.org/rfc/rfc9562.html#section-4.1">variant</a>
of the input field. We would want this function to be able to handle
+all of the string types and also a binary encoded UUID. Because the Arrow
specification does not
+contain an unsigned 128 bit type, it is common to encode a UUID as a fixed
size binary
+array where each element is 16 bytes long. With the metadata handling in
DataFusion 48.0.0,
+we can validate during planning that the input data not only has the correct
underlying
+data type, but that it also represents the right <em>kind</em> of
data. The UUID example is a
+common one, and it is included in the <a
href="https://arrow.apache.org/docs/format/CanonicalExtensions.html">canonical
extension types</a> that are now
+supported in DataFusion.</p>
+<p>Another common application of metadata handling is understanding
the encoding of a blob of data.
+Suppose you have a column that contains image data. Most likely this data is
stored as
+an array of <code>u8</code> data. Without knowing a priori what
the encoding of that blob of data is,
+you cannot ensure you are using the correct methods for decoding it. You may
work around
+this by adding another column to your data source indicating the encoding, but
this can be
+wasteful for systems where the encoding never changes. Instead, you could use
metadata to
+specify the encoding for the entire column.</p>
+<h2 id="how-to-use-metadata-in-user-defined-functions">How to use
metadata in user defined functions<a class="headerlink"
href="#how-to-use-metadata-in-user-defined-functions" title="Permanent
link">¶</a></h2>
+<p>When working with metadata for <a
href="https://docs.rs/datafusion/latest/datafusion/logical_expr/trait.ScalarUDFImpl.html">user
defined scalar functions</a>, there are typically two
+places in the function definition that require implementation.</p>
+<ul>
+<li>Computing the return field from the arguments</li>
+<li>Invocation</li>
+</ul>
+<p>During planning, we will attempt to call the function <a
href="https://docs.rs/datafusion/latest/datafusion/logical_expr/trait.ScalarUDFImpl.html#method.return_field_from_args">return_field_from_args()</a>.
This will
+provide a list of input fields to the function and return the output field. To
evaluate
+metadata on the input side, you can write a function similar to this
example:</p>
+<pre><code class="language-rust">fn return_field_from_args(
+ &amp;self,
+ args: ReturnFieldArgs,
+) -&gt; datafusion::common::Result&lt;FieldRef&gt; {
+ if args.arg_fields.len() != 1 {
+ return exec_err!("Incorrect number of arguments for uuid_version");
+ }
+
+ let input_field = &amp;args.arg_fields[0];
+ if &amp;DataType::FixedSizeBinary(16) == input_field.data_type() {
+ let Ok(CanonicalExtensionType::Uuid(_)) =
input_field.try_canonical_extension_type()
+ else {
+ return exec_err!("Input field must contain the UUID canonical
extension type");
+ };
+ }
+
+ let is_nullable = args.arg_fields[0].is_nullable();
+
+ Ok(Arc::new(Field::new(self.name(), DataType::UInt32, is_nullable)))
+}
+</code></pre>
+<p>In this example, we take advantage of the fact that we already have
support for extension
+types that evaluate metadata. If you wanted to check for metadata
other than
+extension type support, you could instead write a snippet such
as:</p>
+<pre><code class="language-rust"> if
&amp;DataType::FixedSizeBinary(16) == input_field.data_type() {
+ // Fail during planning if the expected metadata key is absent
+ let _ = input_field
+ .metadata()
+ .get("ARROW:extension:metadata")
+ .ok_or_else(|| exec_datafusion_err!("Input field must contain the UUID
canonical extension type"))?;
+ }
+</code></pre>
+<p>If you are writing a user defined function that will instead return
metadata on output,
+you can add this directly into the <code>Field</code> that is the
output of the <code>return_field_from_args</code>
+call. In our above example, we could change the return line to:</p>
+<pre><code class="language-rust"> Ok(Arc::new(
+ Field::new(self.name(), DataType::UInt32, is_nullable).with_metadata(
+ [("my_key".to_string(), "my_value".to_string())]
+ .into_iter()
+ .collect(),
+ ),
+ ))
+</code></pre>
+<p>By checking the metadata during the planning process, we can identify
errors early in
+the query process. There are cases where we wish to have access to this
metadata during
+execution as well. The function <a
href="https://docs.rs/datafusion/latest/datafusion/logical_expr/trait.ScalarUDFImpl.html#tymethod.invoke_with_args">invoke_with_args</a>
in the user defined function takes
+the updated struct <a
href="https://docs.rs/datafusion/latest/datafusion/logical_expr/struct.ScalarFunctionArgs.html">ScalarFunctionArgs</a>.
This now contains the input fields, which can
+be used to check for metadata. For example, you can do the following:</p>
+<pre><code class="language-rust">fn
invoke_with_args(&amp;self, args: ScalarFunctionArgs) -&gt;
Result&lt;ColumnarValue&gt; {
+ assert_eq!(args.arg_fields.len(), 1);
+ let my_value = args.arg_fields[0]
+ .metadata()
+ .get("encoding_type");
+ ...
+</code></pre>
+<p>In this snippet we have extracted an
<code>Option&lt;&amp;String&gt;</code> from the input field
metadata
+which we can then use to determine which functions we might want to call. We
could
+then parse the returned value to determine what type of encoding to use when
+evaluating the array in the arguments. Because
<code>return_field_from_args</code> takes <code>&amp;self</code> rather than
<code>&amp;mut self</code>,
+the result of such a lookup cannot be stored on the function during planning,
so it is performed here during execution.</p>
+<p>The description in this section applies to scalar user defined
functions, but equivalent
+support exists for aggregate and window functions.</p>
+<h2 id="extension-types">Extension types<a class="headerlink"
href="#extension-types" title="Permanent link">¶</a></h2>
+<p>Extension types are one of the primary motivations for this
enhancement in
+DataFusion 48.0.0. The official Rust implementation of Apache Arrow, <a
href="https://github.com/apache/arrow-rs">arrow-rs</a>,
+already contains support for the <a
href="https://arrow.apache.org/docs/format/CanonicalExtensions.html">canonical
extension types</a>. This support includes
+helper functions such as
<code>try_canonical_extension_type()</code> in the earlier
example.</p>
+<p>For a concrete example of how extension types can be used in
DataFusion functions,
+there is an <a
href="https://github.com/timsaucer/datafusion_extension_type_examples">example
repository</a> that demonstrates using UUIDs. The UUID extension
+type specifies that the data are stored as a Fixed Size Binary of length 16.
In the
+DataFusion core functions, we have the ability to generate string
representations of
+UUIDs that match the version 4 specification. These are helpful, but a user may
+wish to do additional work with UUIDs where having them in the dense
representation
+is preferable. Alternatively, the user may already have data with the binary
encoding
+and want to extract values such as the version, timestamp, or string
+representation.</p>
+<p>In the example repository we have created three user defined
functions: <code>UuidVersion</code>,
+<code>StringToUuid</code>, and
<code>UuidToString</code>. Each of these implements
<code>ScalarUDFImpl</code> and can
+be used as follows:</p>
+<pre><code class="language-rust">// An async main needs a runtime; tokio is assumed here
+#[tokio::main]
+async fn main() -&gt;
Result&lt;()&gt; {
+ let ctx = create_context()?;
+
+ // get a DataFrame from the context
+ let mut df = ctx.table("t").await?;
+
+ // Create the string UUIDs
+ df = df.select(vec![uuid().alias("string_uuid")])?;
+
+ // Convert string UUIDs to canonical extension UUIDs
+ let string_to_uuid = ScalarUDF::new_from_impl(StringToUuid::default());
+ df = df.with_column("uuid",
string_to_uuid.call(vec![col("string_uuid")]))?;
+
+ // Extract version number from canonical extension UUIDs
+ let version = ScalarUDF::new_from_impl(UuidVersion::default());
+ df = df.with_column("version", version.call(vec![col("uuid")]))?;
+
+ // Convert back to a string
+ let uuid_to_string = ScalarUDF::new_from_impl(UuidToString::default());
+ df = df.with_column("string_round_trip",
uuid_to_string.call(vec![col("uuid")]))?;
+
+ df.show().await?;
+
+ Ok(())
+}
+</code></pre>
+<p>The <a
href="https://github.com/timsaucer/datafusion_extension_type_examples">example
repository</a> also contains a crate that demonstrates how to expose
these
+UDFs to <a
href="https://datafusion.apache.org/python/">datafusion-python</a>.
This requires version 48.0.0 or later.</p>
+<h2 id="other-use-cases">Other use cases<a class="headerlink"
href="#other-use-cases" title="Permanent link">¶</a></h2>
+<p>The metadata attached to the fields can be used to store
<em>any</em> user data in key/value
+pairs. Some of the other use cases that have been identified include:</p>
+<ul>
+<li>Creating output for downstream systems. One user of DataFusion
produces
+ <a href="https://rerun.io/blog/column-chunks">data
visualizations</a> that are dependent upon metadata in record batch
fields. By
+ enabling metadata on output of user defined functions, we can now produce
batches
+ that are directly consumable by these systems.</li>
+<li>Describing the relationships between columns of data. You can store
data about how
+ one column of data relates to another and use these during function
evaluation. For
+ example, in robotics it is common to use <a
href="https://wiki.ros.org/tf2">transforms</a> to describe how to
convert
+ from one coordinate system to another. It can be convenient to send the
function
+ all the columns that contain transform information and then allow the
function
+ to determine which columns to use based on the metadata. This allows for
+ encapsulation of the transform logic within the user function.</li>
+<li>Storing logical types of the data model. <a
href="https://docs.influxdata.com/influxdb/v1/concepts/schema_and_data_layout/">InfluxDB</a>
uses field metadata to specify
+ which columns are used for tags, times, and fields.</li>
+</ul>
+<p>Based on the experience of the authors, we recommend caution when
using metadata
+for use cases other than type extension. One issue that can arise is that as
columns
+are used to compute new fields, some functions may pass through the metadata
and the
+semantic meaning may change. For example, suppose you decided to use metadata
to
+store some kind of statistics for the entire stream of record batches. Then
you pass
+that column through a filter that removes many rows of data. Your statistics
+metadata may now be invalid, even though it was passed through the
filter.</p>
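+<p>The pitfall described above can be sketched in Rust with plain standard-library types. This is a minimal illustration and not DataFusion's API: the <code>Column</code> struct, the <code>stats.row_count</code> key, and all function names are hypothetical stand-ins for an Arrow array and its field metadata.</p>

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for an Arrow column: values plus field-level metadata.
#[derive(Debug, Clone)]
pub struct Column {
    pub values: Vec<i64>,
    pub metadata: HashMap<String, String>,
}

/// Build a column and record its row count in metadata under the
/// illustrative key "stats.row_count".
pub fn with_row_count(values: Vec<i64>) -> Column {
    let mut metadata = HashMap::new();
    metadata.insert("stats.row_count".to_string(), values.len().to_string());
    Column { values, metadata }
}

/// A filter that keeps matching rows and copies the metadata unchanged.
/// Passing metadata through like this is how the statistic becomes stale.
pub fn filter_passthrough(col: &Column, keep: impl Fn(i64) -> bool) -> Column {
    Column {
        values: col.values.iter().copied().filter(|v| keep(*v)).collect(),
        metadata: col.metadata.clone(),
    }
}

/// Returns true only when the recorded row count still matches the data.
pub fn row_count_is_valid(col: &Column) -> bool {
    col.metadata
        .get("stats.row_count")
        .and_then(|s| s.parse::<usize>().ok())
        .map_or(false, |n| n == col.values.len())
}
```

+<p>Filtering a five-row column built with <code>with_row_count</code> down to two rows leaves the metadata claiming five rows, so <code>row_count_is_valid</code> returns false for the filtered column. A real pipeline would need either to recompute such metadata or to drop it whenever rows are removed.</p>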
+<p>Similarly, if you use metadata to form relations between one column
and another and
+the columns have been renamed at some point in your workflow, then the
metadata
+may refer to the wrong column. This can be mitigated by
+identifying columns through additional metadata attached to every column of
+interest rather than through column names.</p>
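+<p>One way to picture this mitigation is to look columns up by a metadata entry instead of by name. The sketch below uses only the Rust standard library; <code>FieldInfo</code>, <code>find_by_metadata</code>, and the <code>role</code> key are hypothetical names chosen for illustration and are not part of Arrow or DataFusion.</p>

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for an Arrow field: a name plus key/value metadata.
#[derive(Debug, Clone)]
pub struct FieldInfo {
    pub name: String,
    pub metadata: HashMap<String, String>,
}

impl FieldInfo {
    /// Convenience constructor taking metadata as a slice of (key, value) pairs.
    pub fn new(name: &str, pairs: &[(&str, &str)]) -> Self {
        let metadata = pairs
            .iter()
            .map(|(k, v)| (k.to_string(), v.to_string()))
            .collect();
        FieldInfo { name: name.to_string(), metadata }
    }
}

/// Return the index of the first field whose metadata maps `key` to `value`.
/// The field name is never consulted, so renaming a column does not affect
/// the lookup.
pub fn find_by_metadata(fields: &[FieldInfo], key: &str, value: &str) -> Option<usize> {
    fields
        .iter()
        .position(|f| f.metadata.get(key).map(String::as_str) == Some(value))
}
```

+<p>Because the lookup depends only on the metadata entry, a column tagged with <code>role = transform</code> is still found after an upstream step renames it, provided that step preserves the field metadata.</p>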
+<h2 id="acknowledgements">Acknowledgements<a class="headerlink"
href="#acknowledgements" title="Permanent link">¶</a></h2>
+<p>We would like to thank <a
href="https://rerun.io">Rerun.io</a> for sponsoring the development of
this work. <a href="https://rerun.io">Rerun.io</a>
+is building a data visualization system for Physical AI and uses metadata to
specify
+context about columns in Arrow record batches.</p>
+<h2 id="conclusion">Conclusion<a class="headerlink"
href="#conclusion" title="Permanent link">¶</a></h2>
+<p>The enhanced metadata handling in <a href="https://datafusion.apache.org/blog/2025/07/16/datafusion-48.0.0/">DataFusion 48.0.0</a> is a
significant step
+forward in the ability to handle more interesting types of data. Users can
+validate that the input data matches the intent of the data to be processed, enable
+complex operations on binary data because the encoding used is known, and
+use metadata to create new and interesting user defined data types.
+We can't wait to see what you build with it!</p>
+<h2 id="get-involved">Get Involved<a class="headerlink"
href="#get-involved" title="Permanent link">¶</a></h2>
+<p>The DataFusion community is active and engaged, and we would
love to have you join
+us and help the project.</p>
+<p>Here are some ways to get involved:</p>
+<ul>
+<li>Learn more by visiting the <a
href="https://datafusion.apache.org/index.html">DataFusion</a> project
page.</li>
+<li>Try out the project and provide feedback, file issues, and
contribute code.</li>
+<li>Work on a <a
href="https://github.com/apache/datafusion/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22">good
first issue</a>.</li>
+<li>Reach out to us via the <a
href="https://datafusion.apache.org/contributor-guide/communication.html">communication
doc</a>.</li>
+</ul></content><category term="blog"></category></entry></feed>
\ No newline at end of file
diff --git
a/output/feeds/tim-saucerrerunio-dewey-dunningtonwherobots-andrew-lambinfluxdata.rss.xml
b/output/feeds/tim-saucerrerunio-dewey-dunningtonwherobots-andrew-lambinfluxdata.rss.xml
new file mode 100644
index 0000000..1762954
--- /dev/null
+++
b/output/feeds/tim-saucerrerunio-dewey-dunningtonwherobots-andrew-lambinfluxdata.rss.xml
@@ -0,0 +1,24 @@
+<?xml version="1.0" encoding="utf-8"?>
+<rss version="2.0"><channel><title>Apache DataFusion Blog - Tim
Saucer(rerun.io), Dewey Dunnington(Wherobots), Andrew
Lamb(InfluxData)</title><link>https://datafusion.apache.org/blog/</link><description></description><lastBuildDate>Sun,
21 Sep 2025 00:00:00 +0000</lastBuildDate><item><title>Implementing User
Defined Types and Custom Metadata in
DataFusion</title><link>https://datafusion.apache.org/blog/2025/09/21/custom-types-using-metadata</link><description><!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+http://www.apache.org/licenses/LICENSE-2.0
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+<p><a
href="https://datafusion.apache.org/blog/2025/07/16/datafusion-48.0.0/">Apache
DataFusion</a> significantly improves support for user
+defined types and metadata. The user defined function APIs let users access
+metadata on the input columns to functions and produce metadata in the
output.</p>
+<h2 id="user-defined-types-extension-types">User defined types ==
extension types<a class="headerlink"
href="#user-defined-types-extension-types" title="Permanent
link">¶</a></h2>
+<p>DataFusion directly uses <a
href="https://arrow.apache.org">Apache Arrow</a>'s <a
href="https://docs.rs/arrow/latest/arrow/datatypes/enum.DataType.html">DataTypes</a>
as its type system. This
+has …</p></description><dc:creator
xmlns:dc="http://purl.org/dc/elements/1.1/">Tim Saucer(rerun.io), Dewey
Dunnington(Wherobots), Andrew Lamb(InfluxData)</dc:creator><pubDate>Sun, 21 Sep
2025 00:00:00 +0000</pubDate><guid
isPermaLink="false">tag:datafusion.apache.org,2025-09-21:/blog/2025/09/21/custom-types-using-metadata</guid><category>blog</category></item></channel></rss>
\ No newline at end of file
diff --git a/output/images/metadata-handling/arrow_record_batch.png
b/output/images/metadata-handling/arrow_record_batch.png
new file mode 100644
index 0000000..d925b32
Binary files /dev/null and
b/output/images/metadata-handling/arrow_record_batch.png differ
diff --git a/output/index.html b/output/index.html
index 40fc132..a83b228 100644
--- a/output/index.html
+++ b/output/index.html
@@ -45,6 +45,47 @@
<p><i>Here you can find the latest updates from DataFusion and
related projects.</i></p>
+ <!-- Post -->
+ <div class="row">
+ <div class="callout">
+ <article class="post">
+ <header>
+ <div class="title">
+ <h1><a
href="/blog/2025/09/21/custom-types-using-metadata">Implementing User Defined
Types and Custom Metadata in DataFusion</a></h1>
+ <p>Posted on: Sun 21 September 2025 by Tim
Saucer(rerun.io), Dewey Dunnington(Wherobots), Andrew Lamb(InfluxData)</p>
+ <p><!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+http://www.apache.org/licenses/LICENSE-2.0
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+<p><a
href="https://datafusion.apache.org/blog/2025/07/16/datafusion-48.0.0/">Apache
DataFusion</a> significantly improves support for user
+defined types and metadata. The user defined function APIs let users access
+metadata on the input columns to functions and produce metadata in the
output.</p>
+<h2 id="user-defined-types-extension-types">User defined types == extension
types<a class="headerlink" href="#user-defined-types-extension-types"
title="Permanent link">¶</a></h2>
+<p>DataFusion directly uses <a href="https://arrow.apache.org">Apache
Arrow</a>'s <a
href="https://docs.rs/arrow/latest/arrow/datatypes/enum.DataType.html">DataTypes</a>
as its type system. This
+has …</p></p>
+ <footer>
+ <ul class="actions">
+ <div style="text-align: right"><a
href="/blog/2025/09/21/custom-types-using-metadata" class="button
medium">Continue Reading</a></div>
+ </ul>
+ <ul class="stats">
+ </ul>
+ </footer>
+ </article>
+ </div>
+ </div>
<!-- Post -->
<div class="row">
<div class="callout">