This is an automated email from the ASF dual-hosted git repository. github-bot pushed a commit to branch asf-staging in repository https://gitbox.apache.org/repos/asf/datafusion-site.git
The following commit(s) were added to refs/heads/asf-staging by this push: new 3e08837 Commit build products 3e08837 is described below commit 3e08837afd2e091ca49398b818899b065524eea1 Author: Build Pelican (action) <priv...@infra.apache.org> AuthorDate: Tue Jul 29 22:45:12 2025 +0000 Commit build products --- .../{06/09 => 07/29}/metadata-handling/index.html | 2 +- .../tim-saucer-dewey-dunnington-andrew-lamb.html | 4 +- blog/category/blog.html | 58 +-- blog/feed.xml | 42 +- blog/feeds/all-en.atom.xml | 438 ++++++++++----------- blog/feeds/blog.atom.xml | 438 ++++++++++----------- ...im-saucer-dewey-dunnington-andrew-lamb.atom.xml | 2 +- ...tim-saucer-dewey-dunnington-andrew-lamb.rss.xml | 4 +- blog/index.html | 76 ++-- 9 files changed, 532 insertions(+), 532 deletions(-) diff --git a/blog/2025/06/09/metadata-handling/index.html b/blog/2025/07/29/metadata-handling/index.html similarity index 99% rename from blog/2025/06/09/metadata-handling/index.html rename to blog/2025/07/29/metadata-handling/index.html index 8a3ecaf..9d59e70 100644 --- a/blog/2025/06/09/metadata-handling/index.html +++ b/blog/2025/07/29/metadata-handling/index.html @@ -42,7 +42,7 @@ <h1> Field metadata and extension type support in user defined functions </h1> - <p>Posted on: Mon 09 June 2025 by Tim Saucer, Dewey Dunnington, Andrew Lamb</p> + <p>Posted on: Tue 29 July 2025 by Tim Saucer, Dewey Dunnington, Andrew Lamb</p> <!-- {% comment %} Licensed to the Apache Software Foundation (ASF) under one or more diff --git a/blog/author/tim-saucer-dewey-dunnington-andrew-lamb.html b/blog/author/tim-saucer-dewey-dunnington-andrew-lamb.html index 66deb91..21035cc 100644 --- a/blog/author/tim-saucer-dewey-dunnington-andrew-lamb.html +++ b/blog/author/tim-saucer-dewey-dunnington-andrew-lamb.html @@ -21,9 +21,9 @@ <ol id="post-list"> <li><article class="hentry"> - <header> <h2 class="entry-title"><a href="https://datafusion.apache.org/blog/2025/06/09/metadata-handling" rel="bookmark" title="Permalink to 
Field metadata and extension type support in user defined functions">Field metadata and extension type support in user defined functions</a></h2> </header> + <header> <h2 class="entry-title"><a href="https://datafusion.apache.org/blog/2025/07/29/metadata-handling" rel="bookmark" title="Permalink to Field metadata and extension type support in user defined functions">Field metadata and extension type support in user defined functions</a></h2> </header> <footer class="post-info"> - <time class="published" datetime="2025-06-09T00:00:00+00:00"> Mon 09 June 2025 </time> + <time class="published" datetime="2025-07-29T00:00:00+00:00"> Tue 29 July 2025 </time> <address class="vcard author">By <a class="url fn" href="https://datafusion.apache.org/blog/author/tim-saucer-dewey-dunnington-andrew-lamb.html">Tim Saucer, Dewey Dunnington, Andrew Lamb</a> </address> diff --git a/blog/category/blog.html b/blog/category/blog.html index d7d0128..a5cf22f 100644 --- a/blog/category/blog.html +++ b/blog/category/blog.html @@ -21,6 +21,35 @@ <h2>Articles in the blog category</h2> <ol id="post-list"> + <li><article class="hentry"> + <header> <h2 class="entry-title"><a href="https://datafusion.apache.org/blog/2025/07/29/metadata-handling" rel="bookmark" title="Permalink to Field metadata and extension type support in user defined functions">Field metadata and extension type support in user defined functions</a></h2> </header> + <footer class="post-info"> + <time class="published" datetime="2025-07-29T00:00:00+00:00"> Tue 29 July 2025 </time> + <address class="vcard author">By + <a class="url fn" href="https://datafusion.apache.org/blog/author/tim-saucer-dewey-dunnington-andrew-lamb.html">Tim Saucer, Dewey Dunnington, Andrew Lamb</a> + </address> + </footer><!-- /.post-info --> + <div class="entry-content"> <!-- +{% comment %} +Licensed to the Apache Software Foundation (ASF) under one or more +contributor license agreements. 
See the NOTICE file distributed with +this work for additional information regarding copyright ownership. +The ASF licenses this file to you under the Apache License, Version 2.0 +(the "License"); you may not use this file except in compliance with +the License. You may obtain a copy of the License at +http://www.apache.org/licenses/LICENSE-2.0 +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. +{% endcomment %}x +--> +<p><a href="https://datafusion.apache.org/blog/2025/07/16/datafusion-48.0.0/">DataFusion 48.0.0</a> introduced a change in the interface for writing custom functions +which enables a variety of interesting improvements. Now users can access metadata on +the input columns to functions and produce metadata in the output.</p> +<p>Metadata is specified as a map of key-value pairs of strings. This …</p> </div><!-- /.entry-content --> + </article></li> <li><article class="hentry"> <header> <h2 class="entry-title"><a href="https://datafusion.apache.org/blog/2025/07/28/datafusion-49.0.0" rel="bookmark" title="Permalink to Apache DataFusion 49.0.0 Released">Apache DataFusion 49.0.0 Released</a></h2> </header> <footer class="post-info"> @@ -290,35 +319,6 @@ limitations under the License. role it plays, and described how industrial optimizers are organized. 
In this second post, we describe various optimizations that are found in <a href="https://datafusion.apache.org/">Apache DataFusion</a> and …</p> </div><!-- /.entry-content --> - </article></li> - <li><article class="hentry"> - <header> <h2 class="entry-title"><a href="https://datafusion.apache.org/blog/2025/06/09/metadata-handling" rel="bookmark" title="Permalink to Field metadata and extension type support in user defined functions">Field metadata and extension type support in user defined functions</a></h2> </header> - <footer class="post-info"> - <time class="published" datetime="2025-06-09T00:00:00+00:00"> Mon 09 June 2025 </time> - <address class="vcard author">By - <a class="url fn" href="https://datafusion.apache.org/blog/author/tim-saucer-dewey-dunnington-andrew-lamb.html">Tim Saucer, Dewey Dunnington, Andrew Lamb</a> - </address> - </footer><!-- /.post-info --> - <div class="entry-content"> <!-- -{% comment %} -Licensed to the Apache Software Foundation (ASF) under one or more -contributor license agreements. See the NOTICE file distributed with -this work for additional information regarding copyright ownership. -The ASF licenses this file to you under the Apache License, Version 2.0 -(the "License"); you may not use this file except in compliance with -the License. You may obtain a copy of the License at -http://www.apache.org/licenses/LICENSE-2.0 -Unless required by applicable law or agreed to in writing, software -distributed under the License is distributed on an "AS IS" BASIS, -WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -See the License for the specific language governing permissions and -limitations under the License. -{% endcomment %}x ---> -<p><a href="https://datafusion.apache.org/blog/2025/07/16/datafusion-48.0.0/">DataFusion 48.0.0</a> introduced a change in the interface for writing custom functions -which enables a variety of interesting improvements. 
Now users can access metadata on -the input columns to functions and produce metadata in the output.</p> -<p>Metadata is specified as a map of key-value pairs of strings. This …</p> </div><!-- /.entry-content --> </article></li> <li><article class="hentry"> <header> <h2 class="entry-title"><a href="https://datafusion.apache.org/blog/2025/05/06/datafusion-comet-0.8.0" rel="bookmark" title="Permalink to Apache DataFusion Comet 0.8.0 Release">Apache DataFusion Comet 0.8.0 Release</a></h2> </header> diff --git a/blog/feed.xml b/blog/feed.xml index 1458ca2..1fb7a90 100644 --- a/blog/feed.xml +++ b/blog/feed.xml @@ -1,5 +1,24 @@ <?xml version="1.0" encoding="utf-8"?> -<rss version="2.0"><channel><title>Apache DataFusion Blog</title><link>https://datafusion.apache.org/blog/</link><description></description><lastBuildDate>Mon, 28 Jul 2025 00:00:00 +0000</lastBuildDate><item><title>Apache DataFusion 49.0.0 Released</title><link>https://datafusion.apache.org/blog/2025/07/28/datafusion-49.0.0</link><description><!-- +<rss version="2.0"><channel><title>Apache DataFusion Blog</title><link>https://datafusion.apache.org/blog/</link><description></description><lastBuildDate>Tue, 29 Jul 2025 00:00:00 +0000</lastBuildDate><item><title>Field metadata and extension type support in user defined functions</title><link>https://datafusion.apache.org/blog/2025/07/29/metadata-handling</link><description><!-- +{% comment %} +Licensed to the Apache Software Foundation (ASF) under one or more +contributor license agreements. See the NOTICE file distributed with +this work for additional information regarding copyright ownership. +The ASF licenses this file to you under the Apache License, Version 2.0 +(the "License"); you may not use this file except in compliance with +the License. 
You may obtain a copy of the License at +http://www.apache.org/licenses/LICENSE-2.0 +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. +{% endcomment %}x +--> +<p><a href="https://datafusion.apache.org/blog/2025/07/16/datafusion-48.0.0/">DataFusion 48.0.0</a> introduced a change in the interface for writing custom functions +which enables a variety of interesting improvements. Now users can access metadata on +the input columns to functions and produce metadata in the output.</p> +<p>Metadata is specified as a map of key-value pairs of strings. This …</p></description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Tim Saucer, Dewey Dunnington, Andrew Lamb</dc:creator><pubDate>Tue, 29 Jul 2025 00:00:00 +0000</pubDate><guid isPermaLink="false">tag:datafusion.apache.org,2025-07-29:/blog/2025/07/29/metadata-handling</guid><category>blog</category></item><item><title>Apache DataFusion 49.0.0 Released</title><link>https://datafusion.apache.org/blog/ [...] {% comment %} Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements. See the NOTICE file distributed with @@ -189,26 +208,7 @@ limitations under the License. <p>In the <a href="https://datafusion.apache.org/blog/2025/06/15/optimizing-sql-dataframes-part-one">first part of this post</a>, we discussed what a Query Optimizer is, what role it plays, and described how industrial optimizers are organized. 
In this second post, we describe various optimizations that are found in <a href="https://datafusion.apache.org/">Apache -DataFusion</a> and …</p></description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">alamb, akurmustafa</dc:creator><pubDate>Sun, 15 Jun 2025 00:00:00 +0000</pubDate><guid isPermaLink="false">tag:datafusion.apache.org,2025-06-15:/blog/2025/06/15/optimizing-sql-dataframes-part-two</guid><category>blog</category></item><item><title>Field metadata and extension type support in user defined functions</title><link>https://datafusion.apache.org/blog/2025/06/09/metadata-han [...] -{% comment %} -Licensed to the Apache Software Foundation (ASF) under one or more -contributor license agreements. See the NOTICE file distributed with -this work for additional information regarding copyright ownership. -The ASF licenses this file to you under the Apache License, Version 2.0 -(the "License"); you may not use this file except in compliance with -the License. You may obtain a copy of the License at -http://www.apache.org/licenses/LICENSE-2.0 -Unless required by applicable law or agreed to in writing, software -distributed under the License is distributed on an "AS IS" BASIS, -WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -See the License for the specific language governing permissions and -limitations under the License. -{% endcomment %}x ---> -<p><a href="https://datafusion.apache.org/blog/2025/07/16/datafusion-48.0.0/">DataFusion 48.0.0</a> introduced a change in the interface for writing custom functions -which enables a variety of interesting improvements. Now users can access metadata on -the input columns to functions and produce metadata in the output.</p> -<p>Metadata is specified as a map of key-value pairs of strings. 
This …</p></description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Tim Saucer, Dewey Dunnington, Andrew Lamb</dc:creator><pubDate>Mon, 09 Jun 2025 00:00:00 +0000</pubDate><guid isPermaLink="false">tag:datafusion.apache.org,2025-06-09:/blog/2025/06/09/metadata-handling</guid><category>blog</category></item><item><title>Apache DataFusion Comet 0.8.0 Release</title><link>https://datafusion.apache.org/b [...] +DataFusion</a> and …</p></description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">alamb, akurmustafa</dc:creator><pubDate>Sun, 15 Jun 2025 00:00:00 +0000</pubDate><guid isPermaLink="false">tag:datafusion.apache.org,2025-06-15:/blog/2025/06/15/optimizing-sql-dataframes-part-two</guid><category>blog</category></item><item><title>Apache DataFusion Comet 0.8.0 Release</title><link>https://datafusion.apache.org/blog/2025/05/06/datafusion-comet-0.8.0</link><description><!-- {% comment %} Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements. See the NOTICE file distributed with diff --git a/blog/feeds/all-en.atom.xml b/blog/feeds/all-en.atom.xml index ef3b25f..123d961 100644 --- a/blog/feeds/all-en.atom.xml +++ b/blog/feeds/all-en.atom.xml @@ -1,5 +1,222 @@ <?xml version="1.0" encoding="utf-8"?> -<feed xmlns="http://www.w3.org/2005/Atom"><title>Apache DataFusion Blog</title><link href="https://datafusion.apache.org/blog/" rel="alternate"></link><link href="https://datafusion.apache.org/blog/feeds/all-en.atom.xml" rel="self"></link><id>https://datafusion.apache.org/blog/</id><updated>2025-07-28T00:00:00+00:00</updated><subtitle></subtitle><entry><title>Apache DataFusion 49.0.0 Released</title><link href="https://datafusion.apache.org/blog/2025/07/28/datafusion-49.0.0" rel="alterna [...] 
+<feed xmlns="http://www.w3.org/2005/Atom"><title>Apache DataFusion Blog</title><link href="https://datafusion.apache.org/blog/" rel="alternate"></link><link href="https://datafusion.apache.org/blog/feeds/all-en.atom.xml" rel="self"></link><id>https://datafusion.apache.org/blog/</id><updated>2025-07-29T00:00:00+00:00</updated><subtitle></subtitle><entry><title>Field metadata and extension type support in user defined functions</title><link href="https://datafusion.apache.org/blog/2025/07/ [...] +{% comment %} +Licensed to the Apache Software Foundation (ASF) under one or more +contributor license agreements. See the NOTICE file distributed with +this work for additional information regarding copyright ownership. +The ASF licenses this file to you under the Apache License, Version 2.0 +(the "License"); you may not use this file except in compliance with +the License. You may obtain a copy of the License at +http://www.apache.org/licenses/LICENSE-2.0 +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. +{% endcomment %}x +--> +<p><a href="https://datafusion.apache.org/blog/2025/07/16/datafusion-48.0.0/">DataFusion 48.0.0</a> introduced a change in the interface for writing custom functions +which enables a variety of interesting improvements. Now users can access metadata on +the input columns to functions and produce metadata in the output.</p> +<p>Metadata is specified as a map of key-value pairs of strings. This …</p></summary><content type="html"><!-- +{% comment %} +Licensed to the Apache Software Foundation (ASF) under one or more +contributor license agreements. See the NOTICE file distributed with +this work for additional information regarding copyright ownership. 
+The ASF licenses this file to you under the Apache License, Version 2.0 +(the "License"); you may not use this file except in compliance with +the License. You may obtain a copy of the License at +http://www.apache.org/licenses/LICENSE-2.0 +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. +{% endcomment %}x +--> +<p><a href="https://datafusion.apache.org/blog/2025/07/16/datafusion-48.0.0/">DataFusion 48.0.0</a> introduced a change in the interface for writing custom functions +which enables a variety of interesting improvements. Now users can access metadata on +the input columns to functions and produce metadata in the output.</p> +<p>Metadata is specified as a map of key-value pairs of strings. This extra metadata is used +by Arrow implementations implement <a href="https://arrow.apache.org/docs/format/Columnar.html#format-metadata-extension-types">extension types</a> and can also be used to add +use case-specific context to a column of values where the formality of an extension type +is not required. In previous versions of DataFusion field metadata was propagated through +certain operations (e.g., renaming or selecting a column) but was not accessible to others +(e.g., scalar, window, or aggregate function calls). 
In the new implementation, during +processing of all user defined functions we pass the input field information and allow +user defined function implementations to return field information to the caller.</p> +<p><a href="https://arrow.apache.org/docs/format/Columnar.html#format-metadata-extension-types">Extension types</a> are user defined data types where the data is stored using one of the +existing <a href="https://arrow.apache.org/docs/format/Columnar.html#data-types">Arrow data types</a> but the metadata specifies how we are to interpret the +stored data. The use of extension types was one of the primary motivations for adding +metadata to the function processing, but arbitrary metadata can be put on the input and +output fields. This allows for a range of other interesting use cases.</p> +<h2>Why metadata handling is important</h2> +<p>Data in Arrow record batches carry a <code>Schema</code> in addition to the Arrow arrays. Each +<a href="https://arrow.apache.org/docs/format/Glossary.html#term-field">Field</a> in this <code>Schema</code> contains a name, data type, nullability, and metadata. The +metadata is specified as a map of key-value pairs of strings. In the new +implementation, during processing of all user defined functions we pass the input +field information.</p> +<figure> +<img alt="Relationship between a Record Batch, it's schema, and the underlying arrays. There is a one to one relationship between each Field in the Schema and Array entry in the Columns." class="img-responsive" src="/blog/images/metadata-handling/arrow_record_batch.png" width="100%"/> +<figcaption> + Relationship between a Record Batch, it's schema, and the underlying arrays. There is a one to one relationship between each Field in the Schema and Array entry in the Columns. + </figcaption> +</figure> +<p>It is often desirable to write a generic function for reuse. 
With the prior version of +user defined functions, we only had access to the <code>DataType</code> of the input columns. This +works well for some features that only rely on the types of data. Other use cases may +need additional information that describes the data.</p> +<p>For example, suppose I wish to write a function that takes in a UUID and returns a string +of the <a href="https://www.ietf.org/rfc/rfc9562.html#section-4.1">variant</a> of the input field. We would want this function to be able to handle +all of the string types and also a binary encoded UUID. The arrow specification does not +contain a unsigned 128 bit value, it is common to encode a UUID as a fixed sized binary +array where each element is 16 bytes long. With the metadata handling in <a href="https://datafusion.apache.org/blog/2025/07/16/datafusion-48.0.0/">DataFusion 48.0.0</a> +we can validate during planning that the input data not only has the correct underlying +data type, but that it also represents the right <em>kind</em> of data. The UUID example is a +common one, and it is included in the <a href="https://arrow.apache.org/docs/format/CanonicalExtensions.html">canonical extension types</a> that are now +supported in DataFusion.</p> +<p>Another common application of metadata handling is understanding encoding of a blob of data. +Suppose you have a column that contains image data. Most likely this data is stored as +an array of <code>u8</code> data. Without knowing a priori what the encoding of that blob of data is, +you cannot ensure you are using the correct methods for decoding it. You may work around +this by adding another column to your data source indicating the encoding, but this can be +wasteful for systems where the encoding never changes. 
Instead, you could use metadata to +specify the encoding for the entire column.</p> +<h2>How to use metadata in user defined functions</h2> +<p>When working with metadata for user defined scalar functions, there are typically two +places in the function definition that require implementation.</p> +<ul> +<li>Computing the return field from the arguments</li> +<li>Invocation</li> +</ul> +<p>During planning, we will attempt to call the function <code>return_field_from_args()</code>. This will +provide a list of input fields to the function and return the output field. To evaluate +metadata on the input side, you can write a functions similar to this example:</p> +<pre><code class="language-rust">fn return_field_from_args( + &amp;self, + args: ReturnFieldArgs, +) -&gt; datafusion::common::Result&lt;FieldRef&gt; { + if args.arg_fields.len() != 1 { + return exec_err!("Incorrect number of arguments for uuid_version"); + } + + let input_field = &amp;args.arg_fields[0]; + if &amp;DataType::FixedSizeBinary(16) == input_field.data_type() { + let Ok(CanonicalExtensionType::Uuid(_)) = input_field.try_canonical_extension_type() + else { + return exec_err!("Input field must contain the UUID canonical extension type"); + }; + } + + Ok(Arc::new(Field::new(self.name(), DataType::UInt32, true))) +} +</code></pre> +<p>In this example, we take advantage of the fact that we already have support for extension +types that evaluate metadata. 
If you were attempting to check for metadata other than +extension type support, we could have instead written a snippet such as:</p> +<pre><code class="language-rust"> if &amp;DataType::FixedSizeBinary(16) == input_field.data_type() { + let _ = input_field + .metadata() + .get("ARROW:extension:metadata") + .ok_or(exec_datafusion_err!("Input field must contain the UUID canonical extension type"))?; + }; + } +</code></pre> +<p>If you are writing a user defined function that will instead return metadata on output +you can add this directly into the <code>Field</code> that is the output of the <code>return_field_from_args</code> +call. In our above example, we could change the return line to:</p> +<pre><code class="language-rust"> Ok(Arc::new( + Field::new(self.name(), DataType::UInt32, true).with_metadata( + [("my_key".to_string(), "my_value".to_string())] + .into_iter() + .collect(), + ), + )) +</code></pre> +<p>By checking the metadata during the planning process, we can identify errors early in +the query process. There are cases were we wish to have access to this metadata during +execution as well. The function <code>invoke_with_args</code> in the user defined function takes +the updated struct <code>ScalarFunctionArgs</code>. This now contains the input fields, which can +be used to check for metadata. For example, you can do the following:</p> +<pre><code class="language-rust">fn invoke_with_args(&amp;self, args: ScalarFunctionArgs) -&gt; Result&lt;ColumnarValue&gt; { + assert_eq!(args.arg_fields.len(), 1); + let my_value = args.arg_fields[0] + .metadata() + .get("encoding_type"); + ... +</code></pre> +<p>In this snippet we have extracted an <code>Option&lt;String&gt;</code> from the input field metadata +which we can then use to determine which functions we might want to call. We could +then parse the returned value to determine what type of encoding to use when +evaluating the array in the arguments. 
Since <code>return_field_from_args</code> is not <code>&amp;mut self</code> +this check could not be performed during the planning stage.</p> +<p>The description in this section applies to scalar user defined functions, but equivalent +support exists for aggregate and window functions.</p> +<h2>Extension types</h2> +<p>Extension types are one of the primary motivations for this enhancement in +<a href="https://datafusion.apache.org/blog/2025/07/16/datafusion-48.0.0/">Datafusion 48.0.0</a>. The official Rust implementation of Apache Arrow, <a href="https://github.com/apache/arrow-rs">arrow-rs</a>, +already contains support for the <a href="https://arrow.apache.org/docs/format/CanonicalExtensions.html">canonical extension types</a>. This support includes +helper functions such as <code>try_canonical_extension_type()</code> in the earlier example.</p> +<p>For a concrete example of how extension types can be used in DataFusion functions, +there is an <a href="https://github.com/timsaucer/datafusion_extension_type_examples">example repository</a> that demonstrates using UUIDs. The UUID extension +type specifies that the data are stored as a Fixed Size Binary of length 16. In the +DataFusion core functions, we have the ability to generate string representations of +UUIDs that match the version 4 specification. These are helpful, but a user may +wish to do additional work with UUIDs where having them in the dense representation +is preferable. Alternatively, the user may already have data with the binary encoding +and we want to extract values such as the version, timestamp, or string +representation.</p> +<p>In the example repository we have created three user defined functions: <code>UuidVersion</code>, +<code>StringToUuid</code>, and <code>UuidToString</code>. 
Each of these implements <code>ScalarUDFImpl</code> and can +be used thusly:</p> +<pre><code class="language-rust">async fn main() -&gt; Result&lt;()&gt; { + let ctx = create_context()?; + + // get a DataFrame from the context + let mut df = ctx.table("t").await?; + + // Create the string UUIDs + df = df.select(vec![uuid().alias("string_uuid")])?; + + // Convert string UUIDs to canonical extension UUIDs + let string_to_uuid = ScalarUDF::new_from_impl(StringToUuid::default()); + df = df.with_column("uuid", string_to_uuid.call(vec![col("string_uuid")]))?; + + // Extract version number from canonical extension UUIDs + let version = ScalarUDF::new_from_impl(UuidVersion::default()); + df = df.with_column("version", version.call(vec![col("uuid")]))?; + + // Convert back to a string + let uuid_to_string = ScalarUDF::new_from_impl(UuidToString::default()); + df = df.with_column("string_round_trip", uuid_to_string.call(vec![col("uuid")]))?; + + df.show().await?; + + Ok(()) +} +</code></pre> +<p>The <a href="https://github.com/timsaucer/datafusion_extension_type_examples">example repository</a> also contains a crate that demonstrates how to expose these +UDFs to <a href="https://datafusion.apache.org/python/">datafusion-python</a>. This requires version 48.0.0 or later.</p> +<h2>Thanks to our sponsor</h2> +<p>We would like to thank <a href="https://rerun.io">Rerun.io</a> for sponsoring the development of this work. <a href="https://rerun.io">Rerun.io</a> +is building a data visualization system for Physical AI and uses metadata to specify +context about columns in Arrow record batches.</p> +<h2>Conclusion</h2> +<p>The enhancements to the metadata handling in <a href="https://datafusion.apache.org/blog/2025/07/16/datafusion-48.0.0/">DataFusion 48.0.0</a> are a significant step +forward in the ability to handle more interesting types of data. We can validate the input +data matches not only the data types but also the intent of the data to be processed. 
We +can enable complex operations on binary data because we understand the encoding used. We +can also use metadata to create new and interesting user defined data types. </p> +<h2>Get Involved</h2> +<p>The DataFusion team is an active and engaging community and we would love to have you join +us and help the project.</p> +<p>Here are some ways to get involved:</p> +<ul> +<li>Learn more by visiting the <a href="https://datafusion.apache.org/index.html">DataFusion</a> project page.</li> +<li>Try out the project and provide feedback, file issues, and contribute code.</li> +<li>Work on a <a href="https://github.com/apache/datafusion/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22">good first issue</a>.</li> +<li>Reach out to us via the <a href="https://datafusion.apache.org/contributor-guide/communication.html">communication doc</a>.</li> +</ul></content><category term="blog"></category></entry><entry><title>Apache DataFusion 49.0.0 Released</title><link href="https://datafusion.apache.org/blog/2025/07/28/datafusion-49.0.0" rel="alternate"></link><published>2025-07-28T00:00:00+00:00</published><updated>2025-07-28T00:00:00+00:00</updated><author><name>pmc</name></author><id>tag:datafusion.apache.org,2025-07-28:/blog/2025/07/28/datafusion-49.0.0</id><summary type="html"><!-- {% comment %} Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements. See the NOTICE file distributed with @@ -2374,224 +2591,7 @@ community</a>. 
We welcome first time contributors as well as long time par to the fun of building a database together.</p> <h2>Notes</h2> <p><a id="footnote7"></a><sup>[7]</sup> See <a href="https://btw-2015.informatik.uni-hamburg.de/res/proceedings/Hauptband/Wiss/Neumann-Unnesting_Arbitrary_Querie.pdf">Unnesting Arbitrary Queries</a> from Neumann and Kemper for a more academic treatment.</p> -<p><a id="footnote8"></a><sup>[8]</sup> One of my favorite terms I learned from Andy Pavlo&rsquo;s CMU online lectures</p></content><category term="blog"></category></entry><entry><title>Field metadata and extension type support in user defined functions</title><link href="https://datafusion.apache.org/blog/2025/06/09/metadata-handling" rel="alternate"></link><published>2025-06-09T00:00:00+00:00</published><updated>2025-06-09T00:00:00+00:00</update [...] -{% comment %} -Licensed to the Apache Software Foundation (ASF) under one or more -contributor license agreements. See the NOTICE file distributed with -this work for additional information regarding copyright ownership. -The ASF licenses this file to you under the Apache License, Version 2.0 -(the "License"); you may not use this file except in compliance with -the License. You may obtain a copy of the License at -http://www.apache.org/licenses/LICENSE-2.0 -Unless required by applicable law or agreed to in writing, software -distributed under the License is distributed on an "AS IS" BASIS, -WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -See the License for the specific language governing permissions and -limitations under the License. -{% endcomment %}x ---> -<p><a href="https://datafusion.apache.org/blog/2025/07/16/datafusion-48.0.0/">DataFusion 48.0.0</a> introduced a change in the interface for writing custom functions -which enables a variety of interesting improvements. 
Now users can access metadata on -the input columns to functions and produce metadata in the output.</p> -<p>Metadata is specified as a map of key-value pairs of strings. This …</p></summary><content type="html"><!-- -{% comment %} -Licensed to the Apache Software Foundation (ASF) under one or more -contributor license agreements. See the NOTICE file distributed with -this work for additional information regarding copyright ownership. -The ASF licenses this file to you under the Apache License, Version 2.0 -(the "License"); you may not use this file except in compliance with -the License. You may obtain a copy of the License at -http://www.apache.org/licenses/LICENSE-2.0 -Unless required by applicable law or agreed to in writing, software -distributed under the License is distributed on an "AS IS" BASIS, -WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -See the License for the specific language governing permissions and -limitations under the License. -{% endcomment %}x ---> -<p><a href="https://datafusion.apache.org/blog/2025/07/16/datafusion-48.0.0/">DataFusion 48.0.0</a> introduced a change in the interface for writing custom functions -which enables a variety of interesting improvements. Now users can access metadata on -the input columns to functions and produce metadata in the output.</p> -<p>Metadata is specified as a map of key-value pairs of strings. This extra metadata is used -by Arrow implementations to implement <a href="https://arrow.apache.org/docs/format/Columnar.html#format-metadata-extension-types">extension types</a> and can also be used to add -use case-specific context to a column of values where the formality of an extension type -is not required. In previous versions of DataFusion, field metadata was propagated through -certain operations (e.g., renaming or selecting a column) but was not accessible to others -(e.g., scalar, window, or aggregate function calls). 
In the new implementation, during -processing of all user defined functions we pass the input field information and allow -user defined function implementations to return field information to the caller.</p> -<p><a href="https://arrow.apache.org/docs/format/Columnar.html#format-metadata-extension-types">Extension types</a> are user defined data types where the data is stored using one of the -existing <a href="https://arrow.apache.org/docs/format/Columnar.html#data-types">Arrow data types</a> but the metadata specifies how we are to interpret the -stored data. The use of extension types was one of the primary motivations for adding -metadata to the function processing, but arbitrary metadata can be put on the input and -output fields. This allows for a range of other interesting use cases.</p> -<h2>Why metadata handling is important</h2> -<p>Arrow record batches carry a <code>Schema</code> in addition to the Arrow arrays. Each -<a href="https://arrow.apache.org/docs/format/Glossary.html#term-field">Field</a> in this <code>Schema</code> contains a name, data type, nullability, and metadata. The -metadata is specified as a map of key-value pairs of strings. In the new -implementation, during processing of all user defined functions we pass the input -field information.</p> -<figure> -<img alt="Relationship between a Record Batch, its schema, and the underlying arrays. There is a one-to-one relationship between each Field in the Schema and Array entry in the Columns." class="img-responsive" src="/blog/images/metadata-handling/arrow_record_batch.png" width="100%"/> -<figcaption> - Relationship between a Record Batch, its schema, and the underlying arrays. There is a one-to-one relationship between each Field in the Schema and Array entry in the Columns. - </figcaption> -</figure> -<p>It is often desirable to write a generic function for reuse. 
With the prior version of -user defined functions, we only had access to the <code>DataType</code> of the input columns. This -works well for some features that only rely on the types of data. Other use cases may -need additional information that describes the data.</p> -<p>For example, suppose we wish to write a function that takes in a UUID and returns a string -of the <a href="https://www.ietf.org/rfc/rfc9562.html#section-4.1">variant</a> of the input field. We would want this function to be able to handle -all of the string types and also a binary encoded UUID. Because the Arrow specification does not -contain an unsigned 128-bit type, it is common to encode a UUID as a fixed size binary -array where each element is 16 bytes long. With the metadata handling in <a href="https://datafusion.apache.org/blog/2025/07/16/datafusion-48.0.0/">DataFusion 48.0.0</a> -we can validate during planning that the input data not only has the correct underlying -data type, but that it also represents the right <em>kind</em> of data. The UUID example is a -common one, and it is included in the <a href="https://arrow.apache.org/docs/format/CanonicalExtensions.html">canonical extension types</a> that are now -supported in DataFusion.</p> -<p>Another common application of metadata handling is understanding the encoding of a blob of data. -Suppose you have a column that contains image data. Most likely this data is stored as -an array of <code>u8</code> data. Without knowing a priori what the encoding of that blob of data is, -you cannot ensure you are using the correct methods for decoding it. You may work around -this by adding another column to your data source indicating the encoding, but this can be -wasteful for systems where the encoding never changes. 
Instead, you could use metadata to -specify the encoding for the entire column.</p> -<h2>How to use metadata in user defined functions</h2> -<p>When working with metadata for user defined scalar functions, there are typically two -places in the function definition that require implementation.</p> -<ul> -<li>Computing the return field from the arguments</li> -<li>Invocation</li> -</ul> -<p>During planning, we will attempt to call the function <code>return_field_from_args()</code>. This will -provide a list of input fields to the function and return the output field. To evaluate -metadata on the input side, you can write a function similar to this example:</p> -<pre><code class="language-rust">fn return_field_from_args( - &amp;self, - args: ReturnFieldArgs, -) -&gt; datafusion::common::Result&lt;FieldRef&gt; { - if args.arg_fields.len() != 1 { - return exec_err!("Incorrect number of arguments for uuid_version"); - } - - let input_field = &amp;args.arg_fields[0]; - if &amp;DataType::FixedSizeBinary(16) == input_field.data_type() { - let Ok(CanonicalExtensionType::Uuid(_)) = input_field.try_canonical_extension_type() - else { - return exec_err!("Input field must contain the UUID canonical extension type"); - }; - } - - Ok(Arc::new(Field::new(self.name(), DataType::UInt32, true))) -} -</code></pre> -<p>In this example, we take advantage of the fact that we already have support for extension -types that evaluate metadata. 
If you wanted to check for metadata other than -an extension type, you could instead write a snippet such as:</p> -<pre><code class="language-rust"> if &amp;DataType::FixedSizeBinary(16) == input_field.data_type() { - let _ = input_field - .metadata() - .get("ARROW:extension:metadata") - .ok_or(exec_datafusion_err!("Input field must contain the UUID canonical extension type"))?; - } -</code></pre> -<p>If you are writing a user defined function that will instead return metadata on output, -you can add this directly into the <code>Field</code> that is the output of the <code>return_field_from_args</code> -call. In our above example, we could change the return line to:</p> -<pre><code class="language-rust"> Ok(Arc::new( - Field::new(self.name(), DataType::UInt32, true).with_metadata( - [("my_key".to_string(), "my_value".to_string())] - .into_iter() - .collect(), - ), - )) -</code></pre> -<p>By checking the metadata during the planning process, we can identify errors early in -the query process. There are cases where we wish to have access to this metadata during -execution as well. The function <code>invoke_with_args</code> in the user defined function takes -the updated struct <code>ScalarFunctionArgs</code>. This now contains the input fields, which can -be used to check for metadata. For example, you can do the following:</p> -<pre><code class="language-rust">fn invoke_with_args(&amp;self, args: ScalarFunctionArgs) -&gt; Result&lt;ColumnarValue&gt; { - assert_eq!(args.arg_fields.len(), 1); - let my_value = args.arg_fields[0] - .metadata() - .get("encoding_type"); - ... -</code></pre> -<p>In this snippet we have extracted an <code>Option&lt;&amp;String&gt;</code> from the input field metadata -which we can then use to determine which functions we might want to call. We could -then parse the returned value to determine what type of encoding to use when -evaluating the array in the arguments. 
Since <code>return_field_from_args</code> does not take <code>&amp;mut self</code>, -this check could not be performed during the planning stage.</p> -<p>The description in this section applies to scalar user defined functions, but equivalent -support exists for aggregate and window functions.</p> -<h2>Extension types</h2> -<p>Extension types are one of the primary motivations for this enhancement in -<a href="https://datafusion.apache.org/blog/2025/07/16/datafusion-48.0.0/">DataFusion 48.0.0</a>. The official Rust implementation of Apache Arrow, <a href="https://github.com/apache/arrow-rs">arrow-rs</a>, -already contains support for the <a href="https://arrow.apache.org/docs/format/CanonicalExtensions.html">canonical extension types</a>. This support includes -helper functions such as <code>try_canonical_extension_type()</code>, used in the earlier example.</p> -<p>For a concrete example of how extension types can be used in DataFusion functions, -there is an <a href="https://github.com/timsaucer/datafusion_extension_type_examples">example repository</a> that demonstrates using UUIDs. The UUID extension -type specifies that the data are stored as a Fixed Size Binary of length 16. In the -DataFusion core functions, we have the ability to generate string representations of -UUIDs that match the version 4 specification. These are helpful, but a user may -wish to do additional work with UUIDs where having them in the dense representation -is preferable. Alternatively, the user may already have data with the binary encoding -and want to extract values such as the version, timestamp, or string -representation.</p> -<p>In the example repository we have created three user defined functions: <code>UuidVersion</code>, -<code>StringToUuid</code>, and <code>UuidToString</code>. 
Each of these implements <code>ScalarUDFImpl</code> and can -be used thusly:</p> -<pre><code class="language-rust">async fn main() -&gt; Result&lt;()&gt; { - let ctx = create_context()?; - - // get a DataFrame from the context - let mut df = ctx.table("t").await?; - - // Create the string UUIDs - df = df.select(vec![uuid().alias("string_uuid")])?; - - // Convert string UUIDs to canonical extension UUIDs - let string_to_uuid = ScalarUDF::new_from_impl(StringToUuid::default()); - df = df.with_column("uuid", string_to_uuid.call(vec![col("string_uuid")]))?; - - // Extract version number from canonical extension UUIDs - let version = ScalarUDF::new_from_impl(UuidVersion::default()); - df = df.with_column("version", version.call(vec![col("uuid")]))?; - - // Convert back to a string - let uuid_to_string = ScalarUDF::new_from_impl(UuidToString::default()); - df = df.with_column("string_round_trip", uuid_to_string.call(vec![col("uuid")]))?; - - df.show().await?; - - Ok(()) -} -</code></pre> -<p>The <a href="https://github.com/timsaucer/datafusion_extension_type_examples">example repository</a> also contains a crate that demonstrates how to expose these -UDFs to <a href="https://datafusion.apache.org/python/">datafusion-python</a>. This requires version 48.0.0 or later.</p> -<h2>Thanks to our sponsor</h2> -<p>We would like to thank <a href="https://rerun.io">Rerun.io</a> for sponsoring the development of this work. <a href="https://rerun.io">Rerun.io</a> -is building a data visualization system for Physical AI and uses metadata to specify -context about columns in Arrow record batches.</p> -<h2>Conclusion</h2> -<p>The enhancements to the metadata handling in <a href="https://datafusion.apache.org/blog/2025/07/16/datafusion-48.0.0/">DataFusion 48.0.0</a> are a significant step -forward in the ability to handle more interesting types of data. We can validate the input -data matches not only the data types but also the intent of the data to be processed. 
We -can enable complex operations on binary data because we understand the encoding used. We -can also use metadata to create new and interesting user defined data types. </p> -<h2>Get Involved</h2> -<p>The DataFusion team is an active and engaging community and we would love to have you join -us and help the project.</p> -<p>Here are some ways to get involved:</p> -<ul> -<li>Learn more by visiting the <a href="https://datafusion.apache.org/index.html">DataFusion</a> project page.</li> -<li>Try out the project and provide feedback, file issues, and contribute code.</li> -<li>Work on a <a href="https://github.com/apache/datafusion/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22">good first issue</a>.</li> -<li>Reach out to us via the <a href="https://datafusion.apache.org/contributor-guide/communication.html">communication doc</a>.</li> -</ul></content><category term="blog"></category></entry><entry><title>Apache DataFusion Comet 0.8.0 Release</title><link href="https://datafusion.apache.org/blog/2025/05/06/datafusion-comet-0.8.0" rel="alternate"></link><published>2025-05-06T00:00:00+00:00</published><updated>2025-05-06T00:00:00+00:00</updated><author><name>pmc</name></author><id>tag:datafusion.apache.org,2025-05-06:/blog/2025/05/06/datafusion-comet-0.8.0</id><summary type="html"><!-- +<p><a id="footnote8"></a><sup>[8]</sup> One of my favorite terms I learned from Andy Pavlo&rsquo;s CMU online lectures</p></content><category term="blog"></category></entry><entry><title>Apache DataFusion Comet 0.8.0 Release</title><link href="https://datafusion.apache.org/blog/2025/05/06/datafusion-comet-0.8.0" rel="alternate"></link><published>2025-05-06T00:00:00+00:00</published><updated>2025-05-06T00:00:00+00:00</updated><author><name>pmc</name [...] {% comment %} Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements. 
See the NOTICE file distributed with diff --git a/blog/feeds/blog.atom.xml b/blog/feeds/blog.atom.xml index 10a1e31..42f82c1 100644 --- a/blog/feeds/blog.atom.xml +++ b/blog/feeds/blog.atom.xml @@ -1,5 +1,222 @@ <?xml version="1.0" encoding="utf-8"?> -<feed xmlns="http://www.w3.org/2005/Atom"><title>Apache DataFusion Blog - blog</title><link href="https://datafusion.apache.org/blog/" rel="alternate"></link><link href="https://datafusion.apache.org/blog/feeds/blog.atom.xml" rel="self"></link><id>https://datafusion.apache.org/blog/</id><updated>2025-07-28T00:00:00+00:00</updated><subtitle></subtitle><entry><title>Apache DataFusion 49.0.0 Released</title><link href="https://datafusion.apache.org/blog/2025/07/28/datafusion-49.0.0" rel="al [...] +<feed xmlns="http://www.w3.org/2005/Atom"><title>Apache DataFusion Blog - blog</title><link href="https://datafusion.apache.org/blog/" rel="alternate"></link><link href="https://datafusion.apache.org/blog/feeds/blog.atom.xml" rel="self"></link><id>https://datafusion.apache.org/blog/</id><updated>2025-07-29T00:00:00+00:00</updated><subtitle></subtitle><entry><title>Field metadata and extension type support in user defined functions</title><link href="https://datafusion.apache.org/blog/202 [...] +{% comment %} +Licensed to the Apache Software Foundation (ASF) under one or more +contributor license agreements. See the NOTICE file distributed with +this work for additional information regarding copyright ownership. +The ASF licenses this file to you under the Apache License, Version 2.0 +(the "License"); you may not use this file except in compliance with +the License. You may obtain a copy of the License at +http://www.apache.org/licenses/LICENSE-2.0 +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+See the License for the specific language governing permissions and +limitations under the License. +{% endcomment %}x +--> +<p><a href="https://datafusion.apache.org/blog/2025/07/16/datafusion-48.0.0/">DataFusion 48.0.0</a> introduced a change in the interface for writing custom functions +which enables a variety of interesting improvements. Now users can access metadata on +the input columns to functions and produce metadata in the output.</p> +<p>Metadata is specified as a map of key-value pairs of strings. This …</p></summary><content type="html"><!-- +{% comment %} +Licensed to the Apache Software Foundation (ASF) under one or more +contributor license agreements. See the NOTICE file distributed with +this work for additional information regarding copyright ownership. +The ASF licenses this file to you under the Apache License, Version 2.0 +(the "License"); you may not use this file except in compliance with +the License. You may obtain a copy of the License at +http://www.apache.org/licenses/LICENSE-2.0 +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. +{% endcomment %}x +--> +<p><a href="https://datafusion.apache.org/blog/2025/07/16/datafusion-48.0.0/">DataFusion 48.0.0</a> introduced a change in the interface for writing custom functions +which enables a variety of interesting improvements. Now users can access metadata on +the input columns to functions and produce metadata in the output.</p> +<p>Metadata is specified as a map of key-value pairs of strings. 
This extra metadata is used +by Arrow implementations to implement <a href="https://arrow.apache.org/docs/format/Columnar.html#format-metadata-extension-types">extension types</a> and can also be used to add +use case-specific context to a column of values where the formality of an extension type +is not required. In previous versions of DataFusion, field metadata was propagated through +certain operations (e.g., renaming or selecting a column) but was not accessible to others +(e.g., scalar, window, or aggregate function calls). In the new implementation, during +processing of all user defined functions we pass the input field information and allow +user defined function implementations to return field information to the caller.</p> +<p><a href="https://arrow.apache.org/docs/format/Columnar.html#format-metadata-extension-types">Extension types</a> are user defined data types where the data is stored using one of the +existing <a href="https://arrow.apache.org/docs/format/Columnar.html#data-types">Arrow data types</a> but the metadata specifies how we are to interpret the +stored data. The use of extension types was one of the primary motivations for adding +metadata to the function processing, but arbitrary metadata can be put on the input and +output fields. This allows for a range of other interesting use cases.</p> +<h2>Why metadata handling is important</h2> +<p>Arrow record batches carry a <code>Schema</code> in addition to the Arrow arrays. Each +<a href="https://arrow.apache.org/docs/format/Glossary.html#term-field">Field</a> in this <code>Schema</code> contains a name, data type, nullability, and metadata. The +metadata is specified as a map of key-value pairs of strings. In the new +implementation, during processing of all user defined functions we pass the input +field information.</p> +<figure> +<img alt="Relationship between a Record Batch, its schema, and the underlying arrays. 
There is a one-to-one relationship between each Field in the Schema and Array entry in the Columns." class="img-responsive" src="/blog/images/metadata-handling/arrow_record_batch.png" width="100%"/> +<figcaption> + Relationship between a Record Batch, its schema, and the underlying arrays. There is a one-to-one relationship between each Field in the Schema and Array entry in the Columns. + </figcaption> +</figure> +<p>It is often desirable to write a generic function for reuse. With the prior version of +user defined functions, we only had access to the <code>DataType</code> of the input columns. This +works well for some features that only rely on the types of data. Other use cases may +need additional information that describes the data.</p> +<p>For example, suppose we wish to write a function that takes in a UUID and returns a string +of the <a href="https://www.ietf.org/rfc/rfc9562.html#section-4.1">variant</a> of the input field. We would want this function to be able to handle +all of the string types and also a binary encoded UUID. Because the Arrow specification does not +contain an unsigned 128-bit type, it is common to encode a UUID as a fixed size binary +array where each element is 16 bytes long. With the metadata handling in <a href="https://datafusion.apache.org/blog/2025/07/16/datafusion-48.0.0/">DataFusion 48.0.0</a> +we can validate during planning that the input data not only has the correct underlying +data type, but that it also represents the right <em>kind</em> of data. The UUID example is a +common one, and it is included in the <a href="https://arrow.apache.org/docs/format/CanonicalExtensions.html">canonical extension types</a> that are now +supported in DataFusion.</p> +<p>Another common application of metadata handling is understanding the encoding of a blob of data. +Suppose you have a column that contains image data. Most likely this data is stored as +an array of <code>u8</code> data. 
Without knowing a priori what the encoding of that blob of data is, +you cannot ensure you are using the correct methods for decoding it. You may work around +this by adding another column to your data source indicating the encoding, but this can be +wasteful for systems where the encoding never changes. Instead, you could use metadata to +specify the encoding for the entire column.</p> +<h2>How to use metadata in user defined functions</h2> +<p>When working with metadata for user defined scalar functions, there are typically two +places in the function definition that require implementation.</p> +<ul> +<li>Computing the return field from the arguments</li> +<li>Invocation</li> +</ul> +<p>During planning, we will attempt to call the function <code>return_field_from_args()</code>. This will +provide a list of input fields to the function and return the output field. To evaluate +metadata on the input side, you can write a function similar to this example:</p> +<pre><code class="language-rust">fn return_field_from_args( + &amp;self, + args: ReturnFieldArgs, +) -&gt; datafusion::common::Result&lt;FieldRef&gt; { + if args.arg_fields.len() != 1 { + return exec_err!("Incorrect number of arguments for uuid_version"); + } + + let input_field = &amp;args.arg_fields[0]; + if &amp;DataType::FixedSizeBinary(16) == input_field.data_type() { + let Ok(CanonicalExtensionType::Uuid(_)) = input_field.try_canonical_extension_type() + else { + return exec_err!("Input field must contain the UUID canonical extension type"); + }; + } + + Ok(Arc::new(Field::new(self.name(), DataType::UInt32, true))) +} +</code></pre> +<p>In this example, we take advantage of the fact that we already have support for extension +types that evaluate metadata. 
If you wanted to check for metadata other than +an extension type, you could instead write a snippet such as:</p> +<pre><code class="language-rust"> if &amp;DataType::FixedSizeBinary(16) == input_field.data_type() { + let _ = input_field + .metadata() + .get("ARROW:extension:metadata") + .ok_or(exec_datafusion_err!("Input field must contain the UUID canonical extension type"))?; + } +</code></pre> +<p>If you are writing a user defined function that will instead return metadata on output, +you can add this directly into the <code>Field</code> that is the output of the <code>return_field_from_args</code> +call. In our above example, we could change the return line to:</p> +<pre><code class="language-rust"> Ok(Arc::new( + Field::new(self.name(), DataType::UInt32, true).with_metadata( + [("my_key".to_string(), "my_value".to_string())] + .into_iter() + .collect(), + ), + )) +</code></pre> +<p>By checking the metadata during the planning process, we can identify errors early in +the query process. There are cases where we wish to have access to this metadata during +execution as well. The function <code>invoke_with_args</code> in the user defined function takes +the updated struct <code>ScalarFunctionArgs</code>. This now contains the input fields, which can +be used to check for metadata. For example, you can do the following:</p> +<pre><code class="language-rust">fn invoke_with_args(&amp;self, args: ScalarFunctionArgs) -&gt; Result&lt;ColumnarValue&gt; { + assert_eq!(args.arg_fields.len(), 1); + let my_value = args.arg_fields[0] + .metadata() + .get("encoding_type"); + ... +</code></pre> +<p>In this snippet we have extracted an <code>Option&lt;&amp;String&gt;</code> from the input field metadata +which we can then use to determine which functions we might want to call. We could +then parse the returned value to determine what type of encoding to use when +evaluating the array in the arguments. 
Since <code>return_field_from_args</code> does not take <code>&amp;mut self</code>, +this check could not be performed during the planning stage.</p> +<p>The description in this section applies to scalar user defined functions, but equivalent +support exists for aggregate and window functions.</p> +<h2>Extension types</h2> +<p>Extension types are one of the primary motivations for this enhancement in +<a href="https://datafusion.apache.org/blog/2025/07/16/datafusion-48.0.0/">DataFusion 48.0.0</a>. The official Rust implementation of Apache Arrow, <a href="https://github.com/apache/arrow-rs">arrow-rs</a>, +already contains support for the <a href="https://arrow.apache.org/docs/format/CanonicalExtensions.html">canonical extension types</a>. This support includes +helper functions such as <code>try_canonical_extension_type()</code>, used in the earlier example.</p> +<p>For a concrete example of how extension types can be used in DataFusion functions, +there is an <a href="https://github.com/timsaucer/datafusion_extension_type_examples">example repository</a> that demonstrates using UUIDs. The UUID extension +type specifies that the data are stored as a Fixed Size Binary of length 16. In the +DataFusion core functions, we have the ability to generate string representations of +UUIDs that match the version 4 specification. These are helpful, but a user may +wish to do additional work with UUIDs where having them in the dense representation +is preferable. Alternatively, the user may already have data with the binary encoding +and want to extract values such as the version, timestamp, or string +representation.</p> +<p>In the example repository we have created three user defined functions: <code>UuidVersion</code>, +<code>StringToUuid</code>, and <code>UuidToString</code>. 
Each of these implements <code>ScalarUDFImpl</code> and can +be used thusly:</p> +<pre><code class="language-rust">async fn main() -&gt; Result&lt;()&gt; { + let ctx = create_context()?; + + // get a DataFrame from the context + let mut df = ctx.table("t").await?; + + // Create the string UUIDs + df = df.select(vec![uuid().alias("string_uuid")])?; + + // Convert string UUIDs to canonical extension UUIDs + let string_to_uuid = ScalarUDF::new_from_impl(StringToUuid::default()); + df = df.with_column("uuid", string_to_uuid.call(vec![col("string_uuid")]))?; + + // Extract version number from canonical extension UUIDs + let version = ScalarUDF::new_from_impl(UuidVersion::default()); + df = df.with_column("version", version.call(vec![col("uuid")]))?; + + // Convert back to a string + let uuid_to_string = ScalarUDF::new_from_impl(UuidToString::default()); + df = df.with_column("string_round_trip", uuid_to_string.call(vec![col("uuid")]))?; + + df.show().await?; + + Ok(()) +} +</code></pre> +<p>The <a href="https://github.com/timsaucer/datafusion_extension_type_examples">example repository</a> also contains a crate that demonstrates how to expose these +UDFs to <a href="https://datafusion.apache.org/python/">datafusion-python</a>. This requires version 48.0.0 or later.</p> +<h2>Thanks to our sponsor</h2> +<p>We would like to thank <a href="https://rerun.io">Rerun.io</a> for sponsoring the development of this work. <a href="https://rerun.io">Rerun.io</a> +is building a data visualization system for Physical AI and uses metadata to specify +context about columns in Arrow record batches.</p> +<h2>Conclusion</h2> +<p>The enhancements to the metadata handling in <a href="https://datafusion.apache.org/blog/2025/07/16/datafusion-48.0.0/">DataFusion 48.0.0</a> are a significant step +forward in the ability to handle more interesting types of data. We can validate the input +data matches not only the data types but also the intent of the data to be processed. 
We +can enable complex operations on binary data because we understand the encoding used. We +can also use metadata to create new and interesting user defined data types. </p> +<h2>Get Involved</h2> +<p>The DataFusion team is an active and engaging community and we would love to have you join +us and help the project.</p> +<p>Here are some ways to get involved:</p> +<ul> +<li>Learn more by visiting the <a href="https://datafusion.apache.org/index.html">DataFusion</a> project page.</li> +<li>Try out the project and provide feedback, file issues, and contribute code.</li> +<li>Work on a <a href="https://github.com/apache/datafusion/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22">good first issue</a>.</li> +<li>Reach out to us via the <a href="https://datafusion.apache.org/contributor-guide/communication.html">communication doc</a>.</li> +</ul></content><category term="blog"></category></entry><entry><title>Apache DataFusion 49.0.0 Released</title><link href="https://datafusion.apache.org/blog/2025/07/28/datafusion-49.0.0" rel="alternate"></link><published>2025-07-28T00:00:00+00:00</published><updated>2025-07-28T00:00:00+00:00</updated><author><name>pmc</name></author><id>tag:datafusion.apache.org,2025-07-28:/blog/2025/07/28/datafusion-49.0.0</id><summary type="html"><!-- {% comment %} Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements. See the NOTICE file distributed with @@ -2374,224 +2591,7 @@ community</a>. 
We welcome first time contributors as well as long time par to the fun of building a database together.</p> <h2>Notes</h2> <p><a id="footnote7"></a><sup>[7]</sup> See <a href="https://btw-2015.informatik.uni-hamburg.de/res/proceedings/Hauptband/Wiss/Neumann-Unnesting_Arbitrary_Querie.pdf">Unnesting Arbitrary Queries</a> from Neumann and Kemper for a more academic treatment.</p> -<p><a id="footnote8"></a><sup>[8]</sup> One of my favorite terms I learned from Andy Pavlo&rsquo;s CMU online lectures</p></content><category term="blog"></category></entry><entry><title>Field metadata and extension type support in user defined functions</title><link href="https://datafusion.apache.org/blog/2025/06/09/metadata-handling" rel="alternate"></link><published>2025-06-09T00:00:00+00:00</published><updated>2025-06-09T00:00:00+00:00</update [...] -{% comment %} -Licensed to the Apache Software Foundation (ASF) under one or more -contributor license agreements. See the NOTICE file distributed with -this work for additional information regarding copyright ownership. -The ASF licenses this file to you under the Apache License, Version 2.0 -(the "License"); you may not use this file except in compliance with -the License. You may obtain a copy of the License at -http://www.apache.org/licenses/LICENSE-2.0 -Unless required by applicable law or agreed to in writing, software -distributed under the License is distributed on an "AS IS" BASIS, -WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -See the License for the specific language governing permissions and -limitations under the License. -{% endcomment %}x ---> -<p><a href="https://datafusion.apache.org/blog/2025/07/16/datafusion-48.0.0/">DataFusion 48.0.0</a> introduced a change in the interface for writing custom functions -which enables a variety of interesting improvements. 
Now users can access metadata on -the input columns to functions and produce metadata in the output.</p> -<p>Metadata is specified as a map of key-value pairs of strings. This …</p></summary><content type="html"><!-- -{% comment %} -Licensed to the Apache Software Foundation (ASF) under one or more -contributor license agreements. See the NOTICE file distributed with -this work for additional information regarding copyright ownership. -The ASF licenses this file to you under the Apache License, Version 2.0 -(the "License"); you may not use this file except in compliance with -the License. You may obtain a copy of the License at -http://www.apache.org/licenses/LICENSE-2.0 -Unless required by applicable law or agreed to in writing, software -distributed under the License is distributed on an "AS IS" BASIS, -WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -See the License for the specific language governing permissions and -limitations under the License. -{% endcomment %}x ---> -<p><a href="https://datafusion.apache.org/blog/2025/07/16/datafusion-48.0.0/">DataFusion 48.0.0</a> introduced a change in the interface for writing custom functions -which enables a variety of interesting improvements. Now users can access metadata on -the input columns to functions and produce metadata in the output.</p> -<p>Metadata is specified as a map of key-value pairs of strings. This extra metadata is used -by Arrow implementations implement <a href="https://arrow.apache.org/docs/format/Columnar.html#format-metadata-extension-types">extension types</a> and can also be used to add -use case-specific context to a column of values where the formality of an extension type -is not required. In previous versions of DataFusion field metadata was propagated through -certain operations (e.g., renaming or selecting a column) but was not accessible to others -(e.g., scalar, window, or aggregate function calls). 
In the new implementation, during -processing of all user defined functions we pass the input field information and allow -user defined function implementations to return field information to the caller.</p> -<p><a href="https://arrow.apache.org/docs/format/Columnar.html#format-metadata-extension-types">Extension types</a> are user defined data types where the data is stored using one of the -existing <a href="https://arrow.apache.org/docs/format/Columnar.html#data-types">Arrow data types</a> but the metadata specifies how we are to interpret the -stored data. The use of extension types was one of the primary motivations for adding -metadata to the function processing, but arbitrary metadata can be put on the input and -output fields. This allows for a range of other interesting use cases.</p> -<h2>Why metadata handling is important</h2> -<p>Data in Arrow record batches carry a <code>Schema</code> in addition to the Arrow arrays. Each -<a href="https://arrow.apache.org/docs/format/Glossary.html#term-field">Field</a> in this <code>Schema</code> contains a name, data type, nullability, and metadata. The -metadata is specified as a map of key-value pairs of strings. In the new -implementation, during processing of all user defined functions we pass the input -field information.</p> -<figure> -<img alt="Relationship between a Record Batch, it's schema, and the underlying arrays. There is a one to one relationship between each Field in the Schema and Array entry in the Columns." class="img-responsive" src="/blog/images/metadata-handling/arrow_record_batch.png" width="100%"/> -<figcaption> - Relationship between a Record Batch, it's schema, and the underlying arrays. There is a one to one relationship between each Field in the Schema and Array entry in the Columns. - </figcaption> -</figure> -<p>It is often desirable to write a generic function for reuse. 
With the prior version of -user defined functions, we only had access to the <code>DataType</code> of the input columns. This -works well for some features that only rely on the types of data. Other use cases may -need additional information that describes the data.</p> -<p>For example, suppose I wish to write a function that takes in a UUID and returns a string -of the <a href="https://www.ietf.org/rfc/rfc9562.html#section-4.1">variant</a> of the input field. We would want this function to be able to handle -all of the string types and also a binary encoded UUID. The arrow specification does not -contain a unsigned 128 bit value, it is common to encode a UUID as a fixed sized binary -array where each element is 16 bytes long. With the metadata handling in <a href="https://datafusion.apache.org/blog/2025/07/16/datafusion-48.0.0/">DataFusion 48.0.0</a> -we can validate during planning that the input data not only has the correct underlying -data type, but that it also represents the right <em>kind</em> of data. The UUID example is a -common one, and it is included in the <a href="https://arrow.apache.org/docs/format/CanonicalExtensions.html">canonical extension types</a> that are now -supported in DataFusion.</p> -<p>Another common application of metadata handling is understanding encoding of a blob of data. -Suppose you have a column that contains image data. Most likely this data is stored as -an array of <code>u8</code> data. Without knowing a priori what the encoding of that blob of data is, -you cannot ensure you are using the correct methods for decoding it. You may work around -this by adding another column to your data source indicating the encoding, but this can be -wasteful for systems where the encoding never changes. 
Instead, you could use metadata to -specify the encoding for the entire column.</p> -<h2>How to use metadata in user defined functions</h2> -<p>When working with metadata for user defined scalar functions, there are typically two -places in the function definition that require implementation.</p> -<ul> -<li>Computing the return field from the arguments</li> -<li>Invocation</li> -</ul> -<p>During planning, we will attempt to call the function <code>return_field_from_args()</code>. This will -provide a list of input fields to the function and return the output field. To evaluate -metadata on the input side, you can write a functions similar to this example:</p> -<pre><code class="language-rust">fn return_field_from_args( - &amp;self, - args: ReturnFieldArgs, -) -&gt; datafusion::common::Result&lt;FieldRef&gt; { - if args.arg_fields.len() != 1 { - return exec_err!("Incorrect number of arguments for uuid_version"); - } - - let input_field = &amp;args.arg_fields[0]; - if &amp;DataType::FixedSizeBinary(16) == input_field.data_type() { - let Ok(CanonicalExtensionType::Uuid(_)) = input_field.try_canonical_extension_type() - else { - return exec_err!("Input field must contain the UUID canonical extension type"); - }; - } - - Ok(Arc::new(Field::new(self.name(), DataType::UInt32, true))) -} -</code></pre> -<p>In this example, we take advantage of the fact that we already have support for extension -types that evaluate metadata. 
If you were attempting to check for metadata other than -extension type support, we could have instead written a snippet such as:</p> -<pre><code class="language-rust"> if &amp;DataType::FixedSizeBinary(16) == input_field.data_type() { - let _ = input_field - .metadata() - .get("ARROW:extension:metadata") - .ok_or(exec_datafusion_err!("Input field must contain the UUID canonical extension type"))?; - }; - } -</code></pre> -<p>If you are writing a user defined function that will instead return metadata on output -you can add this directly into the <code>Field</code> that is the output of the <code>return_field_from_args</code> -call. In our above example, we could change the return line to:</p> -<pre><code class="language-rust"> Ok(Arc::new( - Field::new(self.name(), DataType::UInt32, true).with_metadata( - [("my_key".to_string(), "my_value".to_string())] - .into_iter() - .collect(), - ), - )) -</code></pre> -<p>By checking the metadata during the planning process, we can identify errors early in -the query process. There are cases were we wish to have access to this metadata during -execution as well. The function <code>invoke_with_args</code> in the user defined function takes -the updated struct <code>ScalarFunctionArgs</code>. This now contains the input fields, which can -be used to check for metadata. For example, you can do the following:</p> -<pre><code class="language-rust">fn invoke_with_args(&amp;self, args: ScalarFunctionArgs) -&gt; Result&lt;ColumnarValue&gt; { - assert_eq!(args.arg_fields.len(), 1); - let my_value = args.arg_fields[0] - .metadata() - .get("encoding_type"); - ... -</code></pre> -<p>In this snippet we have extracted an <code>Option&lt;String&gt;</code> from the input field metadata -which we can then use to determine which functions we might want to call. We could -then parse the returned value to determine what type of encoding to use when -evaluating the array in the arguments. 
Since <code>return_field_from_args</code> is not <code>&amp;mut self</code> -this check could not be performed during the planning stage.</p> -<p>The description in this section applies to scalar user defined functions, but equivalent -support exists for aggregate and window functions.</p> -<h2>Extension types</h2> -<p>Extension types are one of the primary motivations for this enhancement in -<a href="https://datafusion.apache.org/blog/2025/07/16/datafusion-48.0.0/">Datafusion 48.0.0</a>. The official Rust implementation of Apache Arrow, <a href="https://github.com/apache/arrow-rs">arrow-rs</a>, -already contains support for the <a href="https://arrow.apache.org/docs/format/CanonicalExtensions.html">canonical extension types</a>. This support includes -helper functions such as <code>try_canonical_extension_type()</code> in the earlier example.</p> -<p>For a concrete example of how extension types can be used in DataFusion functions, -there is an <a href="https://github.com/timsaucer/datafusion_extension_type_examples">example repository</a> that demonstrates using UUIDs. The UUID extension -type specifies that the data are stored as a Fixed Size Binary of length 16. In the -DataFusion core functions, we have the ability to generate string representations of -UUIDs that match the version 4 specification. These are helpful, but a user may -wish to do additional work with UUIDs where having them in the dense representation -is preferable. Alternatively, the user may already have data with the binary encoding -and we want to extract values such as the version, timestamp, or string -representation.</p> -<p>In the example repository we have created three user defined functions: <code>UuidVersion</code>, -<code>StringToUuid</code>, and <code>UuidToString</code>. 
Each of these implements <code>ScalarUDFImpl</code> and can -be used thusly:</p> -<pre><code class="language-rust">async fn main() -&gt; Result&lt;()&gt; { - let ctx = create_context()?; - - // get a DataFrame from the context - let mut df = ctx.table("t").await?; - - // Create the string UUIDs - df = df.select(vec![uuid().alias("string_uuid")])?; - - // Convert string UUIDs to canonical extension UUIDs - let string_to_uuid = ScalarUDF::new_from_impl(StringToUuid::default()); - df = df.with_column("uuid", string_to_uuid.call(vec![col("string_uuid")]))?; - - // Extract version number from canonical extension UUIDs - let version = ScalarUDF::new_from_impl(UuidVersion::default()); - df = df.with_column("version", version.call(vec![col("uuid")]))?; - - // Convert back to a string - let uuid_to_string = ScalarUDF::new_from_impl(UuidToString::default()); - df = df.with_column("string_round_trip", uuid_to_string.call(vec![col("uuid")]))?; - - df.show().await?; - - Ok(()) -} -</code></pre> -<p>The <a href="https://github.com/timsaucer/datafusion_extension_type_examples">example repository</a> also contains a crate that demonstrates how to expose these -UDFs to <a href="https://datafusion.apache.org/python/">datafusion-python</a>. This requires version 48.0.0 or later.</p> -<h2>Thanks to our sponsor</h2> -<p>We would like to thank <a href="https://rerun.io">Rerun.io</a> for sponsoring the development of this work. <a href="https://rerun.io">Rerun.io</a> -is building a data visualization system for Physical AI and uses metadata to specify -context about columns in Arrow record batches.</p> -<h2>Conclusion</h2> -<p>The enhancements to the metadata handling in <a href="https://datafusion.apache.org/blog/2025/07/16/datafusion-48.0.0/">DataFusion 48.0.0</a> are a significant step -forward in the ability to handle more interesting types of data. We can validate the input -data matches not only the data types but also the intent of the data to be processed. 
We -can enable complex operations on binary data because we understand the encoding used. We -can also use metadata to create new and interesting user defined data types. </p> -<h2>Get Involved</h2> -<p>The DataFusion team is an active and engaging community and we would love to have you join -us and help the project.</p> -<p>Here are some ways to get involved:</p> -<ul> -<li>Learn more by visiting the <a href="https://datafusion.apache.org/index.html">DataFusion</a> project page.</li> -<li>Try out the project and provide feedback, file issues, and contribute code.</li> -<li>Work on a <a href="https://github.com/apache/datafusion/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22">good first issue</a>.</li> -<li>Reach out to us via the <a href="https://datafusion.apache.org/contributor-guide/communication.html">communication doc</a>.</li> -</ul></content><category term="blog"></category></entry><entry><title>Apache DataFusion Comet 0.8.0 Release</title><link href="https://datafusion.apache.org/blog/2025/05/06/datafusion-comet-0.8.0" rel="alternate"></link><published>2025-05-06T00:00:00+00:00</published><updated>2025-05-06T00:00:00+00:00</updated><author><name>pmc</name></author><id>tag:datafusion.apache.org,2025-05-06:/blog/2025/05/06/datafusion-comet-0.8.0</id><summary type="html"><!-- +<p><a id="footnote8"></a><sup>[8]</sup> One of my favorite terms I learned from Andy Pavlo&rsquo;s CMU online lectures</p></content><category term="blog"></category></entry><entry><title>Apache DataFusion Comet 0.8.0 Release</title><link href="https://datafusion.apache.org/blog/2025/05/06/datafusion-comet-0.8.0" rel="alternate"></link><published>2025-05-06T00:00:00+00:00</published><updated>2025-05-06T00:00:00+00:00</updated><author><name>pmc</name [...] {% comment %} Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements. 
See the NOTICE file distributed with diff --git a/blog/feeds/tim-saucer-dewey-dunnington-andrew-lamb.atom.xml b/blog/feeds/tim-saucer-dewey-dunnington-andrew-lamb.atom.xml index e221645..a550385 100644 --- a/blog/feeds/tim-saucer-dewey-dunnington-andrew-lamb.atom.xml +++ b/blog/feeds/tim-saucer-dewey-dunnington-andrew-lamb.atom.xml @@ -1,5 +1,5 @@ <?xml version="1.0" encoding="utf-8"?> -<feed xmlns="http://www.w3.org/2005/Atom"><title>Apache DataFusion Blog - Tim Saucer, Dewey Dunnington, Andrew Lamb</title><link href="https://datafusion.apache.org/blog/" rel="alternate"></link><link href="https://datafusion.apache.org/blog/feeds/tim-saucer-dewey-dunnington-andrew-lamb.atom.xml" rel="self"></link><id>https://datafusion.apache.org/blog/</id><updated>2025-06-09T00:00:00+00:00</updated><subtitle></subtitle><entry><title>Field metadata and extension type support in user def [...] +<feed xmlns="http://www.w3.org/2005/Atom"><title>Apache DataFusion Blog - Tim Saucer, Dewey Dunnington, Andrew Lamb</title><link href="https://datafusion.apache.org/blog/" rel="alternate"></link><link href="https://datafusion.apache.org/blog/feeds/tim-saucer-dewey-dunnington-andrew-lamb.atom.xml" rel="self"></link><id>https://datafusion.apache.org/blog/</id><updated>2025-07-29T00:00:00+00:00</updated><subtitle></subtitle><entry><title>Field metadata and extension type support in user def [...] {% comment %} Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements. 
See the NOTICE file distributed with diff --git a/blog/feeds/tim-saucer-dewey-dunnington-andrew-lamb.rss.xml b/blog/feeds/tim-saucer-dewey-dunnington-andrew-lamb.rss.xml index 5606b92..e923022 100644 --- a/blog/feeds/tim-saucer-dewey-dunnington-andrew-lamb.rss.xml +++ b/blog/feeds/tim-saucer-dewey-dunnington-andrew-lamb.rss.xml @@ -1,5 +1,5 @@ <?xml version="1.0" encoding="utf-8"?> -<rss version="2.0"><channel><title>Apache DataFusion Blog - Tim Saucer, Dewey Dunnington, Andrew Lamb</title><link>https://datafusion.apache.org/blog/</link><description></description><lastBuildDate>Mon, 09 Jun 2025 00:00:00 +0000</lastBuildDate><item><title>Field metadata and extension type support in user defined functions</title><link>https://datafusion.apache.org/blog/2025/06/09/metadata-handling</link><description><!-- +<rss version="2.0"><channel><title>Apache DataFusion Blog - Tim Saucer, Dewey Dunnington, Andrew Lamb</title><link>https://datafusion.apache.org/blog/</link><description></description><lastBuildDate>Tue, 29 Jul 2025 00:00:00 +0000</lastBuildDate><item><title>Field metadata and extension type support in user defined functions</title><link>https://datafusion.apache.org/blog/2025/07/29/metadata-handling</link><description><!-- {% comment %} Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements. See the NOTICE file distributed with @@ -18,4 +18,4 @@ limitations under the License. <p><a href="https://datafusion.apache.org/blog/2025/07/16/datafusion-48.0.0/">DataFusion 48.0.0</a> introduced a change in the interface for writing custom functions which enables a variety of interesting improvements. Now users can access metadata on the input columns to functions and produce metadata in the output.</p> -<p>Metadata is specified as a map of key-value pairs of strings. 
This …</p></description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Tim Saucer, Dewey Dunnington, Andrew Lamb</dc:creator><pubDate>Mon, 09 Jun 2025 00:00:00 +0000</pubDate><guid isPermaLink="false">tag:datafusion.apache.org,2025-06-09:/blog/2025/06/09/metadata-handling</guid><category>blog</category></item></channel></rss> \ No newline at end of file +<p>Metadata is specified as a map of key-value pairs of strings. This …</p></description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Tim Saucer, Dewey Dunnington, Andrew Lamb</dc:creator><pubDate>Tue, 29 Jul 2025 00:00:00 +0000</pubDate><guid isPermaLink="false">tag:datafusion.apache.org,2025-07-29:/blog/2025/07/29/metadata-handling</guid><category>blog</category></item></channel></rss> \ No newline at end of file diff --git a/blog/index.html b/blog/index.html index 21c06dc..3fb78c7 100644 --- a/blog/index.html +++ b/blog/index.html @@ -44,6 +44,44 @@ <p><i>Here you can find the latest updates from DataFusion and related projects.</i></p> + <!-- Post --> + <div class="row"> + <div class="callout"> + <article class="post"> + <header> + <div class="title"> + <h1><a href="/blog/2025/07/29/metadata-handling">Field metadata and extension type support in user defined functions</a></h1> + <p>Posted on: Tue 29 July 2025 by Tim Saucer, Dewey Dunnington, Andrew Lamb</p> + <p><!-- +{% comment %} +Licensed to the Apache Software Foundation (ASF) under one or more +contributor license agreements. See the NOTICE file distributed with +this work for additional information regarding copyright ownership. +The ASF licenses this file to you under the Apache License, Version 2.0 +(the "License"); you may not use this file except in compliance with +the License. 
You may obtain a copy of the License at +http://www.apache.org/licenses/LICENSE-2.0 +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. +{% endcomment %}x +--> +<p><a href="https://datafusion.apache.org/blog/2025/07/16/datafusion-48.0.0/">DataFusion 48.0.0</a> introduced a change in the interface for writing custom functions +which enables a variety of interesting improvements. Now users can access metadata on +the input columns to functions and produce metadata in the output.</p> +<p>Metadata is specified as a map of key-value pairs of strings. This …</p></p> + <footer> + <ul class="actions"> + <div style="text-align: right"><a href="/blog/2025/07/29/metadata-handling" class="button medium">Continue Reading</a></div> + </ul> + <ul class="stats"> + </ul> + </footer> + </article> + </div> + </div> <!-- Post --> <div class="row"> <div class="callout"> @@ -387,44 +425,6 @@ DataFusion</a> and …</p></p> </div> </div> <!-- Post --> - <div class="row"> - <div class="callout"> - <article class="post"> - <header> - <div class="title"> - <h1><a href="/blog/2025/06/09/metadata-handling">Field metadata and extension type support in user defined functions</a></h1> - <p>Posted on: Mon 09 June 2025 by Tim Saucer, Dewey Dunnington, Andrew Lamb</p> - <p><!-- -{% comment %} -Licensed to the Apache Software Foundation (ASF) under one or more -contributor license agreements. See the NOTICE file distributed with -this work for additional information regarding copyright ownership. -The ASF licenses this file to you under the Apache License, Version 2.0 -(the "License"); you may not use this file except in compliance with -the License. 
You may obtain a copy of the License at -http://www.apache.org/licenses/LICENSE-2.0 -Unless required by applicable law or agreed to in writing, software -distributed under the License is distributed on an "AS IS" BASIS, -WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -See the License for the specific language governing permissions and -limitations under the License. -{% endcomment %}x ---> -<p><a href="https://datafusion.apache.org/blog/2025/07/16/datafusion-48.0.0/">DataFusion 48.0.0</a> introduced a change in the interface for writing custom functions -which enables a variety of interesting improvements. Now users can access metadata on -the input columns to functions and produce metadata in the output.</p> -<p>Metadata is specified as a map of key-value pairs of strings. This …</p></p> - <footer> - <ul class="actions"> - <div style="text-align: right"><a href="/blog/2025/06/09/metadata-handling" class="button medium">Continue Reading</a></div> - </ul> - <ul class="stats"> - </ul> - </footer> - </article> - </div> - </div> - <!-- Post --> <div class="row"> <div class="callout"> <article class="post">