This is an automated email from the ASF dual-hosted git repository.
github-bot pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/datafusion-site.git
The following commit(s) were added to refs/heads/asf-site by this push:
new 6f9968b Commit build products
6f9968b is described below
commit 6f9968bc95e113ce2c11c57cb2e5c72b2e1ca434
Author: Build Pelican (action) <[email protected]>
AuthorDate: Fri Feb 6 01:47:45 2026 +0000
Commit build products
---
output/2022/02/28/datafusion-7.0.0/index.html | 2 +-
output/2023/01/19/datafusion-16.0.0/index.html | 2 +-
output/2024/01/19/datafusion-34.0.0/index.html | 2 +-
.../2024/08/20/python-datafusion-40.0.0/index.html | 2 +-
.../index.html | 4 ++--
.../datafusion-python-udf-comparisons/index.html | 8 +++----
.../2024/12/14/datafusion-python-43.1.0/index.html | 4 ++--
.../2025/03/30/datafusion-python-46.0.0/index.html | 2 +-
output/feeds/all-en.atom.xml | 26 +++++++++++-----------
output/feeds/blog.atom.xml | 26 +++++++++++-----------
output/feeds/pmc.atom.xml | 6 ++---
output/feeds/timsaucer.atom.xml | 16 ++++++-------
output/feeds/xiangpeng-hao-andrew-lamb.atom.xml | 4 ++--
13 files changed, 52 insertions(+), 52 deletions(-)
diff --git a/output/2022/02/28/datafusion-7.0.0/index.html b/output/2022/02/28/datafusion-7.0.0/index.html
index 808cc12..9067f72 100644
--- a/output/2022/02/28/datafusion-7.0.0/index.html
+++ b/output/2022/02/28/datafusion-7.0.0/index.html
@@ -125,7 +125,7 @@ git shortlog -sn 5.0.0..6.0.0 datafusion datafusion-cli
datafusion-examples | wc
<li>Switch from <code>std::sync::Mutex</code> to
<code>parking_lot::Mutex</code> <a
href="https://github.com/apache/arrow-datafusion/pull/1720">#1720</a></li>
<li>New Features</li>
<li>Support for memory tracking and spilling to disk<ul>
-<li>MemoryMananger and DiskManager <a
href="https://github.com/apache/arrow-datafusion/pull/1526">#1526</a></li>
+<li>MemoryManager and DiskManager <a
href="https://github.com/apache/arrow-datafusion/pull/1526">#1526</a></li>
<li>Out of core sort <a
href="https://github.com/apache/arrow-datafusion/pull/1526">#1526</a></li>
<li>New metrics</li>
<li><code>Gauge</code> and <code>CurrentMemoryUsage</code> <a
href="https://github.com/apache/arrow-datafusion/pull/1682">#1682</a></li>
diff --git a/output/2023/01/19/datafusion-16.0.0/index.html b/output/2023/01/19/datafusion-16.0.0/index.html
index ddfa5ae..103c43f 100644
--- a/output/2023/01/19/datafusion-16.0.0/index.html
+++ b/output/2023/01/19/datafusion-16.0.0/index.html
@@ -192,7 +192,7 @@ required synchronous access to all relevant catalog
information.</p>
<li>Automatic coercions ast between Date and Timestamp <a
href="https://github.com/apache/arrow-datafusion/issues/4726">#4726</a></li>
<li>Support type coercion for timestamp and utf8 <a
href="https://github.com/apache/arrow-datafusion/issues/4312">#4312</a></li>
<li>Full support for time32 and time64 literal values
(<code>ScalarValue</code>) <a
href="https://github.com/apache/arrow-datafusion/issues/4156">#4156</a></li>
-<li>New functions, incuding <code>uuid()</code> <a
href="https://github.com/apache/arrow-datafusion/issues/4041">#4041</a>,
<code>current_time</code> <a
href="https://github.com/apache/arrow-datafusion/issues/4054">#4054</a>,
<code>current_date</code> <a
href="https://github.com/apache/arrow-datafusion/issues/4022">#4022</a></li>
+<li>New functions, including <code>uuid()</code> <a
href="https://github.com/apache/arrow-datafusion/issues/4041">#4041</a>,
<code>current_time</code> <a
href="https://github.com/apache/arrow-datafusion/issues/4054">#4054</a>,
<code>current_date</code> <a
href="https://github.com/apache/arrow-datafusion/issues/4022">#4022</a></li>
<li>Compressed CSV/JSON support <a
href="https://github.com/apache/arrow-datafusion/issues/3642">#3642</a></li>
</ul>
<p>The community has also invested in new <a
href="https://github.com/apache/arrow-datafusion/blob/master/datafusion/core/tests/sqllogictests/README.md">sqllogic
based</a> tests to keep improving DataFusion's quality with less effort.</p>
diff --git a/output/2024/01/19/datafusion-34.0.0/index.html b/output/2024/01/19/datafusion-34.0.0/index.html
index 9740d49..25ea78d 100644
--- a/output/2024/01/19/datafusion-34.0.0/index.html
+++ b/output/2024/01/19/datafusion-34.0.0/index.html
@@ -256,7 +256,7 @@ LIMIT 3;
3 rows in set. Query took 0.053 seconds.
</code></pre>
<h3 id="growth-of-datafusion">Growth of DataFusion 📈<a class="headerlink"
href="#growth-of-datafusion" title="Permanent link">¶</a></h3>
-<p>DataFusion has been appearing more publically in the wild. For example
+<p>DataFusion has been appearing more publicly in the wild. For example
* New projects built using DataFusion such as <a
href="https://lancedb.com/">lancedb</a>, <a
href="https://glaredb.com/">GlareDB</a>, <a
href="https://www.arroyo.dev/">Arroyo</a>, and <a
href="https://github.com/cmu-db/optd">optd</a>.
* Public talks such as <a
href="https://www.youtube.com/watch?v=AJU9rdRNk9I">Apache Arrow Datafusion:
Vectorized
Execution Framework For Maximum Performance</a> in <a
href="https://www.bagevent.com/event/8432178">CommunityOverCode Asia 2023</a>
diff --git a/output/2024/08/20/python-datafusion-40.0.0/index.html b/output/2024/08/20/python-datafusion-40.0.0/index.html
index af3d973..6a0276e 100644
--- a/output/2024/08/20/python-datafusion-40.0.0/index.html
+++ b/output/2024/08/20/python-datafusion-40.0.0/index.html
@@ -105,7 +105,7 @@ to their Rust counterparts.</li>
<p>The most significant difference is that we have added wrapper functions and
classes for most of the
user facing interface. These wrappers, written in Python, contain both
documentation and type
annotations.</p>
-<p>This documenation is now available on the <a
href="https://datafusion.apache.org/python/autoapi/datafusion/index.html">DataFusion
in Python API</a> website. There you can browse
+<p>This documentation is now available on the <a
href="https://datafusion.apache.org/python/autoapi/datafusion/index.html">DataFusion
in Python API</a> website. There you can browse
the available functions and classes to see the breadth of available
functionality.</p>
<p>Modern IDEs use language servers such as
<a
href="https://marketplace.visualstudio.com/items?itemName=ms-python.vscode-pylance">Pylance</a>
or
diff --git a/output/2024/09/13/string-view-german-style-strings-part-2/index.html b/output/2024/09/13/string-view-german-style-strings-part-2/index.html
index 7712d8e..9f2ac71 100644
--- a/output/2024/09/13/string-view-german-style-strings-part-2/index.html
+++ b/output/2024/09/13/string-view-german-style-strings-part-2/index.html
@@ -107,8 +107,8 @@ Figure 1 illustrates the difference between the output of
both string representa
<h1 id="when-to-gc">When to GC?<a class="headerlink" href="#when-to-gc"
title="Permanent link">¶</a></h1>
<p>Zero-copy <code>take/filter</code> is great for generating large arrays
quickly, but it is suboptimal for highly selective filters, where most of the
strings are filtered out. When the cardinality drops, StringViewArray buffers
become sparse—only a small subset of the bytes in the buffer’s memory are
referred to by any <code>view</code>. This leads to excessive memory usage,
especially in a <a
href="https://github.com/apache/datafusion/issues/11628">filter-then-coalesce
scenario</a>. [...]
<p>To release unused memory, we implemented a <a
href="https://docs.rs/arrow/latest/arrow/array/struct.GenericByteViewArray.html#method.gc">garbage
collection (GC)</a> routine to consolidate the data into a new buffer to
release the old sparse buffer(s). As the GC operation copies strings, similarly
to StringArray, we must be careful about when to call it. If we call GC too
early, we cause unnecessary copying, losing much of the benefit of
StringViewArray. If we call GC too late, we hold [...]
-<p><code>arrow-rs</code> implements the GC process, but it is up to users to
decide when to call it. We leverage the semantics of the query engine and
observed that the <a
href="https://docs.rs/datafusion/latest/datafusion/physical_plan/coalesce_batches/struct.CoalesceBatchesExec.html"><code>CoalseceBatchesExec</code></a>
operator, which merge smaller batches to a larger batch, is often used after
the record cardinality is expected to shrink, which aligns perfectly with the
scenario of G [...]
-We, therefore,<a href="https://github.com/apache/datafusion/pull/11587">
implemented the GC procedure</a> inside <code>CoalseceBatchesExec</code>[^5]
with a heuristic that estimates when the buffers are too sparse.</p>
+<p><code>arrow-rs</code> implements the GC process, but it is up to users to
decide when to call it. We leverage the semantics of the query engine and
observed that the <a
href="https://docs.rs/datafusion/latest/datafusion/physical_plan/coalesce_batches/struct.CoalesceBatchesExec.html"><code>CoalesceBatchesExec</code></a>
operator, which merge smaller batches to a larger batch, is often used after
the record cardinality is expected to shrink, which aligns perfectly with the
scenario of G [...]
+We, therefore,<a href="https://github.com/apache/datafusion/pull/11587">
implemented the GC procedure</a> inside <code>CoalesceBatchesExec</code>[^5]
with a heuristic that estimates when the buffers are too sparse.</p>
<h2 id="the-art-of-function-inlining-not-too-much-not-too-little">The art of
function inlining: not too much, not too little<a class="headerlink"
href="#the-art-of-function-inlining-not-too-much-not-too-little"
title="Permanent link">¶</a></h2>
<p>Like string inlining, <em>function</em> inlining is the process of
embedding a short function into the caller to avoid the overhead of function
calls (caller/callee save).
Usually, the Rust compiler does a good job of deciding when to inline.
However, it is possible to override its default using the <a
href="https://doc.rust-lang.org/reference/attributes/codegen.html#the-inline-attribute"><code>#[inline(always)]</code>
directive</a>.
diff --git a/output/2024/11/19/datafusion-python-udf-comparisons/index.html b/output/2024/11/19/datafusion-python-udf-comparisons/index.html
index b6943f4..9a8443d 100644
--- a/output/2024/11/19/datafusion-python-udf-comparisons/index.html
+++ b/output/2024/11/19/datafusion-python-udf-comparisons/index.html
@@ -149,7 +149,7 @@ than a join can be significantly faster. This is worth
profiling for your specif
<p>I have a DataFrame with many values that I want to aggregate. I have
already analyzed it and
determined there is a noise level below which I do not want to include in my
analysis. I want to
compute a sum of only values that are above my noise threshold.</p>
-<p>This can be done fairly easy without leaning on a User Defined Aggegate
Function (UDAF). You can
+<p>This can be done fairly easy without leaning on a User Defined Aggregate
Function (UDAF). You can
simply filter the DataFrame and then aggregate using the built-in
<code>sum</code> function. Here, we
demonstrate doing this as a UDF primarily as an example of how to write UDAFs.
We will use the
PyArrow compute approach.</p>
@@ -310,7 +310,7 @@ Python, is to primarily demonstrate how to make the Python
to Rust with Python w
transition. In the second implementation you can see how we can iterate
through all of the arrays
ourselves.</p>
<p>In this first example, we are hard coding the values of interest, but in
the following section
-we demonstrate passing these in during initalization.</p>
+we demonstrate passing these in during initialization.</p>
<pre><code class="language-rust">#[pyfunction]
pub fn tuple_filter_fn(
py: Python<'_>,
@@ -533,13 +533,13 @@ how much they have ordered total. We want to ignore small
orders, which we defin
import pyarrow as pa
import pyarrow.compute as pc
-IGNORE_THESHOLD = 5000.0
+IGNORE_THRESHOLD = 5000.0
class AboveThresholdAccum(Accumulator):
def __init__(self) -> None:
self._sum = 0.0
def update(self, values: pa.Array) -> None:
- over_threshold = pc.greater(values, pa.scalar(IGNORE_THESHOLD))
+ over_threshold = pc.greater(values, pa.scalar(IGNORE_THRESHOLD))
sum_above = pc.sum(values.filter(over_threshold)).as_py()
if sum_above is None:
sum_above = 0.0
diff --git a/output/2024/12/14/datafusion-python-43.1.0/index.html b/output/2024/12/14/datafusion-python-43.1.0/index.html
index ac438a2..1d61bc6 100644
--- a/output/2024/12/14/datafusion-python-43.1.0/index.html
+++ b/output/2024/12/14/datafusion-python-43.1.0/index.html
@@ -95,7 +95,7 @@ consistent method for exposing these data structures across
libraries.</p>
<p>In <a href="https://github.com/apache/datafusion-python/pull/825">PR
#825</a>, we introduced support for both importing and exporting Arrow data in
<code>datafusion-python</code>. With this improvement, you can now use a
single function call to import
a table from <strong>any</strong> Python library that implements the <a
href="https://arrow.apache.org/docs/format/CDataInterface/PyCapsuleInterface.html">Arrow
PyCapsule Interface</a>.
-Many popular libaries, such as <a href="https://pandas.pydata.org/">Pandas</a>
and <a href="https://pola.rs/">Polars</a>
+Many popular libraries, such as <a
href="https://pandas.pydata.org/">Pandas</a> and <a
href="https://pola.rs/">Polars</a>
already support these interfaces.</p>
<p>Suppose you have a Pandas and Polars DataFrames named
<code>df_pandas</code> or <code>df_polars</code>, respectively:</p>
<pre><code class="language-python">ctx = SessionContext()
@@ -155,7 +155,7 @@ of the blog post describing how these enhancements can lead
to 20-200% performan
gains in some tests.</p>
<p>During our testing we identified some cases where we needed to adjust
workflows to
account for the fact that StringView is now the default type for string based
operations.
-First, when performing manipulations on string objects there is a perfomance
loss when
+First, when performing manipulations on string objects there is a performance
loss when
needing to cast from string to string view or vice versa. To reap the best
performance,
ideally all of your string type data will use StringView. For most users this
should be
transparent. However if you specify a schema for reading or creating data,
then you
diff --git a/output/2025/03/30/datafusion-python-46.0.0/index.html b/output/2025/03/30/datafusion-python-46.0.0/index.html
index b570e0f..7101db4 100644
--- a/output/2025/03/30/datafusion-python-46.0.0/index.html
+++ b/output/2025/03/30/datafusion-python-46.0.0/index.html
@@ -117,7 +117,7 @@ to register the view and then use it in another place:</p>
<pre><code class="language-python">ctx.register_view("view1", df1)
</code></pre>
<p>And then in another portion of your code which has access to the same
session context
-you can retrive the DataFrame with:</p>
+you can retrieve the DataFrame with:</p>
<pre><code>df2 = ctx.table("view1")
</code></pre>
<h2 id="asynchronous-iteration-of-record-batches">Asynchronous Iteration of
Record Batches<a class="headerlink"
href="#asynchronous-iteration-of-record-batches" title="Permanent
link">¶</a></h2>
diff --git a/output/feeds/all-en.atom.xml b/output/feeds/all-en.atom.xml
index b283d32..b42ac30 100644
--- a/output/feeds/all-en.atom.xml
+++ b/output/feeds/all-en.atom.xml
@@ -7056,7 +7056,7 @@ to register the view and then use it in another
place:</p>
<pre><code class="language-python">ctx.register_view("view1", df1)
</code></pre>
<p>And then in another portion of your code which has access to the same
session context
-you can retrive the DataFrame with:</p>
+you can retrieve the DataFrame with:</p>
<pre><code>df2 = ctx.table("view1")
</code></pre>
<h2 id="asynchronous-iteration-of-record-batches">Asynchronous Iteration
of Record Batches<a class="headerlink"
href="#asynchronous-iteration-of-record-batches" title="Permanent
link">¶</a></h2>
@@ -8690,7 +8690,7 @@ consistent method for exposing these data structures
across libraries.</p>
<p>In <a
href="https://github.com/apache/datafusion-python/pull/825">PR
#825</a>, we introduced support for both importing and exporting Arrow
data in
<code>datafusion-python</code>. With this improvement, you can now
use a single function call to import
a table from <strong>any</strong> Python library that implements
the <a
href="https://arrow.apache.org/docs/format/CDataInterface/PyCapsuleInterface.html">Arrow
PyCapsule Interface</a>.
-Many popular libaries, such as <a
href="https://pandas.pydata.org/">Pandas</a> and <a
href="https://pola.rs/">Polars</a>
+Many popular libraries, such as <a
href="https://pandas.pydata.org/">Pandas</a> and <a
href="https://pola.rs/">Polars</a>
already support these interfaces.</p>
<p>Suppose you have a Pandas and Polars DataFrames named
<code>df_pandas</code> or <code>df_polars</code>,
respectively:</p>
<pre><code class="language-python">ctx = SessionContext()
@@ -8750,7 +8750,7 @@ of the blog post describing how these enhancements can
lead to 20-200% performan
gains in some tests.</p>
<p>During our testing we identified some cases where we needed to adjust
workflows to
account for the fact that StringView is now the default type for string based
operations.
-First, when performing manipulations on string objects there is a perfomance
loss when
+First, when performing manipulations on string objects there is a performance
loss when
needing to cast from string to string view or vice versa. To reap the best
performance,
ideally all of your string type data will use StringView. For most users this
should be
transparent. However if you specify a schema for reading or creating data,
then you
@@ -8987,7 +8987,7 @@ than a join can be significantly faster. This is worth
profiling for your specif
<p>I have a DataFrame with many values that I want to aggregate. I have
already analyzed it and
determined there is a noise level below which I do not want to include in my
analysis. I want to
compute a sum of only values that are above my noise threshold.</p>
-<p>This can be done fairly easy without leaning on a User Defined
Aggegate Function (UDAF). You can
+<p>This can be done fairly easy without leaning on a User Defined
Aggregate Function (UDAF). You can
simply filter the DataFrame and then aggregate using the built-in
<code>sum</code> function. Here, we
demonstrate doing this as a UDF primarily as an example of how to write UDAFs.
We will use the
PyArrow compute approach.</p>
@@ -9148,7 +9148,7 @@ Python, is to primarily demonstrate how to make the
Python to Rust with Python w
transition. In the second implementation you can see how we can iterate
through all of the arrays
ourselves.</p>
<p>In this first example, we are hard coding the values of interest, but
in the following section
-we demonstrate passing these in during initalization.</p>
+we demonstrate passing these in during initialization.</p>
<pre><code class="language-rust">#[pyfunction]
pub fn tuple_filter_fn(
py: Python&lt;'_&gt;,
@@ -9371,13 +9371,13 @@ how much they have ordered total. We want to ignore
small orders, which we defin
import pyarrow as pa
import pyarrow.compute as pc
-IGNORE_THESHOLD = 5000.0
+IGNORE_THRESHOLD = 5000.0
class AboveThresholdAccum(Accumulator):
def __init__(self) -&gt; None:
self._sum = 0.0
def update(self, values: pa.Array) -&gt; None:
- over_threshold = pc.greater(values, pa.scalar(IGNORE_THESHOLD))
+ over_threshold = pc.greater(values, pa.scalar(IGNORE_THRESHOLD))
sum_above = pc.sum(values.filter(over_threshold)).as_py()
if sum_above is None:
sum_above = 0.0
@@ -9952,8 +9952,8 @@ Figure 1 illustrates the difference between the output of
both string representa
<h1 id="when-to-gc">When to GC?<a class="headerlink"
href="#when-to-gc" title="Permanent link">¶</a></h1>
<p>Zero-copy <code>take/filter</code> is great for
generating large arrays quickly, but it is suboptimal for highly selective
filters, where most of the strings are filtered out. When the cardinality
drops, StringViewArray buffers become sparse—only a small subset of the bytes
in the buffer’s memory are referred to by any <code>view</code>.
This leads to excessive memory usage, especially in a <a
href="https://github.com/apache/datafusion/issues/11628"> [...]
<p>To release unused memory, we implemented a <a
href="https://docs.rs/arrow/latest/arrow/array/struct.GenericByteViewArray.html#method.gc">garbage
collection (GC)</a> routine to consolidate the data into a new buffer to
release the old sparse buffer(s). As the GC operation copies strings, similarly
to StringArray, we must be careful about when to call it. If we call GC too
early, we cause unnecessary copying, losing much of the benefit of
StringViewArray. If we call GC [...]
-<p><code>arrow-rs</code> implements the GC process, but it
is up to users to decide when to call it. We leverage the semantics of the
query engine and observed that the <a
href="https://docs.rs/datafusion/latest/datafusion/physical_plan/coalesce_batches/struct.CoalesceBatchesExec.html"><code>CoalseceBatchesExec</code></a>
operator, which merge smaller batches to a larger batch, is often used after
the record cardinality is expected to shrink, whi [...]
-We, therefore,<a href="https://github.com/apache/datafusion/pull/11587">
implemented the GC procedure</a> inside
<code>CoalseceBatchesExec</code>[^5] with a heuristic that
estimates when the buffers are too sparse.</p>
+<p><code>arrow-rs</code> implements the GC process, but it
is up to users to decide when to call it. We leverage the semantics of the
query engine and observed that the <a
href="https://docs.rs/datafusion/latest/datafusion/physical_plan/coalesce_batches/struct.CoalesceBatchesExec.html"><code>CoalesceBatchesExec</code></a>
operator, which merge smaller batches to a larger batch, is often used after
the record cardinality is expected to shrink, whi [...]
+We, therefore,<a href="https://github.com/apache/datafusion/pull/11587">
implemented the GC procedure</a> inside
<code>CoalesceBatchesExec</code>[^5] with a heuristic that
estimates when the buffers are too sparse.</p>
<h2 id="the-art-of-function-inlining-not-too-much-not-too-little">The
art of function inlining: not too much, not too little<a class="headerlink"
href="#the-art-of-function-inlining-not-too-much-not-too-little"
title="Permanent link">¶</a></h2>
<p>Like string inlining, <em>function</em> inlining is the
process of embedding a short function into the caller to avoid the overhead of
function calls (caller/callee save).
Usually, the Rust compiler does a good job of deciding when to inline.
However, it is possible to override its default using the <a
href="https://doc.rust-lang.org/reference/attributes/codegen.html#the-inline-attribute"><code>#[inline(always)]</code>
directive</a>.
@@ -10152,7 +10152,7 @@ to their Rust counterparts.</li>
<p>The most significant difference is that we have added wrapper
functions and classes for most of the
user facing interface. These wrappers, written in Python, contain both
documentation and type
annotations.</p>
-<p>This documenation is now available on the <a
href="https://datafusion.apache.org/python/autoapi/datafusion/index.html">DataFusion
in Python API</a> website. There you can browse
+<p>This documentation is now available on the <a
href="https://datafusion.apache.org/python/autoapi/datafusion/index.html">DataFusion
in Python API</a> website. There you can browse
the available functions and classes to see the breadth of available
functionality.</p>
<p>Modern IDEs use language servers such as
<a
href="https://marketplace.visualstudio.com/items?itemName=ms-python.vscode-pylance">Pylance</a>
or
@@ -11076,7 +11076,7 @@ LIMIT 3;
3 rows in set. Query took 0.053 seconds.
</code></pre>
<h3 id="growth-of-datafusion">Growth of DataFusion 📈<a
class="headerlink" href="#growth-of-datafusion" title="Permanent
link">¶</a></h3>
-<p>DataFusion has been appearing more publically in the wild. For example
+<p>DataFusion has been appearing more publicly in the wild. For example
* New projects built using DataFusion such as <a
href="https://lancedb.com/">lancedb</a>, <a
href="https://glaredb.com/">GlareDB</a>, <a
href="https://www.arroyo.dev/">Arroyo</a>, and <a
href="https://github.com/cmu-db/optd">optd</a>.
* Public talks such as <a
href="https://www.youtube.com/watch?v=AJU9rdRNk9I">Apache Arrow Datafusion:
Vectorized
Execution Framework For Maximum Performance</a> in <a
href="https://www.bagevent.com/event/8432178">CommunityOverCode Asia
2023</a>
@@ -11828,7 +11828,7 @@ required synchronous access to all relevant catalog
information.</p>
<li>Automatic coercions ast between Date and Timestamp <a
href="https://github.com/apache/arrow-datafusion/issues/4726">#4726</a></li>
<li>Support type coercion for timestamp and utf8 <a
href="https://github.com/apache/arrow-datafusion/issues/4312">#4312</a></li>
<li>Full support for time32 and time64 literal values
(<code>ScalarValue</code>) <a
href="https://github.com/apache/arrow-datafusion/issues/4156">#4156</a></li>
-<li>New functions, incuding <code>uuid()</code> <a
href="https://github.com/apache/arrow-datafusion/issues/4041">#4041</a>,
<code>current_time</code> <a
href="https://github.com/apache/arrow-datafusion/issues/4054">#4054</a>,
<code>current_date</code> <a
href="https://github.com/apache/arrow-datafusion/issues/4022">#4022</a></li>
+<li>New functions, including <code>uuid()</code> <a
href="https://github.com/apache/arrow-datafusion/issues/4041">#4041</a>,
<code>current_time</code> <a
href="https://github.com/apache/arrow-datafusion/issues/4054">#4054</a>,
<code>current_date</code> <a
href="https://github.com/apache/arrow-datafusion/issues/4022">#4022</a></li>
<li>Compressed CSV/JSON support <a
href="https://github.com/apache/arrow-datafusion/issues/3642">#3642</a></li>
</ul>
<p>The community has also invested in new <a
href="https://github.com/apache/arrow-datafusion/blob/master/datafusion/core/tests/sqllogictests/README.md">sqllogic
based</a> tests to keep improving DataFusion's quality with less
effort.</p>
@@ -12680,7 +12680,7 @@ git shortlog -sn 5.0.0..6.0.0 datafusion datafusion-cli
datafusion-examples | wc
<li>Switch from <code>std::sync::Mutex</code> to
<code>parking_lot::Mutex</code> <a
href="https://github.com/apache/arrow-datafusion/pull/1720">#1720</a></li>
<li>New Features</li>
<li>Support for memory tracking and spilling to disk<ul>
-<li>MemoryMananger and DiskManager <a
href="https://github.com/apache/arrow-datafusion/pull/1526">#1526</a></li>
+<li>MemoryManager and DiskManager <a
href="https://github.com/apache/arrow-datafusion/pull/1526">#1526</a></li>
<li>Out of core sort <a
href="https://github.com/apache/arrow-datafusion/pull/1526">#1526</a></li>
<li>New metrics</li>
<li><code>Gauge</code> and
<code>CurrentMemoryUsage</code> <a
href="https://github.com/apache/arrow-datafusion/pull/1682">#1682</a></li>
diff --git a/output/feeds/blog.atom.xml b/output/feeds/blog.atom.xml
index aeeec57..df7fbb2 100644
--- a/output/feeds/blog.atom.xml
+++ b/output/feeds/blog.atom.xml
@@ -7056,7 +7056,7 @@ to register the view and then use it in another
place:</p>
<pre><code class="language-python">ctx.register_view("view1", df1)
</code></pre>
<p>And then in another portion of your code which has access to the same
session context
-you can retrive the DataFrame with:</p>
+you can retrieve the DataFrame with:</p>
<pre><code>df2 = ctx.table("view1")
</code></pre>
<h2 id="asynchronous-iteration-of-record-batches">Asynchronous Iteration
of Record Batches<a class="headerlink"
href="#asynchronous-iteration-of-record-batches" title="Permanent
link">¶</a></h2>
@@ -8690,7 +8690,7 @@ consistent method for exposing these data structures
across libraries.</p>
<p>In <a
href="https://github.com/apache/datafusion-python/pull/825">PR
#825</a>, we introduced support for both importing and exporting Arrow
data in
<code>datafusion-python</code>. With this improvement, you can now
use a single function call to import
a table from <strong>any</strong> Python library that implements
the <a
href="https://arrow.apache.org/docs/format/CDataInterface/PyCapsuleInterface.html">Arrow
PyCapsule Interface</a>.
-Many popular libaries, such as <a
href="https://pandas.pydata.org/">Pandas</a> and <a
href="https://pola.rs/">Polars</a>
+Many popular libraries, such as <a
href="https://pandas.pydata.org/">Pandas</a> and <a
href="https://pola.rs/">Polars</a>
already support these interfaces.</p>
<p>Suppose you have a Pandas and Polars DataFrames named
<code>df_pandas</code> or <code>df_polars</code>,
respectively:</p>
<pre><code class="language-python">ctx = SessionContext()
@@ -8750,7 +8750,7 @@ of the blog post describing how these enhancements can
lead to 20-200% performan
gains in some tests.</p>
<p>During our testing we identified some cases where we needed to adjust
workflows to
account for the fact that StringView is now the default type for string based
operations.
-First, when performing manipulations on string objects there is a perfomance
loss when
+First, when performing manipulations on string objects there is a performance
loss when
needing to cast from string to string view or vice versa. To reap the best
performance,
ideally all of your string type data will use StringView. For most users this
should be
transparent. However if you specify a schema for reading or creating data,
then you
@@ -8987,7 +8987,7 @@ than a join can be significantly faster. This is worth
profiling for your specif
<p>I have a DataFrame with many values that I want to aggregate. I have
already analyzed it and
determined there is a noise level below which I do not want to include in my
analysis. I want to
compute a sum of only values that are above my noise threshold.</p>
-<p>This can be done fairly easy without leaning on a User Defined
Aggegate Function (UDAF). You can
+<p>This can be done fairly easy without leaning on a User Defined
Aggregate Function (UDAF). You can
simply filter the DataFrame and then aggregate using the built-in
<code>sum</code> function. Here, we
demonstrate doing this as a UDF primarily as an example of how to write UDAFs.
We will use the
PyArrow compute approach.</p>
@@ -9148,7 +9148,7 @@ Python, is to primarily demonstrate how to make the
Python to Rust with Python w
transition. In the second implementation you can see how we can iterate
through all of the arrays
ourselves.</p>
<p>In this first example, we are hard coding the values of interest, but
in the following section
-we demonstrate passing these in during initalization.</p>
+we demonstrate passing these in during initialization.</p>
<pre><code class="language-rust">#[pyfunction]
pub fn tuple_filter_fn(
py: Python&lt;'_&gt;,
@@ -9371,13 +9371,13 @@ how much they have ordered total. We want to ignore
small orders, which we defin
import pyarrow as pa
import pyarrow.compute as pc
-IGNORE_THESHOLD = 5000.0
+IGNORE_THRESHOLD = 5000.0
class AboveThresholdAccum(Accumulator):
def __init__(self) -&gt; None:
self._sum = 0.0
def update(self, values: pa.Array) -&gt; None:
- over_threshold = pc.greater(values, pa.scalar(IGNORE_THESHOLD))
+ over_threshold = pc.greater(values, pa.scalar(IGNORE_THRESHOLD))
sum_above = pc.sum(values.filter(over_threshold)).as_py()
if sum_above is None:
sum_above = 0.0
@@ -9952,8 +9952,8 @@ Figure 1 illustrates the difference between the output of
both string representa
<h1 id="when-to-gc">When to GC?<a class="headerlink"
href="#when-to-gc" title="Permanent link">¶</a></h1>
<p>Zero-copy <code>take/filter</code> is great for
generating large arrays quickly, but it is suboptimal for highly selective
filters, where most of the strings are filtered out. When the cardinality
drops, StringViewArray buffers become sparse—only a small subset of the bytes
in the buffer’s memory are referred to by any <code>view</code>.
This leads to excessive memory usage, especially in a <a
href="https://github.com/apache/datafusion/issues/11628"> [...]
<p>To release unused memory, we implemented a <a
href="https://docs.rs/arrow/latest/arrow/array/struct.GenericByteViewArray.html#method.gc">garbage
collection (GC)</a> routine to consolidate the data into a new buffer to
release the old sparse buffer(s). As the GC operation copies strings, similarly
to StringArray, we must be careful about when to call it. If we call GC too
early, we cause unnecessary copying, losing much of the benefit of
StringViewArray. If we call GC [...]
-<p><code>arrow-rs</code> implements the GC process, but it
is up to users to decide when to call it. We leverage the semantics of the
query engine and observed that the <a
href="https://docs.rs/datafusion/latest/datafusion/physical_plan/coalesce_batches/struct.CoalesceBatchesExec.html"><code>CoalseceBatchesExec</code></a>
operator, which merge smaller batches to a larger batch, is often used after
the record cardinality is expected to shrink, whi [...]
-We, therefore,<a href="https://github.com/apache/datafusion/pull/11587">
implemented the GC procedure</a> inside
<code>CoalseceBatchesExec</code>[^5] with a heuristic that
estimates when the buffers are too sparse.</p>
+<p><code>arrow-rs</code> implements the GC process, but it
is up to users to decide when to call it. We leverage the semantics of the
query engine and observed that the <a
href="https://docs.rs/datafusion/latest/datafusion/physical_plan/coalesce_batches/struct.CoalesceBatchesExec.html"><code>CoalesceBatchesExec</code></a>
operator, which merge smaller batches to a larger batch, is often used after
the record cardinality is expected to shrink, whi [...]
+We, therefore,<a href="https://github.com/apache/datafusion/pull/11587">
implemented the GC procedure</a> inside
<code>CoalesceBatchesExec</code>[^5] with a heuristic that
estimates when the buffers are too sparse.</p>
<h2 id="the-art-of-function-inlining-not-too-much-not-too-little">The
art of function inlining: not too much, not too little<a class="headerlink"
href="#the-art-of-function-inlining-not-too-much-not-too-little"
title="Permanent link">¶</a></h2>
<p>Like string inlining, <em>function</em> inlining is the
process of embedding a short function into the caller to avoid the overhead of
function calls (caller/callee save).
Usually, the Rust compiler does a good job of deciding when to inline.
However, it is possible to override its default using the <a
href="https://doc.rust-lang.org/reference/attributes/codegen.html#the-inline-attribute"><code>#[inline(always)]</code>
directive</a>.
@@ -10152,7 +10152,7 @@ to their Rust counterparts.</li>
<p>The most significant difference is that we have added wrapper
functions and classes for most of the
user facing interface. These wrappers, written in Python, contain both
documentation and type
annotations.</p>
-<p>This documenation is now available on the <a
href="https://datafusion.apache.org/python/autoapi/datafusion/index.html">DataFusion
in Python API</a> website. There you can browse
+<p>This documentation is now available on the <a
href="https://datafusion.apache.org/python/autoapi/datafusion/index.html">DataFusion
in Python API</a> website. There you can browse
the available functions and classes to see the breadth of available
functionality.</p>
<p>Modern IDEs use language servers such as
<a
href="https://marketplace.visualstudio.com/items?itemName=ms-python.vscode-pylance">Pylance</a>
or
@@ -11076,7 +11076,7 @@ LIMIT 3;
3 rows in set. Query took 0.053 seconds.
</code></pre>
<h3 id="growth-of-datafusion">Growth of DataFusion 📈<a
class="headerlink" href="#growth-of-datafusion" title="Permanent
link">¶</a></h3>
-<p>DataFusion has been appearing more publically in the wild. For example
+<p>DataFusion has been appearing more publicly in the wild. For example
* New projects built using DataFusion such as <a
href="https://lancedb.com/">lancedb</a>, <a
href="https://glaredb.com/">GlareDB</a>, <a
href="https://www.arroyo.dev/">Arroyo</a>, and <a
href="https://github.com/cmu-db/optd">optd</a>.
* Public talks such as <a
href="https://www.youtube.com/watch?v=AJU9rdRNk9I">Apache Arrow Datafusion:
Vectorized
Execution Framework For Maximum Performance</a> in <a
href="https://www.bagevent.com/event/8432178">CommunityOverCode Asia
2023</a>
@@ -11828,7 +11828,7 @@ required synchronous access to all relevant catalog
information.</p>
<li>Automatic coercions ast between Date and Timestamp <a
href="https://github.com/apache/arrow-datafusion/issues/4726">#4726</a></li>
<li>Support type coercion for timestamp and utf8 <a
href="https://github.com/apache/arrow-datafusion/issues/4312">#4312</a></li>
<li>Full support for time32 and time64 literal values
(<code>ScalarValue</code>) <a
href="https://github.com/apache/arrow-datafusion/issues/4156">#4156</a></li>
-<li>New functions, incuding <code>uuid()</code> <a
href="https://github.com/apache/arrow-datafusion/issues/4041">#4041</a>,
<code>current_time</code> <a
href="https://github.com/apache/arrow-datafusion/issues/4054">#4054</a>,
<code>current_date</code> <a
href="https://github.com/apache/arrow-datafusion/issues/4022">#4022</a></li>
+<li>New functions, including <code>uuid()</code> <a
href="https://github.com/apache/arrow-datafusion/issues/4041">#4041</a>,
<code>current_time</code> <a
href="https://github.com/apache/arrow-datafusion/issues/4054">#4054</a>,
<code>current_date</code> <a
href="https://github.com/apache/arrow-datafusion/issues/4022">#4022</a></li>
<li>Compressed CSV/JSON support <a
href="https://github.com/apache/arrow-datafusion/issues/3642">#3642</a></li>
</ul>
<p>The community has also invested in new <a
href="https://github.com/apache/arrow-datafusion/blob/master/datafusion/core/tests/sqllogictests/README.md">sqllogic
based</a> tests to keep improving DataFusion's quality with less
effort.</p>
@@ -12680,7 +12680,7 @@ git shortlog -sn 5.0.0..6.0.0 datafusion datafusion-cli
datafusion-examples | wc
<li>Switch from <code>std::sync::Mutex</code> to
<code>parking_lot::Mutex</code> <a
href="https://github.com/apache/arrow-datafusion/pull/1720">#1720</a></li>
<li>New Features</li>
<li>Support for memory tracking and spilling to disk<ul>
-<li>MemoryMananger and DiskManager <a
href="https://github.com/apache/arrow-datafusion/pull/1526">#1526</a></li>
+<li>MemoryManager and DiskManager <a
href="https://github.com/apache/arrow-datafusion/pull/1526">#1526</a></li>
<li>Out of core sort <a
href="https://github.com/apache/arrow-datafusion/pull/1526">#1526</a></li>
<li>New metrics</li>
<li><code>Gauge</code> and
<code>CurrentMemoryUsage</code> <a
href="https://github.com/apache/arrow-datafusion/pull/1682">#1682</a></li>
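As an aside for readers skimming this diff: the `AboveThresholdAccum` hunk above shows a filtered-sum accumulator built on `pyarrow.compute`. The same logic can be sketched dependency-free in plain Python; the class and constant names mirror the diff (using the corrected `IGNORE_THRESHOLD` spelling), but this standalone version is illustrative only, with plain lists standing in for `pa.Array`:

```python
# Illustrative, dependency-free sketch of the filtered-sum accumulator
# shown in the hunk above. The real version uses pyarrow.compute
# (pc.greater + Array.filter + pc.sum); plain lists stand in here so
# the accumulation logic is easy to follow.

IGNORE_THRESHOLD = 5000.0  # the constant the commit renames (THESHOLD -> THRESHOLD)


class AboveThresholdAccum:
    """Accumulates the sum of values strictly greater than the threshold."""

    def __init__(self) -> None:
        self._sum = 0.0

    def update(self, values) -> None:
        # Equivalent of: pc.sum(values.filter(pc.greater(values, threshold)))
        self._sum += sum(v for v in values if v > IGNORE_THRESHOLD)

    def evaluate(self) -> float:
        return self._sum
```

For example, updating with `[100.0, 6000.0, 7500.0]` ignores the small order and accumulates 13500.0, matching what the pyarrow version computes per batch.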
diff --git a/output/feeds/pmc.atom.xml b/output/feeds/pmc.atom.xml
index 498ee90..470aa0e 100644
--- a/output/feeds/pmc.atom.xml
+++ b/output/feeds/pmc.atom.xml
@@ -3840,7 +3840,7 @@ LIMIT 3;
3 rows in set. Query took 0.053 seconds.
</code></pre>
<h3 id="growth-of-datafusion">Growth of DataFusion 📈<a
class="headerlink" href="#growth-of-datafusion" title="Permanent
link">¶</a></h3>
-<p>DataFusion has been appearing more publically in the wild. For example
+<p>DataFusion has been appearing more publicly in the wild. For example
* New projects built using DataFusion such as <a
href="https://lancedb.com/">lancedb</a>, <a
href="https://glaredb.com/">GlareDB</a>, <a
href="https://www.arroyo.dev/">Arroyo</a>, and <a
href="https://github.com/cmu-db/optd">optd</a>.
* Public talks such as <a
href="https://www.youtube.com/watch?v=AJU9rdRNk9I">Apache Arrow Datafusion:
Vectorized
Execution Framework For Maximum Performance</a> in <a
href="https://www.bagevent.com/event/8432178">CommunityOverCode Asia
2023</a>
@@ -4269,7 +4269,7 @@ required synchronous access to all relevant catalog
information.</p>
<li>Automatic coercions ast between Date and Timestamp <a
href="https://github.com/apache/arrow-datafusion/issues/4726">#4726</a></li>
<li>Support type coercion for timestamp and utf8 <a
href="https://github.com/apache/arrow-datafusion/issues/4312">#4312</a></li>
<li>Full support for time32 and time64 literal values
(<code>ScalarValue</code>) <a
href="https://github.com/apache/arrow-datafusion/issues/4156">#4156</a></li>
-<li>New functions, incuding <code>uuid()</code> <a
href="https://github.com/apache/arrow-datafusion/issues/4041">#4041</a>,
<code>current_time</code> <a
href="https://github.com/apache/arrow-datafusion/issues/4054">#4054</a>,
<code>current_date</code> <a
href="https://github.com/apache/arrow-datafusion/issues/4022">#4022</a></li>
+<li>New functions, including <code>uuid()</code> <a
href="https://github.com/apache/arrow-datafusion/issues/4041">#4041</a>,
<code>current_time</code> <a
href="https://github.com/apache/arrow-datafusion/issues/4054">#4054</a>,
<code>current_date</code> <a
href="https://github.com/apache/arrow-datafusion/issues/4022">#4022</a></li>
<li>Compressed CSV/JSON support <a
href="https://github.com/apache/arrow-datafusion/issues/3642">#3642</a></li>
</ul>
<p>The community has also invested in new <a
href="https://github.com/apache/arrow-datafusion/blob/master/datafusion/core/tests/sqllogictests/README.md">sqllogic
based</a> tests to keep improving DataFusion's quality with less
effort.</p>
@@ -5121,7 +5121,7 @@ git shortlog -sn 5.0.0..6.0.0 datafusion datafusion-cli
datafusion-examples | wc
<li>Switch from <code>std::sync::Mutex</code> to
<code>parking_lot::Mutex</code> <a
href="https://github.com/apache/arrow-datafusion/pull/1720">#1720</a></li>
<li>New Features</li>
<li>Support for memory tracking and spilling to disk<ul>
-<li>MemoryMananger and DiskManager <a
href="https://github.com/apache/arrow-datafusion/pull/1526">#1526</a></li>
+<li>MemoryManager and DiskManager <a
href="https://github.com/apache/arrow-datafusion/pull/1526">#1526</a></li>
<li>Out of core sort <a
href="https://github.com/apache/arrow-datafusion/pull/1526">#1526</a></li>
<li>New metrics</li>
<li><code>Gauge</code> and
<code>CurrentMemoryUsage</code> <a
href="https://github.com/apache/arrow-datafusion/pull/1682">#1682</a></li>
diff --git a/output/feeds/timsaucer.atom.xml b/output/feeds/timsaucer.atom.xml
index dab474a..268635c 100644
--- a/output/feeds/timsaucer.atom.xml
+++ b/output/feeds/timsaucer.atom.xml
@@ -75,7 +75,7 @@ to register the view and then use it in another
place:</p>
<pre><code class="language-python">ctx.register_view("view1", df1)
</code></pre>
<p>And then in another portion of your code which has access to the same
session context
-you can retrive the DataFrame with:</p>
+you can retrieve the DataFrame with:</p>
<pre><code>df2 = ctx.table("view1")
</code></pre>
<h2 id="asynchronous-iteration-of-record-batches">Asynchronous Iteration
of Record Batches<a class="headerlink"
href="#asynchronous-iteration-of-record-batches" title="Permanent
link">¶</a></h2>
@@ -275,7 +275,7 @@ consistent method for exposing these data structures across
libraries.</p>
<p>In <a
href="https://github.com/apache/datafusion-python/pull/825">PR
#825</a>, we introduced support for both importing and exporting Arrow
data in
<code>datafusion-python</code>. With this improvement, you can now
use a single function call to import
a table from <strong>any</strong> Python library that implements
the <a
href="https://arrow.apache.org/docs/format/CDataInterface/PyCapsuleInterface.html">Arrow
PyCapsule Interface</a>.
-Many popular libaries, such as <a
href="https://pandas.pydata.org/">Pandas</a> and <a
href="https://pola.rs/">Polars</a>
+Many popular libraries, such as <a
href="https://pandas.pydata.org/">Pandas</a> and <a
href="https://pola.rs/">Polars</a>
already support these interfaces.</p>
<p>Suppose you have a Pandas and Polars DataFrames named
<code>df_pandas</code> or <code>df_polars</code>,
respectively:</p>
<pre><code class="language-python">ctx = SessionContext()
@@ -335,7 +335,7 @@ of the blog post describing how these enhancements can lead
to 20-200% performan
gains in some tests.</p>
<p>During our testing we identified some cases where we needed to adjust
workflows to
account for the fact that StringView is now the default type for string based
operations.
-First, when performing manipulations on string objects there is a perfomance
loss when
+First, when performing manipulations on string objects there is a performance
loss when
needing to cast from string to string view or vice versa. To reap the best
performance,
ideally all of your string type data will use StringView. For most users this
should be
transparent. However if you specify a schema for reading or creating data,
then you
@@ -472,7 +472,7 @@ than a join can be significantly faster. This is worth
profiling for your specif
<p>I have a DataFrame with many values that I want to aggregate. I have
already analyzed it and
determined there is a noise level below which I do not want to include in my
analysis. I want to
compute a sum of only values that are above my noise threshold.</p>
-<p>This can be done fairly easy without leaning on a User Defined
Aggegate Function (UDAF). You can
+<p>This can be done fairly easy without leaning on a User Defined
Aggregate Function (UDAF). You can
simply filter the DataFrame and then aggregate using the built-in
<code>sum</code> function. Here, we
demonstrate doing this as a UDF primarily as an example of how to write UDAFs.
We will use the
PyArrow compute approach.</p>
@@ -633,7 +633,7 @@ Python, is to primarily demonstrate how to make the Python
to Rust with Python w
transition. In the second implementation you can see how we can iterate
through all of the arrays
ourselves.</p>
<p>In this first example, we are hard coding the values of interest, but
in the following section
-we demonstrate passing these in during initalization.</p>
+we demonstrate passing these in during initialization.</p>
<pre><code class="language-rust">#[pyfunction]
pub fn tuple_filter_fn(
py: Python&lt;'_&gt;,
@@ -856,13 +856,13 @@ how much they have ordered total. We want to ignore small
orders, which we defin
import pyarrow as pa
import pyarrow.compute as pc
-IGNORE_THESHOLD = 5000.0
+IGNORE_THRESHOLD = 5000.0
class AboveThresholdAccum(Accumulator):
def __init__(self) -&gt; None:
self._sum = 0.0
def update(self, values: pa.Array) -&gt; None:
- over_threshold = pc.greater(values, pa.scalar(IGNORE_THESHOLD))
+ over_threshold = pc.greater(values, pa.scalar(IGNORE_THRESHOLD))
sum_above = pc.sum(values.filter(over_threshold)).as_py()
if sum_above is None:
sum_above = 0.0
@@ -996,7 +996,7 @@ to their Rust counterparts.</li>
<p>The most significant difference is that we have added wrapper
functions and classes for most of the
user facing interface. These wrappers, written in Python, contain both
documentation and type
annotations.</p>
-<p>This documenation is now available on the <a
href="https://datafusion.apache.org/python/autoapi/datafusion/index.html">DataFusion
in Python API</a> website. There you can browse
+<p>This documentation is now available on the <a
href="https://datafusion.apache.org/python/autoapi/datafusion/index.html">DataFusion
in Python API</a> website. There you can browse
the available functions and classes to see the breadth of available
functionality.</p>
<p>Modern IDEs use language servers such as
<a
href="https://marketplace.visualstudio.com/items?itemName=ms-python.vscode-pylance">Pylance</a>
or
diff --git a/output/feeds/xiangpeng-hao-andrew-lamb.atom.xml
b/output/feeds/xiangpeng-hao-andrew-lamb.atom.xml
index 155166a..eba0839 100644
--- a/output/feeds/xiangpeng-hao-andrew-lamb.atom.xml
+++ b/output/feeds/xiangpeng-hao-andrew-lamb.atom.xml
@@ -193,8 +193,8 @@ Figure 1 illustrates the difference between the output of
both string representa
<h1 id="when-to-gc">When to GC?<a class="headerlink"
href="#when-to-gc" title="Permanent link">¶</a></h1>
<p>Zero-copy <code>take/filter</code> is great for
generating large arrays quickly, but it is suboptimal for highly selective
filters, where most of the strings are filtered out. When the cardinality
drops, StringViewArray buffers become sparse—only a small subset of the bytes
in the buffer’s memory are referred to by any <code>view</code>.
This leads to excessive memory usage, especially in a <a
href="https://github.com/apache/datafusion/issues/11628"> [...]
<p>To release unused memory, we implemented a <a
href="https://docs.rs/arrow/latest/arrow/array/struct.GenericByteViewArray.html#method.gc">garbage
collection (GC)</a> routine to consolidate the data into a new buffer to
release the old sparse buffer(s). As the GC operation copies strings, similarly
to StringArray, we must be careful about when to call it. If we call GC too
early, we cause unnecessary copying, losing much of the benefit of
StringViewArray. If we call GC [...]
-<p><code>arrow-rs</code> implements the GC process, but it
is up to users to decide when to call it. We leverage the semantics of the
query engine and observed that the <a
href="https://docs.rs/datafusion/latest/datafusion/physical_plan/coalesce_batches/struct.CoalesceBatchesExec.html"><code>CoalseceBatchesExec</code></a>
operator, which merge smaller batches to a larger batch, is often used after
the record cardinality is expected to shrink, whi [...]
-We, therefore,<a href="https://github.com/apache/datafusion/pull/11587">
implemented the GC procedure</a> inside
<code>CoalseceBatchesExec</code>[^5] with a heuristic that
estimates when the buffers are too sparse.</p>
+<p><code>arrow-rs</code> implements the GC process, but it
is up to users to decide when to call it. We leverage the semantics of the
query engine and observed that the <a
href="https://docs.rs/datafusion/latest/datafusion/physical_plan/coalesce_batches/struct.CoalesceBatchesExec.html"><code>CoalesceBatchesExec</code></a>
operator, which merge smaller batches to a larger batch, is often used after
the record cardinality is expected to shrink, whi [...]
+We, therefore,<a href="https://github.com/apache/datafusion/pull/11587">
implemented the GC procedure</a> inside
<code>CoalesceBatchesExec</code>[^5] with a heuristic that
estimates when the buffers are too sparse.</p>
<h2 id="the-art-of-function-inlining-not-too-much-not-too-little">The
art of function inlining: not too much, not too little<a class="headerlink"
href="#the-art-of-function-inlining-not-too-much-not-too-little"
title="Permanent link">¶</a></h2>
<p>Like string inlining, <em>function</em> inlining is the
process of embedding a short function into the caller to avoid the overhead of
function calls (caller/callee save).
Usually, the Rust compiler does a good job of deciding when to inline.
However, it is possible to override its default using the <a
href="https://doc.rust-lang.org/reference/attributes/codegen.html#the-inline-attribute"><code>#[inline(always)]</code>
directive</a>.
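For readers following the StringViewArray GC discussion in the hunks above: the "buffers are too sparse" decision can be illustrated with a small dependency-free sketch. The function name and the cutoff value below are hypothetical, chosen only to show the shape of the heuristic; the actual logic lives in arrow-rs (`GenericByteViewArray::gc`) and DataFusion's `CoalesceBatchesExec`:

```python
def should_gc(referenced_bytes: int, buffer_bytes: int,
              threshold: float = 0.5) -> bool:
    """Hypothetical sparsity check: trigger GC when the views reference
    less than `threshold` of the backing buffer's bytes.

    referenced_bytes: bytes in the buffer still pointed to by any view.
    buffer_bytes: total size of the backing buffer(s).
    """
    if buffer_bytes == 0:
        return False  # nothing to reclaim
    return referenced_bytes / buffer_bytes < threshold
```

After a highly selective filter, most buffer bytes are no longer referenced by any view, so the ratio drops below the cutoff and a GC pass would consolidate the survivors into a fresh, dense buffer; calling it earlier would just copy strings for no benefit.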
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]