This is an automated email from the ASF dual-hosted git repository.

jeffreyvo pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/datafusion-site.git


The following commit(s) were added to refs/heads/main by this push:
     new f164a2d  Ci/add typos spellcheck (#145)
f164a2d is described below

commit f164a2debef3d1f8e27d7e4fdc4cf1a77c58399a
Author: Abhinandan Kaushik <[email protected]>
AuthorDate: Fri Feb 6 07:17:21 2026 +0530

    Ci/add typos spellcheck (#145)
    
    * fixed:about
    
    * Update content/theme/templates/menu.html
    
    Co-authored-by: Copilot <[email protected]>
    
    * added spell check action
    
    * added spell check github-action
    
    * fixed blog-typos and added flase-positive in the list
    
    * typo lower-case issue fixed
    
    * typos fix-attempt-1
    
    ---------
    
    Co-authored-by: Copilot <[email protected]>
---
 .github/workflows/typo-check.yml                   | 36 +++++++++++++++++
 _typos.toml                                        | 47 ++++++++++++++++++++++
 content/blog/2022-02-28-datafusion-7.0.0.md        |  2 +-
 content/blog/2023-01-19-datafusion-16.0.0.md       |  2 +-
 content/blog/2024-01-19-datafusion-34.0.0.md       |  2 +-
 .../blog/2024-08-20-python-datafusion-40.0.0.md    |  2 +-
 ...9-13-string-view-german-style-strings-part-2.md |  4 +-
 ...2024-11-19-datafusion-python-udf-comparisons.md |  8 ++--
 .../blog/2024-12-14-datafusion-python-43.1.0.md    |  4 +-
 .../blog/2025-03-30-datafusion-python-46.0.0.md    |  2 +-
 10 files changed, 96 insertions(+), 13 deletions(-)

diff --git a/.github/workflows/typo-check.yml b/.github/workflows/typo-check.yml
new file mode 100644
index 0000000..7d90595
--- /dev/null
+++ b/.github/workflows/typo-check.yml
@@ -0,0 +1,36 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+name: Typo Check
+
+on:
+  push:
+    branches:
+      - main
+  pull_request:
+
+jobs:
+  typos:
+    name: Spell Check with Typos
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@08c6903cd8c0fde910a37f88322edcfb5dd907a8  # v5.0.0
+        with:
+          persist-credentials: false
+      - uses: crate-ci/typos@40156d6074bf731adb169cfb8234954971dbc487  # v1.37.1
+        with:
+          files: ./content/blog/
diff --git a/_typos.toml b/_typos.toml
new file mode 100644
index 0000000..da91470
--- /dev/null
+++ b/_typos.toml
@@ -0,0 +1,47 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+# Configuration for typos spell checker
+# https://github.com/crate-ci/typos
+
+# Ignore patterns for strings that contain false positives
+[default]
+extend-ignore-re = [
+  "chloro-pn",                # GitHub username
+  "2010YOUY01",               # GitHub username  
+  "RinChanNOWWW",             # GitHub username
+  "ANDed",                    # Technical term (ANDed predicates)
+  "NDJson",                   # Data format name
+  "efully express fo\\|",     # TPC-H dataset artifact (truncated data)
+]
+
+# Custom dictionary for technical terms (whole word matching only)
+[default.extend-words]
+# GitHub usernames (lowercase for case-insensitive matching)
+youy = "youy"
+
+# Product/Service names
+vertica = "vertica"
+
+# Personal names
+parth = "parth"
+authers = "authers"
+
+# Technical terms
+ndjson = "ndjson"
+anded = "anded"
+rin = "rin"
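To illustrate how the `extend-ignore-re` patterns in the config above behave, here is a small Python sketch. This is an approximation: `typos` evaluates these patterns with the Rust `regex` crate, but for simple literal patterns like these, Python's `re` module gives the same results. The patterns are copied from the `[default]` section above (note the TOML double backslash `\\|` denotes the regex escape `\|`); the sample strings are invented for illustration.

```python
import re

# Patterns copied from the extend-ignore-re list in _typos.toml above.
ignore_patterns = [
    r"chloro-pn",            # GitHub username
    r"2010YOUY01",           # GitHub username
    r"RinChanNOWWW",         # GitHub username
    r"ANDed",                # Technical term (ANDed predicates)
    r"NDJson",               # Data format name
    r"efully express fo\|",  # TPC-H dataset artifact (truncated data)
]

def is_ignored(text: str) -> bool:
    """Return True if any ignore pattern matches somewhere in the text,
    meaning the spell checker would skip flagging it."""
    return any(re.search(p, text) for p in ignore_patterns)

print(is_ignored("thanks to chloro-pn for the fix"))  # True: username is whitelisted
print(is_ignored("predicates are ANDed together"))    # True: technical term
print(is_ignored("a genuinely misspelled wrod"))      # False: still flagged
```

The `[default.extend-words]` entries work differently: mapping a word to itself (e.g. `vertica = "vertica"`) tells `typos` the word is valid, with whole-word matching rather than substring regex matching.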
diff --git a/content/blog/2022-02-28-datafusion-7.0.0.md b/content/blog/2022-02-28-datafusion-7.0.0.md
index 860351a..3b93d5e 100644
--- a/content/blog/2022-02-28-datafusion-7.0.0.md
+++ b/content/blog/2022-02-28-datafusion-7.0.0.md
@@ -95,7 +95,7 @@ The following section highlights some of the improvements in this release. Of co
   - Switch from `std::sync::Mutex` to `parking_lot::Mutex` [#1720](https://github.com/apache/arrow-datafusion/pull/1720)
 - New Features
   - Support for memory tracking and spilling to disk
-    - MemoryMananger and DiskManager [#1526](https://github.com/apache/arrow-datafusion/pull/1526)
+    - MemoryManager and DiskManager [#1526](https://github.com/apache/arrow-datafusion/pull/1526)
     - Out of core sort [#1526](https://github.com/apache/arrow-datafusion/pull/1526)
     - New metrics
       - `Gauge` and `CurrentMemoryUsage` [#1682](https://github.com/apache/arrow-datafusion/pull/1682)
diff --git a/content/blog/2023-01-19-datafusion-16.0.0.md b/content/blog/2023-01-19-datafusion-16.0.0.md
index e04d22f..c2f8e28 100644
--- a/content/blog/2023-01-19-datafusion-16.0.0.md
+++ b/content/blog/2023-01-19-datafusion-16.0.0.md
@@ -157,7 +157,7 @@ SQL support continues to improve, including some of these highlights:
 - Automatic coercions ast between Date and Timestamp [#4726]
 - Support type coercion for timestamp and utf8 [#4312]
 - Full support for time32 and time64 literal values (`ScalarValue`) [#4156]
-- New functions, incuding `uuid()`  [#4041], `current_time`  [#4054], `current_date` [#4022]
+- New functions, including `uuid()`  [#4041], `current_time`  [#4054], `current_date` [#4022]
 - Compressed CSV/JSON support [#3642]
 
 The community has also invested in new [sqllogic based](https://github.com/apache/arrow-datafusion/blob/master/datafusion/core/tests/sqllogictests/README.md) tests to keep improving DataFusion's quality with less effort.
diff --git a/content/blog/2024-01-19-datafusion-34.0.0.md b/content/blog/2024-01-19-datafusion-34.0.0.md
index 897331f..2f95ccc 100644
--- a/content/blog/2024-01-19-datafusion-34.0.0.md
+++ b/content/blog/2024-01-19-datafusion-34.0.0.md
@@ -292,7 +292,7 @@ LIMIT 3;
 
 
 ### Growth of DataFusion 📈
-DataFusion has been appearing more publically in the wild. For example
+DataFusion has been appearing more publicly in the wild. For example
 * New projects built using DataFusion such as [lancedb], [GlareDB], [Arroyo], and [optd].
 * Public talks such as [Apache Arrow Datafusion: Vectorized
   Execution Framework For Maximum Performance] in [CommunityOverCode Asia 2023]
diff --git a/content/blog/2024-08-20-python-datafusion-40.0.0.md b/content/blog/2024-08-20-python-datafusion-40.0.0.md
index 63de217..dd3b4e6 100644
--- a/content/blog/2024-08-20-python-datafusion-40.0.0.md
+++ b/content/blog/2024-08-20-python-datafusion-40.0.0.md
@@ -59,7 +59,7 @@ The most significant difference is that we have added wrapper functions and clas
 user facing interface. These wrappers, written in Python, contain both documentation and type
 annotations.
 
-This documenation is now available on the [DataFusion in Python API] website. There you can browse
+This documentation is now available on the [DataFusion in Python API] website. There you can browse
 the available functions and classes to see the breadth of available functionality.
 
 Modern IDEs use language servers such as
diff --git a/content/blog/2024-09-13-string-view-german-style-strings-part-2.md b/content/blog/2024-09-13-string-view-german-style-strings-part-2.md
index 4e321ec..7fb64f5 100644
--- a/content/blog/2024-09-13-string-view-german-style-strings-part-2.md
+++ b/content/blog/2024-09-13-string-view-german-style-strings-part-2.md
@@ -80,8 +80,8 @@ Zero-copy `take/filter` is great for generating large arrays quickly, but it is
 
 To release unused memory, we implemented a [garbage collection (GC)](https://docs.rs/arrow/latest/arrow/array/struct.GenericByteViewArray.html#method.gc) routine to consolidate the data into a new buffer to release the old sparse buffer(s). As the GC operation copies strings, similarly to StringArray, we must be careful about when to call it. If we call GC too early, we cause unnecessary copying, losing much of the benefit of StringViewArray. If we call GC too late, we hold large buffers [...]
 
-`arrow-rs` implements the GC process, but it is up to users to decide when to call it. We leverage the semantics of the query engine and observed that the [`CoalseceBatchesExec`](https://docs.rs/datafusion/latest/datafusion/physical_plan/coalesce_batches/struct.CoalesceBatchesExec.html) operator, which merge smaller batches to a larger batch, is often used after the record cardinality is expected to shrink, which aligns perfectly with the scenario of GC in StringViewArray.
-We, therefore,[ implemented the GC procedure](https://github.com/apache/datafusion/pull/11587) inside <code>CoalseceBatchesExec</code>[^5] with a heuristic that estimates when the buffers are too sparse.
+`arrow-rs` implements the GC process, but it is up to users to decide when to call it. We leverage the semantics of the query engine and observed that the [`CoalesceBatchesExec`](https://docs.rs/datafusion/latest/datafusion/physical_plan/coalesce_batches/struct.CoalesceBatchesExec.html) operator, which merge smaller batches to a larger batch, is often used after the record cardinality is expected to shrink, which aligns perfectly with the scenario of GC in StringViewArray.
+We, therefore,[ implemented the GC procedure](https://github.com/apache/datafusion/pull/11587) inside <code>CoalesceBatchesExec</code>[^5] with a heuristic that estimates when the buffers are too sparse.
 
 
 ## The art of function inlining: not too much, not too little
diff --git a/content/blog/2024-11-19-datafusion-python-udf-comparisons.md b/content/blog/2024-11-19-datafusion-python-udf-comparisons.md
index fb0926e..6a11f38 100644
--- a/content/blog/2024-11-19-datafusion-python-udf-comparisons.md
+++ b/content/blog/2024-11-19-datafusion-python-udf-comparisons.md
@@ -107,7 +107,7 @@ I have a DataFrame with many values that I want to aggregate. I have already ana
 determined there is a noise level below which I do not want to include in my analysis. I want to
 compute a sum of only values that are above my noise threshold.
 
-This can be done fairly easy without leaning on a User Defined Aggegate Function (UDAF). You can
+This can be done fairly easy without leaning on a User Defined Aggregate Function (UDAF). You can
 simply filter the DataFrame and then aggregate using the built-in `sum` function. Here, we
 demonstrate doing this as a UDF primarily as an example of how to write UDAFs. We will use the
 PyArrow compute approach.
@@ -293,7 +293,7 @@ transition. In the second implementation you can see how we can iterate through
 ourselves.
 
 In this first example, we are hard coding the values of interest, but in the following section
-we demonstrate passing these in during initalization.
+we demonstrate passing these in during initialization.
 
 ```rust
 #[pyfunction]
@@ -540,13 +540,13 @@ from datafusion import Accumulator, udaf
 import pyarrow as pa
 import pyarrow.compute as pc
 
-IGNORE_THESHOLD = 5000.0
+IGNORE_THRESHOLD = 5000.0
 class AboveThresholdAccum(Accumulator):
     def __init__(self) -> None:
         self._sum = 0.0
 
     def update(self, values: pa.Array) -> None:
-        over_threshold = pc.greater(values, pa.scalar(IGNORE_THESHOLD))
+        over_threshold = pc.greater(values, pa.scalar(IGNORE_THRESHOLD))
         sum_above = pc.sum(values.filter(over_threshold)).as_py()
         if sum_above is None:
             sum_above = 0.0
diff --git a/content/blog/2024-12-14-datafusion-python-43.1.0.md b/content/blog/2024-12-14-datafusion-python-43.1.0.md
index bad007b..9987256 100644
--- a/content/blog/2024-12-14-datafusion-python-43.1.0.md
+++ b/content/blog/2024-12-14-datafusion-python-43.1.0.md
@@ -54,7 +54,7 @@ consistent method for exposing these data structures across libraries.
 In [PR #825], we introduced support for both importing and exporting Arrow data in
 `datafusion-python`. With this improvement, you can now use a single function call to import
 a table from **any** Python library that implements the [Arrow PyCapsule Interface].
-Many popular libaries, such as [Pandas](https://pandas.pydata.org/) and [Polars](https://pola.rs/)
+Many popular libraries, such as [Pandas](https://pandas.pydata.org/) and [Polars](https://pola.rs/)
 already support these interfaces.
 
 Suppose you have a Pandas and Polars DataFrames named `df_pandas` or `df_polars`, respectively:
@@ -146,7 +146,7 @@ gains in some tests.
 
 During our testing we identified some cases where we needed to adjust workflows to
 account for the fact that StringView is now the default type for string based operations.
-First, when performing manipulations on string objects there is a perfomance loss when
+First, when performing manipulations on string objects there is a performance loss when
 needing to cast from string to string view or vice versa. To reap the best performance,
 ideally all of your string type data will use StringView. For most users this should be
 transparent. However if you specify a schema for reading or creating data, then you
diff --git a/content/blog/2025-03-30-datafusion-python-46.0.0.md b/content/blog/2025-03-30-datafusion-python-46.0.0.md
index d64a1ad..357aa8a 100644
--- a/content/blog/2025-03-30-datafusion-python-46.0.0.md
+++ b/content/blog/2025-03-30-datafusion-python-46.0.0.md
@@ -84,7 +84,7 @@ ctx.register_view("view1", df1)
 ```
 
 And then in another portion of your code which has access to the same session context
-you can retrive the DataFrame with:
+you can retrieve the DataFrame with:
 
 ```
 df2 = ctx.table("view1")


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
