Re: [PR] Blog post: Recent Improvements to Hash Join in Arrow C++ [arrow-site]

via GitHub Mon, 14 Jul 2025 17:49:47 -0700


lidavidm commented on code in PR #667:
URL: https://github.com/apache/arrow-site/pull/667#discussion_r2206057719



##########
_posts/2025-07-07-recent-improvements-to-hash-join.md:
##########
@@ -0,0 +1,151 @@
+---
+layout: post
+title: "Recent Improvements to Hash Join in Arrow C++"
+description: "A deep dive into recent improvements to Apache Arrow’s hash join 
implementation—enhancing stability, memory efficiency, and parallel performance 
for modern analytic workloads."
+date: "2025-07-07 00:00:00"
+author: zanmato
+categories: [application]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+*Edited by Apache Arrow PMC.*
+
+*Editor’s Note: Apache Arrow is an expansive project, ranging from the Arrow 
columnar format itself, to its numerous specifications, and a long list of 
implementations. Arrow is also an expansive project in terms of its community 
of contributors. In this blog post, we’d like to highlight recent work by 
Apache Arrow Committer Rossi Sun on improving the performance and stability of 
Arrow’s embeddable query execution engine: Acero.*
+
+# Introduction
+
+Hash join is a fundamental operation in analytical processing engines. In the 
C++ implementation of Apache Arrow, the hash join is implemented in the C++ 
engine Acero, which powers query execution in bindings like PyArrow and the R 
Arrow package. Even if you haven't used Acero directly, your code may already 
be benefiting from it under the hood.
+
+For example, this simple PyArrow is using Acero:
+```python
+import pandas as pd
+import pyarrow as pa
+df1 = pd.DataFrame({'id': [1, 2, 3],
+                    'year': [2020, 2022, 2019]})
+df2 = pd.DataFrame({'id': [3, 4],
+                    'n_legs': [5, 100],
+                    'animal': ["Brittle stars", "Centipede"]})
+t1 = pa.Table.from_pandas(df1)
+t2 = pa.Table.from_pandas(df2)
+t1.join(t2, 'id').combine_chunks().sort_by('year')
+```
+
+Acero was originally created in 2019 to demonstrate that the ever-growing 
library of compute kernels in Arrow C++ could be linked together into realistic 
workflows and also to take advantage of the emerging Datasets API to give these 
workflows access to data. While it was never intended to be an alternative to 
more popular tools of the time nor has it tried to compete with tools that have 
emerged since (such as DuckDB), Acero has certainly proved its original purpose 
and continued to evolve to meet user needs.

Review Comment:
   I think my question here is: what are those user needs, if Acero isn't 
intended to compete? It's a bit off topic for the actual subject, but given that 
the conclusion mentions building blocks, maybe that idea could be introduced here.



##########
_posts/2025-07-07-recent-improvements-to-hash-join.md:
##########
@@ -0,0 +1,151 @@
+For example, this simple PyArrow is using Acero:
+```python
+import pandas as pd
+import pyarrow as pa
+df1 = pd.DataFrame({'id': [1, 2, 3],
+                    'year': [2020, 2022, 2019]})
+df2 = pd.DataFrame({'id': [3, 4],
+                    'n_legs': [5, 100],
+                    'animal': ["Brittle stars", "Centipede"]})
+t1 = pa.Table.from_pandas(df1)
+t2 = pa.Table.from_pandas(df2)
+t1.join(t2, 'id').combine_chunks().sort_by('year')
+```

Review Comment:
   Maybe use `pyarrow.Table.from_pydict` instead of pulling in pandas?



##########
_posts/2025-07-07-recent-improvements-to-hash-join.md:
##########
@@ -0,0 +1,151 @@
+For example, this simple PyArrow is using Acero:

Review Comment:
   ```suggestion
   For example, this simple PyArrow example uses Acero:
   ```



