lidavidm commented on code in PR #667:
URL: https://github.com/apache/arrow-site/pull/667#discussion_r2206057719

##########
_posts/2025-07-07-recent-improvements-to-hash-join.md:
##########
@@ -0,0 +1,151 @@
+---
+layout: post
+title: "Recent Improvements to Hash Join in Arrow C++"
+description: "A deep dive into recent improvements to Apache Arrow’s hash join implementation—enhancing stability, memory efficiency, and parallel performance for modern analytic workloads."
+date: "2025-07-07 00:00:00"
+author: zanmato
+categories: [application]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+*Edited by Apache Arrow PMC.*
+
+*Editor’s Note: Apache Arrow is an expansive project, ranging from the Arrow columnar format itself, to its numerous specifications, and a long list of implementations. Arrow is also an expansive project in terms of its community of contributors. In this blog post, we’d like to highlight recent work by Apache Arrow Committer Rossi Sun on improving the performance and stability of Arrow’s embeddable query execution engine: Acero.*
+
+# Introduction
+
+Hash join is a fundamental operation in analytical processing engines. In the C++ implementation of Apache Arrow, the hash join is implemented in the C++ engine Acero, which powers query execution in bindings like PyArrow and the R Arrow package. Even if you haven't used Acero directly, your code may already be benefiting from it under the hood.
+
+For example, this simple PyArrow is using Acero:
+```python
+import pandas as pd
+import pyarrow as pa
+df1 = pd.DataFrame({'id': [1, 2, 3],
+                    'year': [2020, 2022, 2019]})
+df2 = pd.DataFrame({'id': [3, 4],
+                    'n_legs': [5, 100],
+                    'animal': ["Brittle stars", "Centipede"]})
+t1 = pa.Table.from_pandas(df1)
+t2 = pa.Table.from_pandas(df2)
+t1.join(t2, 'id').combine_chunks().sort_by('year')
+```
+
+Acero was originally created in 2019 to demonstrate that the ever-growing library of compute kernels in Arrow C++ could be linked together into realistic workflows and also to take advantage of the emerging Datasets API to give these workflows access to data. While it was never intended to be an alternative to more popular tools of the time nor has it tried to compete with tools that have emerged since (such as DuckDB), Acero has certainly proved its original purpose and continued to evolve to meet user needs.

Review Comment:
   I think my question here is: what are those user needs, if it isn't intended to compete? It's a bit off topic for the actual subject, but given that the conclusion mentions building blocks, maybe that idea could be introduced here


##########
_posts/2025-07-07-recent-improvements-to-hash-join.md:
##########
@@ -0,0 +1,151 @@
+For example, this simple PyArrow is using Acero:
+```python
+import pandas as pd
+import pyarrow as pa
+df1 = pd.DataFrame({'id': [1, 2, 3],
+                    'year': [2020, 2022, 2019]})
+df2 = pd.DataFrame({'id': [3, 4],
+                    'n_legs': [5, 100],
+                    'animal': ["Brittle stars", "Centipede"]})
+t1 = pa.Table.from_pandas(df1)
+t2 = pa.Table.from_pandas(df2)
+t1.join(t2, 'id').combine_chunks().sort_by('year')
+```

Review Comment:
   Maybe use `pyarrow.Table.from_dict` instead of pulling in Pandas?


##########
_posts/2025-07-07-recent-improvements-to-hash-join.md:
##########
@@ -0,0 +1,151 @@
+For example, this simple PyArrow is using Acero:

Review Comment:
   ```suggestion
   For example, this simple PyArrow example uses Acero:
   ```