tustvold commented on code in PR #264:
URL: https://github.com/apache/arrow-site/pull/264#discussion_r1010891369


##########
_posts/2022-10-30-multi-column-sorts-in-arrow-rust-part-1.md:
##########
@@ -0,0 +1,217 @@
+---
+layout: post
+title: "Fast and Memory Efficient Multi-Column Sorts in Apache Arrow Rust, 
Part 1"
+date: "2022-10-30 00:00:00"
+author: "tustvold and alamb"
+categories: [arrow]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+## Introduction
+
+Sorting is one of the most fundamental operations in modern databases and 
other analytic systems, underpinning common operators such as aggregates, 
joins, window functions, merge, and more. By some estimates, more than half of 
the execution time in data processing systems is spent sorting. Optimizing 
sorts is therefore vital to improving query performance and overall system 
efficiency.
+
+Sorting is also one of the most well studied topics in computer science. The 
classic survey paper for databases is [Implementing Sorting in Database 
Systems](https://dl.acm.org/doi/10.1145/1132960.1132964) by Goetz Graefe which 
provides a thorough academic treatment that is still very applicable today. 
However, it may not be obvious how to apply the wisdom and advanced techniques 
described in that paper to modern systems.
+
+In this blog post we explain in detail the new [row 
format](https://docs.rs/arrow/25.0.0/arrow/row/index.html) in the [Rust 
implementation](https://github.com/apache/arrow-rs) of [Apache 
Arrow](https://arrow.apache.org/), and how this can be used to perform 
blazingly fast multi-column sorts. The excellent [DuckDB blog on 
sorting](https://duckdb.org/2021/08/27/external-sorting.html) highlights 
several sorting techniques, and mentions such a comparable row format, but it 
does not explain how to efficiently sort variable length strings or dictionary 
encoded data, which we do in this post.

Review Comment:
   ```suggestion
   In this blog post we explain in detail the new [row 
format](https://docs.rs/arrow/25.0.0/arrow/row/index.html) in the [Rust 
implementation](https://github.com/apache/arrow-rs) of [Apache 
Arrow](https://arrow.apache.org/), and how this can be used to perform 
blazingly fast multi-column sorts. The excellent [DuckDB blog on 
sorting](https://duckdb.org/2021/08/27/external-sorting.html) highlights 
several sorting techniques, and mentions such a comparable row format, but it 
does not explain how to efficiently sort variable length strings or dictionary 
encoded data, which we do in this series.
   ```



##########
_posts/2022-10-30-multi-column-sorts-in-arrow-rust-part-1.md:
##########
@@ -0,0 +1,217 @@
+---
+layout: post
+title: "Fast and Memory Efficient Multi-Column Sorts in Apache Arrow Rust, 
Part 1"
+date: "2022-10-30 00:00:00"
+author: "tustvold and alamb"
+categories: [arrow]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+## Introduction
+
+Sorting is one of the most fundamental operations in modern databases and 
other analytic systems, underpinning common operators such as aggregates, 
joins, window functions, merge, and more. By some estimates, more than half of 
the execution time in data processing systems is spent sorting. Optimizing 
sorts is therefore vital to improving query performance and overall system 
efficiency.
+
+Sorting is also one of the most well studied topics in computer science. The 
classic survey paper for databases is [Implementing Sorting in Database 
Systems](https://dl.acm.org/doi/10.1145/1132960.1132964) by Goetz Graefe which 
provides a thorough academic treatment that is still very applicable today. 
However, it may not be obvious how to apply the wisdom and advanced techniques 
described in that paper to modern systems.
+
+In this blog post we explain in detail the new [row 
format](https://docs.rs/arrow/25.0.0/arrow/row/index.html) in the [Rust 
implementation](https://github.com/apache/arrow-rs) of [Apache 
Arrow](https://arrow.apache.org/), and how this can be used to perform 
blazingly fast multi-column sorts. The excellent [DuckDB blog on 
sorting](https://duckdb.org/2021/08/27/external-sorting.html) highlights 
several sorting techniques, and mentions such a comparable row format, but it 
does not explain how to efficiently sort variable length strings or dictionary 
encoded data, which we do in this post.
+
+## Multicolumn / Lexicographical Sort Problem
+
+Most languages have native, optimized operations to sort a single column 
(array) of data, which are specialized based on the type of data being sorted. 
The reason that sorting is typically more challenging in analytic systems is 
that they must:
+
+1. Support sorting by multiple columns of data
+2. The column types are not knowable at compile time, and thus the compiler 
can not typically generate optimized code.
+
+Multicolumn sorting is also referred to as lexicographical sorting in some 
libraries.
+
+For example, given sales data for various customers and their state of 
residence, a user might want to find the lowest 10 orders for each state. One 
way to do so is to order the data first by `State` and then by `Orders`:
+
+```text
+Customer | State | Orders
+—--------+-------+-------
+12345    |  MA   |  10.12
+532432   |  MA   |  8.44
+12345    |  CA   |  3.25
+56232    |  WA   |  6.00
+23442    |  WA   |  132.50
+7844     |  CA   |  9.33
+852353   |  MA   |  1.30
+```
+
+(Note: While there are specialized ways for computing this particular query 
other that sorting the entire input (“TopK”), they typically need the same 
multi-column comparison operation described below, so we will use the 
simplified example in our post but it does apply more broadly)
+
+## Basic Implementation
+
+Let us take the example of a basic sort kernel which takes a set of columns as 
input, and returns a list of indices identifying the sorted order.

Review Comment:
   ```suggestion
   Let us take the example of a basic sort kernel which takes a set of columns 
as input, and returns a list of indices identifying a sorted order.
   ```



##########
_posts/2022-10-30-multi-column-sorts-in-arrow-rust-part-1.md:
##########
@@ -0,0 +1,217 @@
+---
+layout: post
+title: "Fast and Memory Efficient Multi-Column Sorts in Apache Arrow Rust, 
Part 1"
+date: "2022-10-30 00:00:00"
+author: "tustvold and alamb"
+categories: [arrow]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+## Introduction
+
+Sorting is one of the most fundamental operations in modern databases and 
other analytic systems, underpinning common operators such as aggregates, 
joins, window functions, merge, and more. By some estimates, more than half of 
the execution time in data processing systems is spent sorting. Optimizing 
sorts is therefore vital to improving query performance and overall system 
efficiency.
+
+Sorting is also one of the most well studied topics in computer science. The 
classic survey paper for databases is [Implementing Sorting in Database 
Systems](https://dl.acm.org/doi/10.1145/1132960.1132964) by Goetz Graefe which 
provides a thorough academic treatment that is still very applicable today. 
However, it may not be obvious how to apply the wisdom and advanced techniques 
described in that paper to modern systems.
+
+In this blog post we explain in detail the new [row 
format](https://docs.rs/arrow/25.0.0/arrow/row/index.html) in the [Rust 
implementation](https://github.com/apache/arrow-rs) of [Apache 
Arrow](https://arrow.apache.org/), and how this can be used to perform 
blazingly fast multi-column sorts. The excellent [DuckDB blog on 
sorting](https://duckdb.org/2021/08/27/external-sorting.html) highlights 
several sorting techniques, and mentions such a comparable row format, but it 
does not explain how to efficiently sort variable length strings or dictionary 
encoded data, which we do in this post.
+
+## Multicolumn / Lexicographical Sort Problem
+
+Most languages have native, optimized operations to sort a single column 
(array) of data, which are specialized based on the type of data being sorted. 
The reason that sorting is typically more challenging in analytic systems is 
that they must:
+
+1. Support sorting by multiple columns of data

Review Comment:
   ```suggestion
   Most languages have native, optimized operations to sort a single column 
(array) of data, which are specialized based on the type of data being sorted. 
The reason that sorting is typically more challenging in analytic systems is 
that:
   
   1. They must support sorting by multiple columns of data
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to