alamb commented on code in PR #264:
URL: https://github.com/apache/arrow-site/pull/264#discussion_r1009385970
##########
_posts/2022-10-30-multi-column-sorts-in-arrow-rust-part-2.md:
##########
@@ -0,0 +1,240 @@
+---
+layout: post
+title: "Fast and Memory Efficient Multi-Column Sorts in Apache Arrow Rust,
Part 2"
+date: "2022-10-30 00:00:00"
+author: "tustvold and alamb"
+categories: [arrow]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+## Introduction
+
+In Part 1 <!-- TODO add link --> of this post, we described the problem of
Multi-Column Sorting and challenges to implementing it efficiently. This second
post explains how the new [row
format](https://docs.rs/arrow/25.0.0/arrow/row/index.html) in the [Rust
implementation](https://github.com/apache/arrow-rs) of [Apache
Arrow](https://arrow.apache.org/) sorts more quickly.
+
+
+## Row Format
+ Now that you have a taste for why a comparable byte array representation
is such a compelling primitive, you will be pleased to learn that the row
format added to arrow-rs is such a representation. The rest of this article
will explain how it works, but if you just want to use it, check out the
[docs](https://docs.rs/arrow/latest/arrow/row/index.html), the
[code](https://github.com/apache/arrow-rs/blob/07024f6/arrow/src/row/mod.rs#L105-L187),
and [bugtracker](https://github.com/apache/arrow-rs/issues).
Review Comment:
makes sense
##########
_posts/2022-10-30-multi-column-sorts-in-arrow-rust-part-1.md:
##########
@@ -0,0 +1,210 @@
+---
+layout: post
+title: "Fast and Memory Efficient Multi-Column Sorts in Apache Arrow Rust,
Part 1"
+date: "2022-10-30 00:00:00"
+author: "tustvold and alamb"
+categories: [arrow]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+## Introduction
+
+Sorting is one of the most fundamental operations in modern databases and
other analytic systems, underpinning common operators such as aggregates,
joins, window functions, merge, and more. By some estimates, more than half of
the execution time in data processing systems is spent sorting. Optimizing
sorts is therefore vital to improving query performance and overall system
efficiency.
+
+Sorting is also one of the most well studied topics in computer science. The
classic survey paper for databases is [Implementing Sorting in Database
Systems](https://dl.acm.org/doi/10.1145/1132960.1132964) by Goetz Graefe which
provides a thorough academic treatment that is still very applicable today.
However, it may not be obvious how to apply the wisdom and advanced techniques
described in that paper to modern systems.
+
+In this blog post we explain in detail the new [row
format](https://docs.rs/arrow/25.0.0/arrow/row/index.html) in the [Rust
implementation](https://github.com/apache/arrow-rs) of [Apache
Arrow](https://arrow.apache.org/), and how this can be used to perform
blazingly fast multi-column sorts. The excellent [DuckDB blog on
sorting](https://duckdb.org/2021/08/27/external-sorting.html) highlights
several sorting techniques, and mentions such a comparable row format, but it
does not explain how to efficiently sort variable length strings or dictionary
encoded data, which we do in this post.
+
+## Multicolumn / Lexicographical Sort Problem
+
+Most languages have native optimized operations to sort a single column
(array) of data and are specialized based on the type of data. The reason that
sorting is typically more challenging in analytic systems is that they must:
+1. Support sorting by multiple columns of data
+2. The column types are not knowable at compile time, and thus the compiler
can not typically generate optimized code.
+
+Multicolumn sorting is also referred to as lexicographical sorting in some
libraries.
+
+For example, given sales data for various customers and their state of
residence, a user might want to find the top 10 orders for each state. One way
to do so is to order the data first by State (ascending) and then by Orders
(descending)
+
+```sql
+Customer | State | Orders
+—--------+-------+-------
+12345 | MA | 10.12
+532432 | MA | 8.44
+12345 | CA | 3.25
+56232 | WA | 6.00
+23442 | WA | 132.50
+7844 | CA | 9.33
+852353 | MA | 1.30
+```
+
+(Note: While there are specialized ways for computing this particular query
other that sorting the entire input (“TopK”), they typically need the same
multi-column comparison operation described below, so we will use the
simplified example in our post but it does apply more broadly)
+
+## Basic Implementation
+
+Let us take the example of a basic sort kernel which takes a set of columns as
input, and returns a list of indices identifying the sorted order.
+
+```python
+> lexsort_to_indices([
+ [“MA”, “MA”, “CA”, “WA”, “WA”, “CA”, “MA”]
+])
+[2, 5, 0, 1, 6, 3, 4]
+
+> lexsort_to_indices([
Review Comment:
I agree we should omit sort options at this point in the blog as is not
critical to understanding lexsort (which is what this section is trying to get
out).
Are proposing to sort the values in "orders" ascending as well?
##########
_posts/2022-10-30-multi-column-sorts-in-arrow-rust-part-2.md:
##########
@@ -0,0 +1,240 @@
+---
+layout: post
+title: "Fast and Memory Efficient Multi-Column Sorts in Apache Arrow Rust,
Part 2"
+date: "2022-10-30 00:00:00"
+author: "tustvold and alamb"
+categories: [arrow]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+## Introduction
+
+In Part 1 <!-- TODO add link --> of this post, we described the problem of
Multi-Column Sorting and challenges to implementing it efficiently. This second
post explains how the new [row
format](https://docs.rs/arrow/25.0.0/arrow/row/index.html) in the [Rust
implementation](https://github.com/apache/arrow-rs) of [Apache
Arrow](https://arrow.apache.org/) sorts more quickly.
+
+
+## Row Format
+ Now that you have a taste for why a comparable byte array representation
is such a compelling primitive, you will be pleased to learn that the row
format added to arrow-rs is such a representation. The rest of this article
will explain how it works, but if you just want to use it, check out the
[docs](https://docs.rs/arrow/latest/arrow/row/index.html), the
[code](https://github.com/apache/arrow-rs/blob/07024f6/arrow/src/row/mod.rs#L105-L187),
and [bugtracker](https://github.com/apache/arrow-rs/issues).
+
+The Row Format is a variable length byte sequence created by concatenating the
encoded form of each column. The encoding for each column depends on both the
datatype as well as the requested sort options (ascending, descending, nulls
first and null last).
Review Comment:
I agree -- we can leave it alone until the last section.
##########
_posts/2022-10-30-multi-column-sorts-in-arrow-rust-part-2.md:
##########
@@ -0,0 +1,240 @@
+---
+layout: post
+title: "Fast and Memory Efficient Multi-Column Sorts in Apache Arrow Rust,
Part 2"
+date: "2022-10-30 00:00:00"
+author: "tustvold and alamb"
+categories: [arrow]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+## Introduction
+
+In Part 1 <!-- TODO add link --> of this post, we described the problem of
Multi-Column Sorting and challenges to implementing it efficiently. This second
post explains how the new [row
format](https://docs.rs/arrow/25.0.0/arrow/row/index.html) in the [Rust
implementation](https://github.com/apache/arrow-rs) of [Apache
Arrow](https://arrow.apache.org/) sorts more quickly.
+
+
+## Row Format
+ Now that you have a taste for why a comparable byte array representation
is such a compelling primitive, you will be pleased to learn that the row
format added to arrow-rs is such a representation. The rest of this article
will explain how it works, but if you just want to use it, check out the
[docs](https://docs.rs/arrow/latest/arrow/row/index.html), the
[code](https://github.com/apache/arrow-rs/blob/07024f6/arrow/src/row/mod.rs#L105-L187),
and [bugtracker](https://github.com/apache/arrow-rs/issues).
+
+The Row Format is a variable length byte sequence created by concatenating the
encoded form of each column. The encoding for each column depends on both the
datatype as well as the requested sort options (ascending, descending, nulls
first and null last).
+
+```
+ ┌─────┐ ┌─────┐ ┌─────┐
+ │ │ │ │ │ │
+ ├─────┤ ┌ ┼─────┼ ─ ┼─────┼ ┐ ┏━━━━━┳━━━━━━━━┓
+ │ │ │ │ │ │ ────────────▶┃ ┃ ┃
+ ├─────┤ └ ┼─────┼ ─ ┼─────┼ ┘ ┗━━━━━┻━━━━━━━━┛
+ │ │ │ │ │ │
+ └─────┘ └─────┘ └─────┘
+ ...
+ ┌─────┐ ┌ ┬─────┬ ─ ┬─────┬ ┐ ┏━━━━━━━━┓
+ │ │ │ │ │ │ ────────────▶┃ ┃
+ └─────┘ └ ┴─────┴ ─ ┴─────┴ ┘ ┗━━━━━━━━┛
+ Customer State Orders
+ UInt64 Utf8 F64
+
+ Input Arrays Row Format
+ (Columns)
+```
+
+### Unsigned Integers
+
+To encode a non-null unsigned integer, the byte `0x01` is written, followed by
the integer’s bytes starting with the most significant, i.e. big endian. A null
is encoded as a `0x00` byte, followed by the encoded bytes of the integer’s
zero value
+
+```
+ ┌──┬──┬──┬──┐ ┌──┬──┬──┬──┬──┐
+ 3 │03│00│00│00│ │01│00│00│00│03│
+ └──┴──┴──┴──┘ └──┴──┴──┴──┴──┘
+ ┌──┬──┬──┬──┐ ┌──┬──┬──┬──┬──┐
+ 258 │02│01│00│00│ │01│00│00│01│02│
+ └──┴──┴──┴──┘ └──┴──┴──┴──┴──┘
+ ┌──┬──┬──┬──┐ ┌──┬──┬──┬──┬──┐
+ 23423 │7F│5B│00│00│ │01│00│00│5B│7F│
+ └──┴──┴──┴──┘ └──┴──┴──┴──┴──┘
+ ┌──┬──┬──┬──┐ ┌──┬──┬──┬──┬──┐
+ NULL │??│??│??│??│ │00│00│00│00│00│
+ └──┴──┴──┴──┘ └──┴──┴──┴──┴──┘
+
+ 32-bit (4 bytes) Row Format
+ Value Little Endian
+```
+
+### Signed Integers
+
+In Rust and most modern computer architectures, signed integers are encoded
using [two's complement](https://en.wikipedia.org/wiki/Two%27s_complement),
where a number is negated by flipping all the bits, and adding 1. Therefore,
flipping the top-most bit and treating the result as an unsigned integer
preserves the order. This unsigned integer can then be encoded using the same
encoding for unsigned integers described in the previous section. For example
+
+```
Review Comment:
yes, that is a good call -- will update
##########
_posts/2022-10-30-multi-column-sorts-in-arrow-rust-part-2.md:
##########
@@ -0,0 +1,240 @@
+---
+layout: post
+title: "Fast and Memory Efficient Multi-Column Sorts in Apache Arrow Rust,
Part 2"
+date: "2022-10-30 00:00:00"
+author: "tustvold and alamb"
+categories: [arrow]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+## Introduction
+
+In Part 1 <!-- TODO add link --> of this post, we described the problem of
Multi-Column Sorting and challenges to implementing it efficiently. This second
post explains how the new [row
format](https://docs.rs/arrow/25.0.0/arrow/row/index.html) in the [Rust
implementation](https://github.com/apache/arrow-rs) of [Apache
Arrow](https://arrow.apache.org/) sorts more quickly.
+
+
+## Row Format
+ Now that you have a taste for why a comparable byte array representation
is such a compelling primitive, you will be pleased to learn that the row
format added to arrow-rs is such a representation. The rest of this article
will explain how it works, but if you just want to use it, check out the
[docs](https://docs.rs/arrow/latest/arrow/row/index.html), the
[code](https://github.com/apache/arrow-rs/blob/07024f6/arrow/src/row/mod.rs#L105-L187),
and [bugtracker](https://github.com/apache/arrow-rs/issues).
+
+The Row Format is a variable length byte sequence created by concatenating the
encoded form of each column. The encoding for each column depends on both the
datatype as well as the requested sort options (ascending, descending, nulls
first and null last).
+
+```
+ ┌─────┐ ┌─────┐ ┌─────┐
+ │ │ │ │ │ │
+ ├─────┤ ┌ ┼─────┼ ─ ┼─────┼ ┐ ┏━━━━━┳━━━━━━━━┓
+ │ │ │ │ │ │ ────────────▶┃ ┃ ┃
+ ├─────┤ └ ┼─────┼ ─ ┼─────┼ ┘ ┗━━━━━┻━━━━━━━━┛
+ │ │ │ │ │ │
+ └─────┘ └─────┘ └─────┘
+ ...
+ ┌─────┐ ┌ ┬─────┬ ─ ┬─────┬ ┐ ┏━━━━━━━━┓
+ │ │ │ │ │ │ ────────────▶┃ ┃
+ └─────┘ └ ┴─────┴ ─ ┴─────┴ ┘ ┗━━━━━━━━┛
+ Customer State Orders
+ UInt64 Utf8 F64
+
+ Input Arrays Row Format
+ (Columns)
+```
+
+### Unsigned Integers
+
+To encode a non-null unsigned integer, the byte `0x01` is written, followed by
the integer’s bytes starting with the most significant, i.e. big endian. A null
is encoded as a `0x00` byte, followed by the encoded bytes of the integer’s
zero value
+
+```
+ ┌──┬──┬──┬──┐ ┌──┬──┬──┬──┬──┐
+ 3 │03│00│00│00│ │01│00│00│00│03│
+ └──┴──┴──┴──┘ └──┴──┴──┴──┴──┘
+ ┌──┬──┬──┬──┐ ┌──┬──┬──┬──┬──┐
+ 258 │02│01│00│00│ │01│00│00│01│02│
+ └──┴──┴──┴──┘ └──┴──┴──┴──┴──┘
+ ┌──┬──┬──┬──┐ ┌──┬──┬──┬──┬──┐
+ 23423 │7F│5B│00│00│ │01│00│00│5B│7F│
+ └──┴──┴──┴──┘ └──┴──┴──┴──┴──┘
+ ┌──┬──┬──┬──┐ ┌──┬──┬──┬──┬──┐
+ NULL │??│??│??│??│ │00│00│00│00│00│
+ └──┴──┴──┴──┘ └──┴──┴──┴──┴──┘
+
+ 32-bit (4 bytes) Row Format
+ Value Little Endian
+```
+
+### Signed Integers
+
+In Rust and most modern computer architectures, signed integers are encoded
using [two's complement](https://en.wikipedia.org/wiki/Two%27s_complement),
where a number is negated by flipping all the bits, and adding 1. Therefore,
flipping the top-most bit and treating the result as an unsigned integer
preserves the order. This unsigned integer can then be encoded using the same
encoding for unsigned integers described in the previous section. For example
+
+```
+ ┌───┬───┬───┬───┐ ┌───┬───┬───┬───┐ ┌───┬───┬───┬───┬───┐
+ 5 │05 │00 │00 │00 │ │05 │00 │00 │128│ │01 │128│00 │00 │05 │
+ └───┴───┴───┴───┘ └───┴───┴───┴───┘ └───┴───┴───┴───┴───┘
+ ┌───┬───┬───┬───┐ ┌───┬───┬───┬───┐ ┌───┬───┬───┬───┬───┐
+ -5 │251│255│255│255│ │251│255│255│127│ │01 │127│255│255│251│
+ └───┴───┴───┴───┘ └───┴───┴───┴───┘ └───┴───┴───┴───┴───┘
+
+ Value 32-bit (4 bytes) High bit flipped Row Format
+ Little Endian
+```
+
+### Floating Point
+
+Floating point values can be ordered according to the [IEEE 754 totalOrder
predicate](https://en.wikipedia.org/wiki/IEEE_754#Total-ordering_predicate)
(implemented in Rust by
[f32::total_cmp](https://doc.rust-lang.org/std/primitive.f32.html#method.total_cmp)).
This ordering interprets the bytes of the floating point value as the
correspondingly sized, signed, little-endian integer, flipping all the bits
except the sign bit in the case of negatives.
+
+Floating point values are therefore encoded to row format by converting them
to the appropriate sized signed integer representation, and then using the same
encoding for signed integers described in the previous section.
+
+### Byte Arrays (Including Strings)
+
+Unlike primitive types above, byte arrays are variable length. For short
strings, such as `state` in our example above, it is possible to pad all values
to the length of the longest one with some fixed values such as `0x00` and
produce a fixed length row. This is the approach described in the DuckDB blog
for encode `c_birth_country`.
+
+However, often values in string columns differ substantially in length or the
maximum length is not known at the start of execution, making it inadvisable
and/or impractical to pad the strings to a fixed length, and thus the Rust
Arrow row format uses a variable length encoding.
+
+We need an encoding that unambiguously terminates the end of the byte array.
This not only permits recovering the original value from the row format, but
ensures bytes of a longer byte array won’t be compared against bytes from a
different field in a row with a shorter byte array during comparison.
+
+A null byte array is encoded as a single `0x00` byte. Similarly, an empty byte
array is encoded as a single `0x01` byte.
+
+To encode a non-null, non-empty array, first a single `0x02` byte is written.
Then the array is written in 32-byte blocks, with each block followed by a
`0xFF` byte as a continuation token. The final block is padded to 32-bytes with
`0x00`, and is then followed by the unpadded length of this final block as a
single byte.
+
+Note the following example encodings use a block size of 4 bytes, as opposed
to the 32 bytes for brevity
+
+```
+ ┌───┬───┬───┬───┬───┬───┬───┐
+ "MEEP" │02 │'M'│'E'│'E'│'P'│FF │04 │
+ └───┴───┴───┴───┴───┴───┴───┘
+
+ ┌───┬───┬───┬───┬───┬───┬───┐
+ "" │02 │00 │00 │00 │00 │FF │00 │
+ └───┴───┴───┴───┴───┴───┴───┘
+
+ NULL ┌───┬───┬───┬───┬───┬───┬───┐
+ │00 │00 │00 │00 │00 │FF │00 │
+ └───┴───┴───┴───┴───┴───┴───┘
+
+"Defenestration" ┌───┬───┬───┬───┬───┬───┐
+ │02 │'D'│'e'│'f'│'e'│FF │
+ └───┼───┼───┼───┼───┼───┤
+ │'n'│'e'│'s'│'t'│FF │
+ ├───┼───┼───┼───┼───┤
+ │'r'│'a'│'t'│'r'│FF │
+ ├───┼───┼───┼───┼───┤
+ │'a'│'t'│'i'│'o'│FF │
+ ├───┼───┼───┼───┼───┼───┐
+ │'n'│00 │00 │00 │FF │17 │
+ └───┴───┴───┴───┴───┴───┘
+```
+
+This approach is loosely inspired by [COBS
encoding](https://en.wikipedia.org/wiki/Consistent_Overhead_Byte_Stuffing), and
chosen over more traditional [byte
stuffing](https://en.wikipedia.org/wiki/High-Level_Data_Link_Control#Asynchronous_framing)
as it is more amenable to vectorization, in particular hardware with AVX-256
can copy a 32-byte block in a single instruction.
+
+### Dictionary Arrays
+Dictionary Encoded Data (called
[categorical](https://pandas.pydata.org/docs/user_guide/categorical.html) in
pandas) is increasingly important because they can store low cardinality data
very efficiently in parquet and similar formats.
Review Comment:
I guess another point that might be good to make is that the fact parquet
has special support for dictionary encoded columns is another indication of its
importance in the big data landscape
##########
_posts/2022-10-30-multi-column-sorts-in-arrow-rust-part-1.md:
##########
@@ -0,0 +1,210 @@
+---
+layout: post
+title: "Fast and Memory Efficient Multi-Column Sorts in Apache Arrow Rust,
Part 1"
+date: "2022-10-30 00:00:00"
+author: "tustvold and alamb"
+categories: [arrow]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+## Introduction
+
+Sorting is one of the most fundamental operations in modern databases and
other analytic systems, underpinning common operators such as aggregates,
joins, window functions, merge, and more. By some estimates, more than half of
the execution time in data processing systems is spent sorting. Optimizing
sorts is therefore vital to improving query performance and overall system
efficiency.
+
+Sorting is also one of the most well studied topics in computer science. The
classic survey paper for databases is [Implementing Sorting in Database
Systems](https://dl.acm.org/doi/10.1145/1132960.1132964) by Goetz Graefe which
provides a thorough academic treatment that is still very applicable today.
However, it may not be obvious how to apply the wisdom and advanced techniques
described in that paper to modern systems.
+
+In this blog post we explain in detail the new [row
format](https://docs.rs/arrow/25.0.0/arrow/row/index.html) in the [Rust
implementation](https://github.com/apache/arrow-rs) of [Apache
Arrow](https://arrow.apache.org/), and how this can be used to perform
blazingly fast multi-column sorts. The excellent [DuckDB blog on
sorting](https://duckdb.org/2021/08/27/external-sorting.html) highlights
several sorting techniques, and mentions such a comparable row format, but it
does not explain how to efficiently sort variable length strings or dictionary
encoded data, which we do in this post.
+
+## Multicolumn / Lexicographical Sort Problem
+
+Most languages have native optimized operations to sort a single column
(array) of data and are specialized based on the type of data. The reason that
sorting is typically more challenging in analytic systems is that they must:
+1. Support sorting by multiple columns of data
+2. The column types are not knowable at compile time, and thus the compiler
can not typically generate optimized code.
+
+Multicolumn sorting is also referred to as lexicographical sorting in some
libraries.
+
+For example, given sales data for various customers and their state of
residence, a user might want to find the top 10 orders for each state. One way
to do so is to order the data first by State (ascending) and then by Orders
(descending)
+
+```sql
+Customer | State | Orders
+—--------+-------+-------
+12345 | MA | 10.12
+532432 | MA | 8.44
+12345 | CA | 3.25
+56232 | WA | 6.00
+23442 | WA | 132.50
+7844 | CA | 9.33
+852353 | MA | 1.30
+```
+
+(Note: While there are specialized ways for computing this particular query
other that sorting the entire input (“TopK”), they typically need the same
multi-column comparison operation described below, so we will use the
simplified example in our post but it does apply more broadly)
+
+## Basic Implementation
+
+Let us take the example of a basic sort kernel which takes a set of columns as
input, and returns a list of indices identifying the sorted order.
+
+```python
+> lexsort_to_indices([
+ [“MA”, “MA”, “CA”, “WA”, “WA”, “CA”, “MA”]
+])
+[2, 5, 0, 1, 6, 3, 4]
+
+> lexsort_to_indices([
+ [“MA”, “MA”, “CA”, “WA”, “WA”, “CA”, “MA”],
+ [10.10, 8.44, 3.25, 6.00, 132.50, 9.33, 1.30]
+])
+[5, 2, 0, 1, 6, 4, 3]
+```
+
+This function returns a list of indices instead of sorting the columns
directly because it:
+1. Avoids expensive copying data during the sorting process
+2. Allows deferring copying of values until the latest possible moment
+3. Can be used to reorder additional columns that weren’t part of the sort key
+
+
+A straightforward implementation of lexsort_to_indices uses a comparator
function,
+
+```text
+ row
+ index
+ ┌─────┐ ┌─────┐ ┌─────┐ compare(left_index, right_index)
+ 0 │ │ │ │ │ │
+ ┌ ┼─────┼ ─ ┼─────┼ ─ ┼─────┼ │ │
+ │ │ │ │ │ ││◀──────────────────┘ │
+ └ ┼─────┼ ─ ┼─────┼ ─ ┼─────┼ │
+ │ │ │ │ │ │Comparator function compares one │
+ ├─────┤ ├─────┤ ├─────┤ multi-column row with another. │
+ │ │ │ │ │ │ │
+ ├─────┤ ├─────┤ ├─────┤ The data types of the columns │
+ │ │ │ │ │ │ and the sort options are not │
+ └─────┘ └─────┘ └─────┘ known at compile time, only │
+ ... runtime │
+ │
+ ┌┌─────┐─ ─┌─────┐─ ─┌─────┐─ │
+ │ │ │ │ │ │◀┼────────────────────────────────┘
+ └├─────┤─ ─├─────┤─ ─├─────┤─
+ │ │ │ │ │ │
+ ├─────┤ ├─────┤ ├─────┤
+ N-1 │ │ │ │ │ │
+ └─────┘ └─────┘ └─────┘
+ Customer State Orders
+ UInt64 Utf8 F64
+```
+
+
+
+The comparator function compares each row a column at a time, based on the
column types and sort options
+
+```text
+ ┌────────────────────────────────┐
+ │ │
+ ▼ │
+ ┌ ─ ─ ─ ┐ ┌ ─ ─ ─ ┐ │
+ │
+ ┌─────┐ │┌─────┐│ │┌─────┐│ │
+left_index │ │ │ │ │ │ │
+ └─────┘ │└─────┘│ │└─────┘│ Step 1: Compare values in State
+ (UInt64) column
+ │ │ │ │
+
+ │ │ │ │
+ ┌─────┐ ┌─────┐ ┌─────┐
+ right_index│ │ ││ ││ ││ ││
+ └─────┘ └─────┘ └─────┘ Step 2: If values State are equal,
+ │ │ │ │ then compare values in Orders (F64)
+ Customer State Orders │
+ UInt64 │ Utf8 │ │ F64 │ │
+ ─ ─ ─ ─ ─ ─ ─ ─ │
+ ▲ │
+ │ │
+ └───────────────────────┘
+```
+Pseudocode might be
+
+```python
+# Takes a list of columns and returns the lexicographically
+# sorted order as a list of indices
+def lexsort_to_indices(columns):
+ comparator = build_comparator(columns)
+
+ # Construct a list of integers from 0 to the number of rows
+ # and sort it according to the comparator
+ [0..columns.num_rows()].sort_by(comparator)
+
+# Build a function that given indexes (left_idx, right_idx)
+# returns the comparison of the sort keys at the left
+# and right indices respectively
+def build_comparator(columns):
+ def comparator(left_idx, right_idx):
+ for column in columns:
+ # call a compare function which performs
+ # dynamic dispatch on type of left and right columns
+ ordering = compare(column, left_idx,right_idx)
+ if ordering != Equal {
+ return ordering
+ }
+ # All values equal
+ Equal
+ # Return comparator function
+ comparator
+
+ # compares the values in a single column at left_idx and right_idx
+ def compare(column, left_idx, right_idx):
+ # Choose comparison based on type of column (“dynamic dispatch”)
+ if column.type == Int:
+ cmp(column[left_idx].as_int(), column[right_idx].as_int())
+ elif column.type == Float:
+ cmp(column[left_idx].as_float(), column[right_idx].as_float())
+ ...
+```
+
+Greater detail is beyond the scope of this post, but in general the more
predictable the behavior of a block of code, the better its performance will
be. In the case of this pseudocode, there is clear room for improvement:
+
+1. comparator performs a large number of unpredictable conditional branches,
where the path execution takes depends on the data values
+2. comparator and compare use dynamic dispatch, which not only adds further
conditional branches, but also function call overheads
+3. The comparator perform large numbers of reads of memory at unpredictable
locations
+
+You can find the complete implementation of multi-column comparator
construction in arrow-rs in
[sort.rs](https://github.com/apache/arrow-rs/blob/f629a2ebe08033e7b78585d82e98c50a4439e7a2/arrow/src/compute/kernels/sort.rs#L905-L1036)
and
[ord.rs](https://github.com/apache/arrow-rs/blob/f629a2e/arrow/src/array/ord.rs#L178-L313).
+
+
+# Normalized Keys / Byte Array Comparisons
+
+Now imagine we had a way to represent each logical row of data as a sequence
of bytes, and that byte-wise comparison of that sequence yielded the same
result as comparing the actual column values using the code above. Such a
representation would require no switching on column types, and the kernel would
become
+
+```python
+def lexsort_to_indices(columns):
+ rows = convert_to_rows(columns)
+ [0..columns.num_rows()].sort_by(lambda l, r: cmp(rows[l], rows[r]))
+```
+
+While this approach does require converting to/from the byte array
representation, it has some major advantages:
+
+* Rows can be compared by comparing bytes in memory, which modern computer
hardware excels at with the extremely well optimized
[memcmp](https://www.man7.org/linux/man-pages/man3/memcmp.3.html)
+* Memory accesses are largely predictable
+* There is no dynamic dispatch overhead
+* Easily extensible to more sophisticated sorting strategies
+* Distribution-based sorting techniques such as radix sort, Parallel merge
sort, External sort, etc
+
+You can find more information on how to leverage such representation in the
“Binary String Comparison” section of the [DuckDB blog
post](https://duckdb.org/2021/08/27/external-sorting.html) on the topic as well
as [Graefe’s paper](https://dl.acm.org/doi/10.1145/1132960.1132964). However,
we found it wasn’t immediately obvious how to apply this technique to variable
length string or dictionary encoded data, which we will explain in the next
part of this article.
+
+
+## Next up: Nested and Hierarchical Data
Review Comment:
🤦
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]