There is a family of interesting algorithms that sort lexicographically on strings of bits or bytes rather than by comparing items pairwise with some kind of customizable comparison function on individual items.
Optimal comparison sorts such as heapsort, quicksort, mergesort, library sort, and sorting with a red-black tree are often described as having O(N lg N) runtime, but this is only true if we assume individual comparisons take constant time. In fact, individual lexicographical comparisons can themselves take up to O(N) time, where N is the total size of the input in bits; consider comparing “aaaaaaaaaaaaaaaab” to “aaaaaaaaaaaaaaaac”. Consequently their average runtimes are slightly worse than O(N lg N), and their worst-case runtimes are much worse.

Radix sorting algorithms, by contrast, can operate in O(N) time! This sounds like an extraordinary claim, since they still must do at least O(lg M) bit comparisons per item if you have M unique items; but, the fact is, M unique items must total at least about M lg M bits, so the total size N of the input is Ω(M lg M). So they’re still O(M lg M) in the number of items, but O(N) in the number of bits in the input. (If you have nonunique input items, you can count the instances of each unique item in linear time while you sort the unique items.) There are similar advantages associated with radix trees (called “tries”, invented by Fredkin), in particular Patricia (called “crit-bit” by Bernstein).

The trouble with radix-sorting algorithms is that you have to encode your data in a form where lexicographical comparison results in the order that you want. For most kinds of data this is fairly easy, but not for all. There's an excellent article by [Guttman and Rosler][] about the practicalities of generating string keys to sort various types of data in Perl, so that you don't have to use a comparison function written in Perl. This is also applicable to the problem of sorting arbitrary data with radix sort.
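To make the linear-time claim concrete, here is a minimal MSD radix sort on byte strings in Python --- a hypothetical sketch, not from the original post. It does work proportional to the total number of bytes examined, using one bucket per byte value plus one for “string ended here”, which is why exhausted prefixes sort first:

```python
def radix_sort(strings, pos=0):
    """MSD radix sort on byte strings; work is linear in total input bytes."""
    if len(strings) <= 1:
        return list(strings)
    # 257 buckets: bucket 0 for strings that end at this position,
    # buckets 1..256 for each possible byte value.
    buckets = [[] for _ in range(257)]
    for s in strings:
        key = 0 if pos >= len(s) else s[pos] + 1
        buckets[key].append(s)
    result = list(buckets[0])        # shorter (exhausted) strings come first
    for bucket in buckets[1:]:
        result.extend(radix_sort(bucket, pos + 1))
    return result
```

For example, `radix_sort([b"banana", b"apple", b"app"])` yields the same order as a comparison sort would, without ever comparing two whole strings.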
[Guttman and Rosler]: http://www.sysarch.com/Perl/sort_paper.html (“A Fresh Look at Efficient Perl Sorting”, Uri Guttman and Larry Rosler, from I think 1999)

There's also the [`strxfrm` function in the standard C library][strxfrm], which generates such a string key specifically for locale-specific comparison of human-language strings (so that you sort accented characters into the places where they are traditionally sorted). P.J. Plauger wrote [an article in 1992 on how to implement it][Plauger] with a DSL to add new locales; this [seems to be similar to the way glibc does it][glibc] --- see e.g. the [Swedish locale collation definition][Swedish] that uClibc shares with glibc.

[strxfrm]: http://www.manpagez.com/man/3/strxfrm/
[Plauger]: http://drdobbs.com/tools/199902389?pgno=1 (“All sorts of sorts”, Computer Language magazine, January 1992, reprinted in Programming on Purpose III and the Dr Dobbs Journal blog)
[glibc]: http://repo.or.cz/w/glibc.git/blame/4b10dd6c1959577f57850ca427a94fe22b9f3299:/string/strxfrm.c
[Swedish]: http://cristi.indefero.net/p/uClibc-cristi/source/tree/3e8c046f98134be9c6ddd954141b6b13bf919199/extra/locale/collation/sv_SE

In particular, ratios --- rational numbers --- are fairly easy to define a custom comparison function on, but encoding them for lexicographical comparison seems difficult. [Jules Jacobs proposed][] using the position of a rational number in the Stern-Brocot tree. Unfortunately, his encoding is not self-delimiting, and it is unusably inefficient; for example, it encodes the integer N as a sequence of N 1 bits. In response to [Jacobs’ query on Math Overflow][], Gjergji Zaimi said, “It is easy to prove that for any countable linearly ordered set there is an order preserving injection to the rationals.
This can be proven by enumerating the base set and then specifying the values of the mapping by induction.” So, in a sense, solving the radix-sortable encoding problem for rational numbers solves the problem for all linearly-ordered countably infinite sets; but Zaimi’s construction, while constructive, is even less efficient than Jacobs’s construction. Nevertheless, Jacobs and Zaimi proved that it is always theoretically possible to sort elements of a countable linearly-ordered set with a radix sort. Now it only remains to discover whether it’s possible to do so efficiently.

[Jules Jacobs proposed]: http://www.reddit.com/r/programming/comments/g6a4s/the_american_flag_sort_an_efficient_inplace_radix/c1lbkjg (Reddit comment in or near March 2011 by user julesjacobs, beginning “Oh, I just thought of a method :)”)
[Jacobs’ query on Math Overflow]: http://mathoverflow.net/questions/58917/injections-to-binary-sequences-that-preserve-order (Math Overflow question on 2011-03-19, “Injections to binary sequences that preserve order”, by Jules)

The rest of this post describes an encoding of arbitrary-precision rational numbers that is self-delimiting, lexicographically ordered, and fairly compact. It is CPU-efficient enough that you could practically write programs with it, but it is probably two orders of magnitude less efficient as a numerical representation than raw binary.

Lexicographical comparison
--------------------------

Let <=> be the comparison operator, returning one of {"<", "=", ">"}. Given a definition for it on some atomic type, we extend it to operate on sequences of that type “lexicographically”, as follows. Let F = (A0 <=> B0). Then:

    ([] <=> [B0, ...]) = "<"
    ([A0, ...] <=> []) = ">"
    ([A0, A1, A2, ...] <=> [B0, B1, B2, ...])
        = ([A1, A2, ...] <=> [B1, B2, ...])  if F = "=";
        = F                                  otherwise.

A function f is “order-preserving” if (f(N) <=> f(M)) = (N <=> M) for all N, M in the domain of f.
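These rules transcribe directly into code. The following Python sketch (hypothetical, not from the original post; the names cmp_atom and cmp_lex are mine) extends a three-way comparison on atoms to sequences exactly as defined above:

```python
def cmp_atom(a, b):
    """<=> on an atomic type with a total order."""
    return "<" if a < b else ">" if a > b else "="

def cmp_lex(xs, ys, atom=cmp_atom):
    """Extend <=> to sequences lexicographically, per the rules above."""
    for a, b in zip(xs, ys):
        f = atom(a, b)
        if f != "=":
            return f          # first unequal pair decides
    # One sequence ran out: the shorter (a prefix) compares smaller.
    return cmp_atom(len(xs), len(ys))
```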
If f’s range consists of some kind of sequence, we’ll call f a “lexicographically sortable encoding”.

Self-delimiting
---------------

An encoding f is “self-delimiting” if no string it produces is a prefix of any other string it produces, i.e. there are no N, M, and i such that f(N) = f(M)[:i] and N != M. This means that you can concatenate other crap onto the end of it and still decode it, and then you know where the encoding ends and the other crap begins, which is handy if the other crap has useful data in it. In particular, this is useful in two practical cases:

1. Sequence encoding: if you have a sequence of items in the domain of f, you can produce an encoding of the sequence by concatenating the encodings of the items. Furthermore, this new encoding function is lexicographically sortable iff f is.

2. Packing into fields: modern computers can do rapid parallelized lexicographical comparison of 32-bit or 64-bit chunks, but to do that conveniently, the chunks need to be stored in memory aligned at least to byte boundaries, and probably to 32- or 64-bit boundaries. That means you need to append some crap to pad bitstrings out to fill entire bytes or 32-bit or 64-bit boundaries, and it’s desirable if that can’t cause comparisons to return the wrong answer or the encoding to become undecodable.

The definition above doesn’t imply that it’s necessarily computationally feasible, or even decidable, to find the end of the bitstring produced by f, but it is easy for all the encodings I consider here.

Encoding of whole numbers: E
----------------------------

I found a lovely lexicographically sortable, self-delimiting encoding of the whole or natural numbers 1, 2, 3, etc., as bitstrings in {0,1}* --- all of them, not just those less than some arbitrary limit such as 2**32. You can construct it as follows. First calculate b(N) = floor(lg N), where lg N is log N / log 2.
This b is one less than the number of bits in the binary encoding of N: b(1) = 0, b(2) = 1, b(3) = 1, b(4) = 2, etc. Let p(N) = "1" * b(N) || "0" --- b(N) 1 bits, followed by a 0 bit. Let e(N) be the last b(N) bits of the binary encoding of N; this leaves off the single 1 bit from the beginning of the binary encoding. Let E(N), the encoding of N, be p(N) || e(N). Encodings of the first few whole numbers follow, with spaces inserted for clarity:

    1 -> 0              9 -> 111 0 001
    2 -> 1 0 0         10 -> 111 0 010
    3 -> 1 0 1         11 -> 111 0 011
    4 -> 11 0 00       12 -> 111 0 100
    5 -> 11 0 01       13 -> 111 0 101
    6 -> 11 0 10       14 -> 111 0 110
    7 -> 11 0 11       15 -> 111 0 111
    8 -> 111 0 000     16 -> 1111 0 0000

If b(N) = b(M), then p(N) = p(M), so (E(N) <=> E(M)) = (e(N) <=> e(M)) = (N <=> M), which is what we wanted. If b(N) != b(M), assume WLOG that b(N) < b(M), so N < M. Both p(N) and p(M) begin with b(N) 1 bits, but then they differ: bit b(N) is a 0 in p(N) and a 1 in p(M). Therefore p(N) <=> p(M) gives us "<", as does E(N) <=> E(M), as does N <=> M.

You could argue that this encoding is not particularly efficient, since it uses 2b(N)+1 bits, while the binary encoding of the number would use only b(N)+1 bits. Surely, you might say, there must be a less voluminous way to delimit the numbers. However, all such questions of coding efficiency depend on the probability distribution assumed for the numbers to be encoded, and that distribution is necessarily nonuniform. I assert without proof that this encoding is nearly optimal for the case where the numbers are distributed such that 2**-M of them have M bits, for all natural numbers M: half of them are 1, a quarter of them are 2 or 3, an eighth of them are in the range 4-7, etc.

Because the encoding is self-delimiting, you can bitwise-concatenate a sequence of number representations, and lexicographical comparisons on the resulting bitstrings will map to lexicographical comparisons on the underlying sequences of numbers.
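In Python, E is only a few lines (a sketch of the definitions above, representing bitstrings as Python strings of "0" and "1" so that ordinary string comparison is the bitwise lexicographical comparison):

```python
def E(n):
    """Self-delimiting, sortable encoding of a whole number n >= 1."""
    assert n >= 1
    bits = bin(n)[2:]               # binary encoding; always starts with 1
    b = len(bits) - 1               # b(n) = floor(lg n)
    return "1" * b + "0" + bits[1:] # p(n) || e(n)
```

For instance, `E(9)` gives `"1110001"`, matching the table above, and sorting the strings `E(1)..E(k)` lexicographically leaves them in numeric order.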
Encoding of nonnegative numbers: EN
-----------------------------------

Add 1, then encode as above:

    EN(N) = E(N+1)

That’s lexicographically sortable because adding 1 is order-preserving. For example, EN(2) = E(3) = 101.

Reversing the lexicographical order of an encoding on bitstrings
----------------------------------------------------------------

Complement the bits. If the encoding was self-delimiting before, it’s still self-delimiting.

Encoding of integers: ES
------------------------

    ES(N) = "1" || EN(N)     if N >= 0;
            ~("1" || E(-N))  otherwise.

If negative, encode the absolute value as a whole number as above, prepend a 1 bit, and complement. E.g.:

    -1 -> 1 -> 0     -> 10     -> 01
    -2 -> 2 -> 100   -> 1100   -> 0011
    -3 -> 3 -> 101   -> 1101   -> 0010
    -4 -> 4 -> 11000 -> 111000 -> 000111

If nonnegative, encode as a nonnegative number as above and prepend a 1 bit. E.g.:

    0 -> 1 -> 0     -> 10
    1 -> 2 -> 100   -> 1100
    2 -> 3 -> 101   -> 1101
    3 -> 4 -> 11000 -> 111000

The leading 1 bit ensures that negative numbers precede nonnegative numbers. The bit complementation puts the negative numbers in reverse order, so that -1 is the greatest of all of them. So this encoding is lexicographically sortable.

Encoding of whole numbers or infinity: EI
-----------------------------------------

Infinity should compare greater than any whole number. That’s easy enough: encode the sequence (1, N) for whole numbers N --- 00, 0100, 0101, 011000, etc. --- and use 2 for infinity: 100. However, this uses up a whole bit, devoting essentially half of the code space to infinity. If infinity is not quite so probable, you can use a sort of bit-stuffing: first encode the whole number, and then insert a 0 bit into its encoding if it begins with too many 1 bits; then use a longer sequence of 1 bits to represent infinity. Let the number of 1 bits B = 2, and let F = E(N) if N is finite. Then:

    EI(N) = "1" * B || "0" || F[B:]  if N is finite and F starts with "1" * B;
            F                        otherwise, if N is finite;
            "1" * (B+1)              if N is infinite.
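Continuing the Python sketch (restating E so this block stands alone; the bit-complement helper `comp` plays the role of ~), EN, ES, and EI might look like this:

```python
INF = float("inf")
B = 2                             # stuffing threshold for EI

def E(n):                         # whole numbers 1, 2, 3, ...
    bits = bin(n)[2:]
    return "1" * (len(bits) - 1) + "0" + bits[1:]

def EN(n):                        # nonnegative integers: shift by one
    return E(n + 1)

def comp(s):                      # ~: complement bits to reverse the order
    return s.translate(str.maketrans("01", "10"))

def ES(n):                        # all integers
    return "1" + EN(n) if n >= 0 else comp("1" + E(-n))

def EI(n):                        # whole numbers or infinity
    if n == INF:
        return "1" * (B + 1)      # "111"
    f = E(n)
    # stuff a 0 bit after the first B bits if f begins with B ones
    return "1" * B + "0" + f[B:] if f.startswith("1" * B) else f
```

The tables above fall out directly: `ES(-4)` is `"000111"`, `EI(4)` is `"110000"`, and `EI(INF)` is `"111"`.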
This encoding is still lexicographically ordered and self-delimiting. It leaves untouched the encodings of the numbers 1, 2, and 3, while adding a single bit to higher numbers, and it encodes infinity in three bits. Under our earlier assumption about the distribution of numbers to be encoded, one-quarter of the numbers to encode are greater than 3, so this encoding adds a quarter of a bit per number; so it is optimal when infinity occurs with probability 1/8. You can adjust B to a greater or lower value if infinity is more or less probable.

Encoding of continued fractions: EF
-----------------------------------

A regular continued fraction consists of a sequence of quotients [Q0; Q1, Q2, ...], representing the number

                1
    Q0 + ------------
                  1
         Q1 + -------
              Q2 + ...

A rational number has a finite sequence of such quotients. For example, 53/24 is [2; 4, 1, 4], i.e.

               1
    2 + -----------
                1
        4 + -------
                  1
            1 + ---
                 4

You can see from this representation that [3; ...] is a larger number than [2; 4, 1, 4], but [2; 5, ...] is smaller, while [2; 4, 2, ...] is a larger number again. In general, you can compare continued fractions by comparing their strings of quotients lexicographically, but reversing the sense of the inequality if the first unequal quotient is at an odd offset.

You can eliminate the case where lexicographical comparison runs out of quotients in one continued fraction or the other by appending a terminating infinity to the list, yielding e.g. [2, 4, 1, 4, infinity]. This also makes the encoding self-delimiting, since you can’t have an infinity in the middle; nothing after it would matter. So you can encode a sequence of continued fractions by concatenating their individual representations; e.g. [53, infinity, 24, infinity, 2, 4, 1, 4, infinity, 0, 2, infinity] represents [53, 24, 53/24, 1/2]. In mathematical terms it is somewhat dubious to treat infinity like a number, but it is okay for our purposes here.
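Extracting the quotient sequence from a rational is the standard continued-fraction algorithm: take the floor, then recurse on the reciprocal of the remainder. A small Python helper (a hypothetical sketch using the standard-library Fraction type) illustrates it:

```python
from fractions import Fraction

def quotients(q):
    """Regular continued-fraction quotients, e.g. 53/24 -> [2, 4, 1, 4]."""
    q = Fraction(q)
    result = []
    while True:
        a = q.numerator // q.denominator   # floor (also for negative q)
        result.append(a)
        frac = q - a
        if frac == 0:
            return result
        q = 1 / frac                       # recurse on reciprocal remainder
```

Note that with floored division only the first quotient can be negative or zero; the rest are positive, which is exactly the property the encoding below exploits.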
So we can construct a lexicographically sortable encoding of rational numbers as sequences of integers by negating the odd-offset quotients. So we encode 53/24 as [2, -4, 1, -4, infinity]. This can give rise to negative infinities as well. Unfortunately this gives us an encoding on sequences of integers and positive and negative infinities, but we want an encoding as bitstrings. We could hack a negative infinity onto the signed integer encoding ES, but this is somewhat wasteful: all the quotients except the first one are positive until we invert their signs, so the sign bits are redundant.

Instead, let’s encode the first quotient, which must be finite but could be zero or negative, using ES, and then encode the others --- which are whole numbers or infinity --- with EI, giving us a sequence of bitstrings:

    [2, 4, 1, 4, infinity] -> ["1101", "110000", "0", "110000", "111"]

and then complement the odd-index items in the sequence to reverse the direction of their comparison:

    ["1101", "001111", "0", "001111", "111"]

and finally concatenate into a bitstring, shown here with spaces for clarity:

    1101 0011 1100 0111 1111

This encodes 53/24 in 20 bits, which is probably about as compact as you can expect an arbitrary-precision rational-number format to be for a number like 53/24, which contains 11 bits of precision already. As another example, -53/24 is [-3; 1, 3, 1, 4], which encodes as

    0010 1 101 1 110000 000

which is 18 bits. In short:

    EF([A0, ...])  = ES(A0) || EF1([...])
    EF1([])        = ~EI(infinity)
    EF1([A1, ...]) = ~(EI(A1) || EF1([...]))

Collected definitions without explanation
-----------------------------------------

    EF([A0, ...])  = ES(A0) || EF1([...])
    EF1([])        = ~EI(infinity)
    EF1([A1, ...]) = ~(EI(A1) || EF1([...]))

    EI(N) = "1" * B || "0" || F[B:]  if N is finite and F starts with "1" * B;
            F                        otherwise, if N is finite;
            "1" * (B+1)              if N is infinite,
        where F = E(N) if N is finite, and B = 2.
    ES(N) = "1" || EN(N)     if N >= 0;
            ~("1" || E(-N))  otherwise.

    EN(N) = E(N+1)

    E(N) = p(N) || e(N)
    p(N) = "1" * b(N) || "0"
    e(N) = the last b(N) bits of the binary encoding of N
    b(N) = floor(lg N), where lg N is log N / log 2

Some properties of this encoding
--------------------------------

Is the encoding optimal, or at least close to optimal? Well, that depends on the probability distribution of continued-fraction quotients, which in turn depends in a complicated way on the probability distribution of the rational numbers that are being encoded as continued fractions. Absent a probability distribution over the rational numbers, the best we can do is use empirical tests and look for redundancy. Empirically it seems to do pretty well on the numbers I’ve tried it on, and I don’t see how it could blow up too badly. At worst, e.g. for large integers, it’s slightly over a factor of 2 worse than a specialized encoding for numbers like that.

There *is* a little bit of waste in the last quotient before the infinity, which cannot be a 1. You would think that maybe you could subtract 1 from it before encoding, but that wrecks comparisons between e.g. [0; 2] (one half) and [0; 2, 2] (two fifths).

If you have the liberty to produce bits of the encoding lazily, you can use it to represent irrational numbers as well as rational ones. However, in that case, equality becomes formally undecidable.
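Putting all the pieces together, here is a self-contained (hypothetical) Python sketch of the whole pipeline, from Fraction to bitstring; the names follow the definitions above, and the worked examples for 53/24 and -53/24 come out exactly as shown:

```python
from fractions import Fraction

INF = float("inf")
B = 2

def E(n):                          # whole numbers 1, 2, 3, ...
    bits = bin(n)[2:]
    return "1" * (len(bits) - 1) + "0" + bits[1:]

def EN(n): return E(n + 1)         # nonnegative integers

def comp(s):                       # ~: bitwise complement
    return s.translate(str.maketrans("01", "10"))

def ES(n):                         # all integers
    return "1" + EN(n) if n >= 0 else comp("1" + E(-n))

def EI(n):                         # whole numbers or infinity
    if n == INF:
        return "1" * (B + 1)
    f = E(n)
    return "1" * B + "0" + f[B:] if f.startswith("1" * B) else f

def EF1(qs):                       # tail quotients, alternating complement
    if not qs:
        return comp(EI(INF))       # terminating infinity
    return comp(EI(qs[0]) + EF1(qs[1:]))

def EF(qs):                        # full quotient list, first one signed
    return ES(qs[0]) + EF1(qs[1:])

def quotients(q):                  # continued-fraction quotients
    q, result = Fraction(q), []
    while True:
        a = q.numerator // q.denominator
        result.append(a)
        if q == a:
            return result
        q = 1 / (q - a)

def encode(q):
    return EF(quotients(q))
```

Sorting a list of Fractions by `encode` as a string key gives the same order as sorting them numerically, which is the whole point of the construction.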