Resorting to Sorting (doc)

Jeff 'japhy' Pinyan Tue, 16 Jul 2002 09:59:19 -0700

URL: http://www.pobox.com/~japhy/docs/sorting.html


NAME
    Resorting to Sorting

SYNOPSIS
    A guide to using Perl's `sort()' function to sort data in numerous ways.
    Topics covered include the Orcish maneuver, the Schwartzian Transform,
    the Guttman-Rosler Transform, radix sort, and sort-by-index.

DESCRIPTION
    Sorting data is a common procedure in programming -- there are efficient
    and inefficient ways to do this. Luckily, in Perl, the `sort()' function
    does the dirty work; Perl's sorting is handled internally by a
    combination of merge-sort and quick-sort. However, sorting is done, by
    default, on strings. In order to change the way this is done, you can
    supply a piece of code to the `sort()' function that describes the
    machinations to take place.

    We'll examine all differents sorts of sorts; some have been named after
    programmers you may have heard of, and some have more descriptive names.

CONTENT
  Table of Contents

    1 Naīve Sorting
        Poor practices that cause Perl to do a lot more work than necessary.

    2 The Orcish Maneuver
        Joseph Hall's implementation of "memoization" in sorting.

    3 Radix Sort
        A multiple-pass method of sorting; the time it takes to run is
        linearly proportional to the size of the largest element.

    4 Sorting by Index
        When multiple arrays must be sorted in parallel, save yourself
        trouble and sort the indices.

    5 Schwartzian Transforms
        Wrapping a `sort()' in between two `map()'s -- one to set up a data
        structure, and the other to extract the original information -- is a
        nifty way of sorting data quickly, when expensive function calls
        need to be kept to a minimum.

    6 Guttman-Rosler Transforms
        It's far simpler to let `sort()' sort as it will, and to format your
        data as something meaningful to the string comparisons `sort()'
        makes.

    7 Portability
        By giving sorting functions a prototype, you can make sure they work
        from anywhere!

  Naīve Sorting

    Ordinarily, it's not a difficult task to sort things. You merely pass
    the list to `sort()', and out comes a sorted list. Perl defaults to
    using a string comparison, offered by the `cmp' operator. This operator
    compares two scalars in ASCIIbetical order -- that means "1" comes
    before "A", which comes before "^", which comes before "a". For a
    detailed list of the order, see your nearest *ascii*(1) man page.

    To sort numerically, you need to supply `sort()' that uses the numerical
    comparison operator (dubbed the "spaceship" operator), `<=>':

      @sorted = sort { $a <=> $b } @numbers;  # ascending order
      @sorted = sort { $b <=> $a } @numbers;  # descending order

    There are two special variables used in sorting -- `$a' and `$b'. These
    represent the two elements being compared at the moment. The sorting
    routine can take a block (or a function name) to use in deciding which
    order the list is to be sorted in. The block or function should return -
    1 if `$a' is to come before `$b', 0 if they are the same (or, more
    correctly, if their position in the sorted list could be the same), and
    1 if `$a' is to come after `$b'.

    Sorting, by default, is like:

      @sorted = sort { $a cmp $b } @unsorted;

    That is, ascending ASCIIbetical sorting. You can leave out the block in
    that case:

      @sorted = sort @unsorted;

    Now, if we had a list of strings, and we wanted to sort them, *in a
    case-insensitive manner*. That means, we want to treat the strings as if
    they were all lower-case or upper-case. We could do something like:

      @sorted = sort { lc($a) cmp lc($b) } @unsorted;
      # or
      @sorted = sort { uc($a) cmp uc($b) } @unsorted;

        Note: There is a difference between these two sortings. There are some
        punctuation characters that come after upper-case letters and before
        lower-case characters. Thus, strings that *start* with such
        characters would be placed differently in the sorted list, depending
        on whether we use `lc()' or `uc()'.

    Now, this method of sorting is fine for small lists, but the `lc()' (or
    `uc()') function is called twice for *each* comparison. This might not
    seem bad, but think about the consequences of performing massive
    calculations on your data:

      sub age_or_name {
        my ($name_a, $age_a) = split /_/ => $a;
        my ($name_b, $age_b) = split /_/ => $b;
        return ($age_a <=> $age_b or $name_a cmp $name_b);
      }

      @people = qw( Jeff_19 Jon_14 Ray_18 Tim_14 Joan_20 Greg_19 );
      @sorted = sort age_or_name @people;
      # @sorted is now
      #   qw( Jon_14 Tim_14 Ray_18 Greg_19 Jeff_19 Joan_20 )

    This gets to be tedious. There's obviously too much work being done. We
    should only have to split the strings once.

    Exercises

    1   Create a sorting subroutine to sort by the length of a string, or, if
        needed, by its first five characters.

          @sorted = sort { ... } @strings;

    2   Sort the following data structure by the value of the key specified by
        the "cmp" key:

          @nodes = (
            { id => 17, size => 300, keys => 2, cmp => 'keys' },
            { id => 14, size => 104, keys => 9, cmp => 'size' },
            { id => 31, size => 2045, keys => 43, cmp => 'keys' },
            { id => 28, size => 6, keys => 0, cmp => 'id' },
          );

  The Orcish Maneuver

    This method of speeding up sorting comparisons was named by Joseph Hall.
    It uses a hash to cache values of the complex calculations you need to
    make:

      {
        my %cache;  # cache hash is only seen by this function

        sub age_or_name {
          my $data_a =
            ($cache{$a} ||= [ split /_/ => $a ]);
          my $data_b =
            ($cache{$b} ||= [ split /_/ => $b ]);
          return (
            $data_a->[1] <=> $data_b->[1]
            or
            $data_a->[0] <=> $data_b->[0]
          );
        }
      }

      @people = qw( Jeff_19 Jon_14 Ray_18 Tim_14 Joan_20 Greg_19 );
      @sorted = sort age_or_name @people;

    This procedure here uses a hash of array references to store the name
    and age for each person. The first time a string is used in the sorting
    subroutine, it doesn't have an entry in the `%cache' hash, so the `||'
    part is used.

    That is where this gets its name -- it is the OR-cache manuever, which
    can be lovingly pronounced "orcish".

    The main structure of Orcish sorting is:

      {
        my %cache;

        sub function {
          my $data_a = ($cache{$a} ||= mangle($a));
          my $data_b = ($cache{$b} ||= mangle($b));
          # compare as needed
        }
      }

    where `mangle()' is some function that does the necessary calculations
    on the data.

    Exercises

    1   Why should you make the caching hash viewable only by the sorting
        function? And how is this accomplished?

    2   Use the Orcish Manuever to sort a list of strings in the same way as
        described in the first exercise from "Naīve Sorting".

  Radix Sort

    If you have a set of strings of constant width (or that can easily be
    made in constant width), you can employ *radix sort*. This method gets
    around calling Perl's `sort()' function altogether.

    The concept of radix sort is as follows. Assume you have *N* strings of
    *k* characters in length, and each character can have one of *x* values
    (for ASCII, *x* is 256). We then create *x* "buckets", and each bucket
    can hold at most *N* strings.

    Here is a sample list of data for *N* = 7, *k* = 4, and *x* = 256: john,
    bear, roar, boat, vain, vane, zany.

    We then proceed to place each string into the bucket corresponding to
    the ASCII value of its rightmost character. If we were then to print the
    contents of the buckets after this first placement, our sample list
    would look like: vane, john vain bear roar boat zany.

    Then, we use the character immediately to the left of the one just used,
    and put the strings in the buckets accordingly. This is done in the
    order in which they are found in the buckets. The new list is: bear,
    roar, boat, john, vain, vane, zany.

    On the next round, the list becomes: vain, vane, zany, bear, roar, boat,
    john.

    On the final round, the list is: bear, boat, john, roar, vain, vane,
    zany.

    This amount of time this sorting takes is constant, and easily
    calculated. If we assume that all the data is the same length, then we
    take *N* strings, and multiply that by *k* characters. The algorithm
    also uses some extra space for storing the strings -- it needs an extra
    *Nk* bytes. If the data needs to be padded, there is some extra time
    involved (if a character is undefined, it is set as a NUL ("\0")).

    Here is a radix implementation. It returns the list it is given in
    ASCIIbetical order, like `sort @list' would.

      # sorts in-place (meaning @list gets changed)
      # set $unknown to true to indicate variable length
      radix_sort(\@list, $unknown);

      sub radix_sort {
        my ($data, $k) = @_;
        $k = !!$k;  # turn any true value into 1

        if ($k) { $k < length and $k = length for @$data }
        else { $k = length $data->[0] }

        while ($k--) {
          my @buckets;
          for (@$data) {
            my $c = substr $_, $k, 1;  # get char
            $c = "\0" if not defined $c;
            push @{ $buckets[ord($c)] }, $_;
          }

          @$data = map @$_, @buckets;  # expand array refs
        }
      }

    You'll notice the first argument to this function is an array
    *reference*. By doing this, we save copying a potentially large list,
    thus taking up less space, and running faster. If, for beauty reasons,
    you'd prefer not to backslash your array, you could use prototypes:

      sub radix_sort (\@;$);

      radix_sort @list, $unknown;

      sub radix_sort (\@;$) {
        # ...
      }

    You could combine the declaration and the definition of the function,
    but the prototype must be seen before the function call.

    Exercises

    1   Why does radix sort start with the right-most character in a string?

    2   Does the order of the elements in the input list effect the run-time of
        this sorting algorithm? What happens if the elements are already
        sorted? Or in the reverse sorted order?

  Sorting by Index

    Given the choice between sorting three lists and sorting one list, you'd
    choose sorting one list, right? Good. This, then, is the strategy
    employed when you sort by *index*. If you have three arrays that hold
    different information, yet for a given index, the elements are all
    related -- we say these arrays hold data in *parallel* -- then it seems
    far too much work to sort all three arrays.

      @names  = qw( Jeff Jon Ray Tim Joan Greg );
      @ages   = qw( 19   14  18  14  20   19   );
      @gender = qw( m    m   m   m   f    m    );

    Here, all the data at index 3 ("Tim", 14, "m") is related, as it is for
    any other index. Now, if we wanted to sort these arrays so that this
    relationship stood, but the lists were sorted in terms of age and then
    by name, then we would like our data to look like:

      @names  = qw( Jon Tim Ray Greg Jeff Joan );
      @ages   = qw( 14  14  18  19   19   20   );
      @gender = qw( m   m   m   m    m    f    );

    But to actually sort *these* lists requires 3 times the effort. Instead,
    we will sort the *indices* of the arrays (from 0 to 5). This is the
    function we will use:

      sub age_or_name {
        return (
          $ages[$a] <=> $ages[$b]
          or
          $names[$a] cmp $names[$b]
        )
      }

    And here it is in action:

      @idx = sort age_or_name 0 .. $#ages;
      print "@ages\n";        # 19 14 18 14 20 19
      print "@idx\n";         #  1  3  2  5  0  4
      print "@ages[@idx]\n";  # 14 14 18 19 19 20

    As you can see, the array isn't touched, but the indices are given in
    such an order that fetching the elements of the array in that order
    yields sorted data.

        Note: the `$#ages' variable is related to the `@ages' array -- it holds
        the highest index used in the array, so for an array of 6 elements,
        `$#array' is 5.

  Schwartzian Transforms

    A common (and rather popular) idiom in Perl programming is the
    Schwartzian Transform, an approach which is like "you set 'em up, I'll
    knock 'em down!" It uses the `map()' function to transform the incoming
    data into a list of simple data structures. This way, the machinations
    done to the data set are only done once (as in the Orcish Manuever).

    The general appearance of the transform is like so:

      @sorted =
        map { get_original_data($_) }
        sort { ... }
        map { transform_data($_) }
        @original;

    They are to be read in reverse order, since the first thing done is the
    `map()' that transforms the data, then the sorting, and then the `map()'
    to get the original data back.

    Let's say you had lines of a password file that were formatted as:

      username:password:shell:name:dir

    and you wanted to sort first by shell, then by name, and then by
    username. A Schwartzian Transform could be used like this:

      @sorted =
        map { $_->[0] }
        sort {
          $a->[3] cmp $b->[3]
          or
          $a->[4] cmp $b->[4]
          or
          $a->[1] cmp $b->[1]
        }
        map { [ $_, split /:/ ] }
        @entries;

    We'll break this down into the individual parts.

    Step 1. Transform your data.
        We create a list of array references; each reference holds the
        original record, and then each of the fields (as separated by
        colons).

          @transformed = map { [ $_, split /:/ ] } @entries;

        That could be written in a `for'-loop, but understanding `map()' is
        a powerful tool in Perl.

          for (@entries) {
            push @transformed, [ $_, split /:/ ];
          }

    Step 2. Sort your data.
        Now, we sort on the needed fields. Since the first element of our
        references is the original string, the username is element 1, the
        name is element 4, and the shell is element 3.

          @transformed = sort {
            $a->[3] cmp $b->[3]
            or
            $a->[4] cmp $b->[4]
            or
            $a->[1] cmp $b->[1]
          } @transformed;

    Step 3. Restore your original data.
        Finally, get the original data back from the structure:

          @sorted = map { $_->[0] } @transformed;

    And that's all there is to it. It may look like a daunting structure,
    but it is really just three Perl statements strung together.

  Guttman-Rosler Transforms

    Perl's regular sorting is very fast. It's optimized. So it'd be nice to
    be able to use it whenever possible. That is the foundation of the
    Guttman-Rosler Transform, called the GRT, for short.

    The frame of a GRT is:

      @sorted =
        map { restore($_) }
        sort
        map { normalize($_) }
        @original;

    An interesting application of the GRT is to sort strings in a case-
    insensitive manner. First, we have to find the longest run of `NUL's in
    all the strings (for a reason you'll soon see).

      my $nulls = 0;

      # find length of longest run of NULs
      for (@original) {
        for (/(\0+)/g) {
          $nulls = length($1) if length($1) > $nulls;
        }
      }

      $NUL = "\0" x ++$nulls;

    Now, we have a string of nulls, whose length is one greater than the
    largest run of nulls in the strings. This will allow us to safely
    separate the lower-case version of the strings from the original
    strings:

      # "\L...\E" is like lc(...)
      @normalized = map { "\L$_\E$NUL$_" } @original;

    Now, we can just send this to sort.

      @sorted = sort @normalized;

    And then to get back the data we had before, we split on the nulls:

      @sorted = map { (split /$NUL/)[1] } @original;

    Putting it all together, we have:

      # implement our for loop from above
      # as a function
      $NUL = get_nulls(\@original);

      @sorted =
        map { (split /$NUL/)[1] }
        sort
        map { "\L$_\E$NUL$_" }
        @original;

    The reason we use the `NUL' character is because it has an ASCII value
    of 0, so it's always less than or equal to any *other* character.
    Another way to approach this is to pad the string with nulls so they all
    become the same length:

      # see Exercise 1 for this function
      $maxlen = maxlen(\@original);

      @sorted =
        map { substr($_, $maxlen) }
        sort
        map { lc($_) . ("\0" x ($maxlen - length)) . $_ }
        @original;

    Common functions used in a GRT are `pack()', `unpack()', and `substr()'.
    The goal of a GRT is to make your data presentable as a string that will
    work in a regular comparison.

    Exercises

    1   Write the `maxlen()' function for the previous chunk of code.

  Portability

    You can make a function to be used by `sort()' to avoid writing
    potentially messy sorting code inline. For example, our Schwartzian
    Transform:

      @sorted =
        map { $_->[0] }
        sort {
          $a->[3] cmp $b->[3]
          or
          $a->[4] cmp $b->[4]
          or
          $a->[1] cmp $b->[1]
        }
        map { [ $_, split /:/ ] }
        @entries;

    could be rewritten like so:

      @sorted =
        map { $_->[0] }
        sort passwd_cmp
        map { [ $_, split /:/ ] }
        @entries;

      sub passwd_cmp {
        $a->[3] cmp $b->[3]
        or
        $a->[4] cmp $b->[4]
        or
        $a->[1] cmp $b->[1]
      }

    However, if you want to declare that function in one package, and use it
    in another, you run into problems.

      #!/usr/bin/perl -w

      package Sorting;

      sub passwd_cmp {
        $a->[3] cmp $b->[3]
        or
        $a->[4] cmp $b->[4]
        or
        $a->[1] cmp $b->[1]
      }

      sub case_insensitive_cmp {
        lc($a) cmp lc($b)
      }

      package main;

      @strings = sort Sorting::case_insensitive_cmp
        qw( this Mine yours Those THESE nevER );

      print "<@strings>\n";

      __END__
      <this Mine yours Those THESE nevER>

    This code doesn't change the order of the strings. The reason is because
    `$a' and `$b' in the sorting subroutine belong to `Sorting::', but the
    `$a' and `$b' that `sort()' is making belong to `main::'.

    To get around this, you can give the function a prototype, and then it
    will be passed the two arguments.

      #!/usr/bin/perl -w

      package Sorting;

      sub passwd_cmp ($$) {
        local ($a, $b) = @_;
        $a->[3] cmp $b->[3]
        or
        $a->[4] cmp $b->[4]
        or
        $a->[1] cmp $b->[1]
      }

      sub case_insensitive_cmp ($$) {
        local ($a, $b) = @_;
        lc($a) cmp lc($b)
      }

      package main;

      @strings = sort Sorting::case_insensitive_cmp
        qw( this Mine yours Those THESE nevER );

      print "<@strings>\n";

      __END__
      <Mine nevER THESE this Those yours>

AUTHOR
    Jeff `japhy' Pinyan, [EMAIL PROTECTED]

    http://www.pobox.com/~japhy/


--
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Resorting to Sorting (doc)

Reply via email to