Re: The Sort Problem

Uri Guttman Sun, 15 Feb 2004 16:30:17 -0800

>>>>> "DC" == Damian Conway <[EMAIL PROTECTED]> writes:


  DC> Suppose C<sort>'s signature is:

  DC>      type KeyExtractor ::= Code(Any) returns Any;

  DC>      type Comparator   ::= Code(Any, Any) returns Int;

  DC>      type Criterion    ::= KeyExtractor
  DC>                          | Comparator
  DC>                          | Pair(KeyExtractor, Comparator)
  DC>                          ;

  DC>      type Criteria     ::= Criterion
  DC>                          | Array of Criterion
  DC>                          ;

  DC>      multi sub *sort(Criteria ?$by = {$^a cmp $^b}, [EMAIL PROTECTED]) {...}

  DC> Each of those sort criteria may be a block/closure, or a pair of
  DC> block/closures. Each block/closure may take either one argument (in
  DC> which case it's a key extractor) or two arguments (in which case it's
  DC> a comparator).

  DC> If a key-extractor block returns number, then C<< <=> >> is used to
  DC> compare those keys. Otherwise C<cmp> is used. In either case, the
  DC> returned keys are cached to optimize subsequent comparisons against
  DC> the same element.

i would make cmp the default as it is now. you sort strings much more
often than you sort numbers.

  DC> If a key-extractor/comparator pair is specified, the key-extractor is
  DC> the key of the pair and the comparator the value. The extractor is
  DC> used to retreive (and cache) keys, which are then passed to the
  DC> comparator.

  DC> Which means that (a slightly extended version of) Uri's proposed:

  DC>     @out = sort
  DC>        key( %lookup{ .{remotekey} } ),                                  #1
  DC>        key:insensitive:descending:locale( 'somewhere' )( .{priority} ), #2
  DC>        key:float ( substr( 0, 10 ) ),                                   #3
  DC>        key:integer ( /foo(\d+)bar/ ),                                   #4
  DC>        key:compare( { ^$a <=> ^$b } )( /(\d+)$/ ),                      #5
  DC>        key:compare( \&my_compare_sub ) ( /(\d+)$/ ),                    #6
  DC>        key:compare( { just_guess $^b, $^a } ),                          #7
  DC>                  @in;

  DC> would become:
                        
  DC>     @out = sort
  DC>        [ { ~ %lookup{ .{remotekey} } },                                 #1

if string cmp is the default, wouldn't that ~ be redundant?
and i keep forgetting that ~ is now the stringify operator :)

  DC>          { use locale 'somewhere'; lc .{priority} } is descending,      #2

hmmm, this needs more work IMO. requiring the coder to lc the key is
moving the case insensitivity feature back into code. wouldn't 'is
insensitive' be ok?

how does the internal guts get the descending/insensitive flags/traits
passed to it? i know it is an internals problem but i am just pondering
it.

  DC>          { + substr( 0, 10 ) },                                         #3
  DC>          { int /foo(\d+)bar/ },                                         #4

i would also expect int to be a default over float as it will be used
more often. + is needed there since the regex returns a string. in the
#3 case that would be an int as well. so we need a 'float' cast
thingy. BTW, the only way to get a number as a key is from a structure
where the field was assigned as a number/int. that may not happen a lot
so the int/float cast will probably be needed here to sort correctly


  DC>          { + m/(\d+)$/.[1] },                                           #5
  DC>          { /(\d+)$/ } => &my_compare_sub,                               #6

missing a close } it seems.

  DC>          { just_guess $^b, $^a },                                       #7

is that a reverse order sort? why not skip the args and do this:

               { &just_guess is descending },                                 #7

  DC>          ],

so the first arg to sort is either a single compare block or a anon list
of them? i figure we need the [] to separate the criteria from the input
data. but what about this odd case,

        sort [...], [...], [...]

now that is stupid code but it could be trying to sort the refs by their
address in string mode. or it could be a sort criteria list followed by
2 refs to input records.

  DC>                  @in;

  DC> So to specify a key extractor we provide a block with one argument, as
  DC> in examples #1 through #5. Then to specify that the key is compared
  DC> with C<cmp> we return a string (using unary C<~> to be explicit about
  DC> it) as in #1. To specify that the key is compared with C<< <=> >> we
  DC> return a number (using unary C<+> to be explicit about it) as in #3
  DC> and #5. If we think that C<sort> might be able to optimize integer
  DC> comparisons, we can explicitly return an integer, as in #4. If we want
  DC> locale-sensitive sorting, we specify that with the appropriate C<use
  locale> statement, as in #2.

  DC> To specify a comparator, we provide a block with two arguments, as in
  DC> #7. That block is always expected to return an integer.

so #7 is a call to just_guess which is passed the 2 args to compare. it
must return an int like cmp/<=>. as i pointed out above, i don't see why
you even need to show the ^$a and ^$b args? they will be passed into
just_guess that way. let is descending handle the sort ordering.

  DC> To specify a key-extractor *and* an associated comparator, we bind
  DC> them together in a Pair, as in #6.

that works for me.

  DC> The only tricky bits are how to specify a key extractor for keys that
  DC> are to be sorted in descending order and/or case-insensitively. The
  DC> case insensitivity could be handled by simple C<lc>'ing, as in
  DC> #2. Descending sort order could be handled by a trait on the
  DC> block/closure, as in #2. Alternatively, case-insensitivity could be
  DC> specified by trait as well.

ok, i like a trait better than coding in lc. better semantics and
clearer to read. the is descending trait could just cause the sort
callbacks to be called with reversed arguments and so the callback code
never has to worry about that.

  DC> Note that simple sorts would be unchanged:

  DC>      @sorted = sort @unsorted;
  DC>      @sorted = sort {$^b <=> $^a} @unsorted;

  DC> But now one could also very neatly code cases that formerly required
  DC> an Orcish or Transformed sort. For example, these:

  DC>      @sorted = sort {(%M{$^a}//-M $^a) <=> (%M{$^b}//-M $^b)} @unsorted;

wow, that is UGLY! but i get it after a few hours of study. :) just the
orcish maneuver but with //. i think you also mean //= there. 

  DC>      @sorted = map $_[1], sort {$^a[0] <=> $^b[0]}, map [-M,$_],
  DC>      @unsorted;

the familiar ST.

  DC> would both become:

  DC>      @sorted = sort {-M} @unsorted;

that still wants to be cached somehow as -M is expensive. but that is an
internal thing and the ST/GRT both can handle it. so -M there is a
simple key extraction on the files in @unsorted. assuming no internal
caching would this be correct?

        @sorted = sort {%M{$_} //= -M} @unsorted;

i assume //= will be optimized and -M won't be called if it is cached.

also where does %M get declared and/or cleared before this? can it be
done in the block:

        @sorted = sort {my %M ; %M{$_} //= -M} @unsorted;

another -M problem is that it currently returns a float so that must be
marked/cast as a float.

      @sorted = sort {float -M} @unsorted;

that cast causes a <=> compare (since i would rather see cmp be the
default) and also can let the internal GRT properly merge a float key
instead of an int. it should work without it (not sure how) but the
float marker is important to optimization and a proper comparison.
maybe the fact that the compiler knows -M returns a float can be used to
mark it internally and the explicit float isn't needed here. but data
from a user record will need to be marked as float as the compiler can't
tell.

anyhow, i am glad i invoked your name and summoned you into this
thread. :)

uri

-- 
Uri Guttman  ------  [EMAIL PROTECTED]  -------- http://www.stemsystems.com
--Perl Consulting, Stem Development, Systems Architecture, Design and Coding-
Search or Offer Perl Jobs  ----------------------------  http://jobs.perl.org

Re: The Sort Problem

Reply via email to