Re: summarizing a list of numbers?

Jeff 'japhy' Pinyan Mon, 17 Dec 2001 07:16:32 -0800

On Dec 17, Kredler Stefan said:

>I want to transform a sorted list into a compact list (if this is the
>correct term). 
>e.g. my list is 
>
>9,11,12,13,14,23,25,26,27,50   and want to have something like
>9,11-14,23,25-27,50            (to pass on to a unix-command).


I love this. :)

There are two approaches.  One of them is the computer science way, and
the other is the diehard-regex-hacker way.  I'll show you both (the regex
way requires Perl 5.6 or later, by the way).

Both methods use the same logic, but the regex way is far more compact.

The computer science way is to split the string into a list of numbers,
and then go through the list one number at a time.  Keep track of the
first number of the potential "range", as well as the last number
seen.  If the current number is one more than the last number, then keep
going.  Once you get a number that is NOT one more than the previous, you
generate a range.  Here's the code:

  # input "N1,N2,N3,N4..."
  # assumes the numbers are sorted!!!

  sub list2range {
    my $list = shift;
    my ($first, $last);
    my @output;

    # remove the first number in the list...
    # and set $first and $last to it
    $list =~ s/^(\d+),?// and $first = $last = $1;

    for (split /,/ => $list) {
      # next number in the range
      if ($_ == $last + 1) {
        $last = $_;
        next;  # get the next number
      }

      # otherwise, we're done with the current range
      # don't use "3-3", just use "3"
      if ($first == $last) {
        push @output, $first;
      }
      else {
        push @output, "$first-$last";
      }

      # now $_ is the first number in a new range
      $first = $last = $_;
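    }

    # the loop only emits a range when the NEXT number breaks it,
    # so the final range still has to be flushed here
    if (defined $first) {
      push @output, $first == $last ? $first : "$first-$last";
    }

    return join "," => @output;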
  }
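
If you want to try the loop logic on its own, here is a minimal standalone sketch. It takes a list of numbers rather than a comma-separated string, and the name ranges_from_sorted and the undef sentinel are my own choices for illustration, not anything from the original question. It still assumes the input is sorted.

```perl
use strict;
use warnings;

# Hypothetical helper: collapse a sorted list of integers into "a-b" ranges.
sub ranges_from_sorted {
  my @nums = @_;
  return "" unless @nums;

  my $first = shift @nums;   # start of the current range
  my $last  = $first;        # last number seen in the current range
  my @output;

  # The trailing undef is a sentinel: it never continues a range,
  # so it forces the final range to be pushed.
  for my $n (@nums, undef) {
    if (defined $n and $n == $last + 1) {
      $last = $n;
      next;
    }
    push @output, $first == $last ? $first : "$first-$last";
    $first = $last = $n;
  }

  return join ",", @output;
}

print ranges_from_sorted(9, 11, 12, 13, 14, 23, 25, 26, 27, 50), "\n";
# prints 9,11-14,23,25-27,50
```

The sentinel is just one way to avoid repeating the push after the loop; a separate flush after the loop works equally well.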

There we go.  Now for the regex version.

  sub list2range {
    my $list = shift;
    $list =~ s<
      \b                  # boundary (see below for why)
      (\d+)               # capture to $1 a number

      (?:                 # this chunk...
        ,                   # a comma
        (                   # capture to $2 (and indirectly to $+)
          (??{ 1 + $+ })      # one more than the last number we matched 
        )
        \b                  # followed by a boundary
      )+                  # ... one or more times
    ><$1-$+>gx;           # replace with "first-last"
    return $list;
  }

Whew.  Much shorter, no? ;)  Just a bit of explanation... the last
captured part of a regex is stored in $+.  The first time we execute the
(??{ 1 + $+ }) part of the regex, $+ refers to $1, which is the first
number we've matched.  However, you'll notice that (??{ ... }) is inside
parentheses itself!  That means that once it has matched, $2 is set to
whatever it matched, and that means that $+ is set to that value as well,
so from then on, $+ is referring to what ((??{ 1 + $+ })) matched the last
time.
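
To see how $+ behaves on its own, without the dynamic regex, here is a small sketch using ordinary capture groups. The helper name last_capture is made up for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: match $pattern against $text and return $+,
# or undef if the match fails.
sub last_capture {
  my ($text, $pattern) = @_;
  return $text =~ $pattern ? $+ : undef;
}

# Both groups matched, so $+ is the same as $2.
print last_capture("9,10", qr/(\d+),(\d+)/), "\n";          # prints 10

# Group 2 never matched, so $+ falls back to $1.
print last_capture("9", qr/(\d+)(?:,(\d+))?/), "\n";        # prints 9

# A repeated group keeps what it matched on its last iteration.
print last_capture("9,10,11", qr/(\d+)(?:,(\d+))+/), "\n";  # prints 11
```

The second and third cases are the ones the regex version leans on: $+ starts out as $1, and after that it tracks whatever the repeated group matched most recently.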

I've got \b (word boundaries) in here to make sure we're matching the
WHOLE number.  We don't want a false positive from something like

  10,11,12,135

Since 135 *starts* with 13, we don't want Perl to think the range is
"10-135".  So we ensure that the first number is preceded by a boundary,
and the last number is followed by a boundary, so that we're not skipping
digits.
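
The effect of the trailing \b can be checked with a plain literal pattern, no dynamic regex needed. This is only a sketch of the boundary behaviour; the variable names are mine.

```perl
use strict;
use warnings;

my $list = "10,11,12,135";

# Without a trailing \b, "13" happily matches the first two digits of 135.
my $loose  = $list =~ /\b12,13/   ? 1 : 0;

# With the trailing \b, the "5" right after "13" means there is no
# word boundary there, so the match fails.
my $strict = $list =~ /\b12,13\b/ ? 1 : 0;

print "without \\b: $loose, with \\b: $strict\n";
# prints: without \b: 1, with \b: 0
```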

Which method should you use?  Eh, it's up to you.  I like the regex
approach, because it shows how useful the dynamic regex construct (the
(??{ ... }) thing) can be.  It reduced a sizeable algorithm down to one
simple regex:

  s/\b(\d+)(?:,((??{1+$+}))\b)+/$1-$+/g;

But if you'd feel better using the more "general" approach, then by all
means, do.

-- 
Jeff "japhy" Pinyan      [EMAIL PROTECTED]      http://www.pobox.com/~japhy/
RPI Acacia brother #734   http://www.perlmonks.org/   http://www.cpan.org/
** Look for "Regular Expressions in Perl" published by Manning, in 2002 **
<stu> what does y/// stand for?  <tenderpuss> why, yansliterate of course.

