https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7225

            Bug ID: 7225
           Summary: A regexp for parsing an IPv4 address inconsistently
                    allows/disallows a leading zero
           Product: Spamassassin
           Version: 3.4.1
          Hardware: All
                OS: All
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Libraries
          Assignee: [email protected]
          Reporter: [email protected]

While reading section 3.2.2 of RFC 3986 (URI Generic Syntax),
I noticed that the formal syntax for IPv4address (dec-octet)
differs from the regexp used in four places in SpamAssassin.

Compare:

  dec-octet =
      DIGIT                 ; 0-9
    / %x31-39 DIGIT         ; 10-99
    / "1" 2DIGIT            ; 100-199
    / "2" %x30-34 DIGIT     ; 200-249
    / "25" %x30-35          ; 250-255

with our code:

  use constant IPV4_ADDRESS => qr/\b
    (?:1\d\d|2[0-4]\d|25[0-5]|\d\d|\d)\.
    (?:1\d\d|2[0-4]\d|25[0-5]|\d\d|\d)\.
    (?:1\d\d|2[0-4]\d|25[0-5]|\d\d|\d)\.
    (?:1\d\d|2[0-4]\d|25[0-5]|\d\d|\d)
    \b/ox;

which should have been:

  use constant IPV4_ADDRESS => qr/\b
    (?:1\d\d|2[0-4]\d|25[0-5]|[1-9]\d|\d)\.
    (?:1\d\d|2[0-4]\d|25[0-5]|[1-9]\d|\d)\.
    (?:1\d\d|2[0-4]\d|25[0-5]|[1-9]\d|\d)\.
    (?:1\d\d|2[0-4]\d|25[0-5]|[1-9]\d|\d)
    \b/ox;

Notice that SpamAssassin currently prohibits a leading zero
in a 3-digit octet, but allows it in a 2-digit octet. This
is at the very least inconsistent, and more likely just wrong
(considering the apparent intent of carefully checking for
a valid range of values instead of just accepting any octet
of 1..3 decimal digits).
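
The inconsistency can be reproduced with a short script. This is only
a sketch: the octet alternations are copied from the two regexps above,
but the helper names (matches_current, matches_proposed), the \A...\z
anchoring, and the sample addresses are assumptions made for
illustration and are not taken from the SpamAssassin sources.

```perl
use strict;
use warnings;

# Octet alternations copied from the report: current code vs. proposed fix.
my $octet_current  = qr/(?:1\d\d|2[0-4]\d|25[0-5]|\d\d|\d)/;
my $octet_proposed = qr/(?:1\d\d|2[0-4]\d|25[0-5]|[1-9]\d|\d)/;

my $ipv4_current  = qr/\b$octet_current(?:\.$octet_current){3}\b/;
my $ipv4_proposed = qr/\b$octet_proposed(?:\.$octet_proposed){3}\b/;

# Hypothetical helpers: require the whole string to be an address,
# which is stricter than the unanchored \b...\b use in the report.
sub matches_current  { return $_[0] =~ /\A$ipv4_current\z/  ? 1 : 0 }
sub matches_proposed { return $_[0] =~ /\A$ipv4_proposed\z/ ? 1 : 0 }

# Sample inputs chosen to exercise leading zeros (documentation range).
for my $addr (qw(192.0.2.1 192.0.2.01 192.0.2.001)) {
    printf "%-12s current=%d proposed=%d\n",
        $addr, matches_current($addr), matches_proposed($addr);
}
```

With these inputs, only 192.0.2.01 is treated differently: the current
pattern accepts it, the proposed one rejects it. Both patterns reject
192.0.2.001 and accept 192.0.2.1.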

A dotted-quad IP address with leading zeroes is ambiguous
(octal vs. decimal) and only serves to confuse the software
that processes it. As it happens, the early documents did not
strictly specify a textual representation of an IPv4 address,
and implementations came to differ in their interpretation of
octets with leading zeroes.

See:
  https://en.wikipedia.org/wiki/Dot-decimal_notation

as well as:
  https://tools.ietf.org/html/draft-main-ipaddr-text-rep-02
which states:

| Meanwhile, a very popular implementation of IP networking went off
| in its own direction.  4.2BSD introduced a function inet_aton(),
| [...] It also allowed some
| flexibility in how the individual numeric parts were specified: it
| allowed octal and hexadecimal in addition to decimal, distinguishing
| these radices by using the C language syntax involving a prefix "0"
| or "0x", and allowed the numbers to be arbitrarily long.
|
| The 4.2BSD inet_aton() has been widely copied and imitated, and so is
| a de facto standard for the textual representation of IPv4 addresses.
| Nevertheless, these alternative syntaxes have now fallen out of use
| (if they ever had significant use).  The only practical use that they
| now see is for deliberate obfuscation of addresses [...]
[...]
| The most recent version of the URI syntax [URI] attempts to reconcile
| these variants in order to give a precise definition for acceptable
| IP address syntax in a URL.  (Its predecessors had incorporated the
| traditionally ambiguous syntax by reference.)  [URI] is the first RFC
| to require a completely rigorous definition of IP address syntax.
| The approach taken was to standardise the safe common subset of the
| IETF and BSD syntaxes, which achieves standardisation on IETF-like
| syntax while also retaining backward compatibility with existing BSD-
| based implementations.
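
To make the octal vs. decimal ambiguity concrete, here is a minimal
sketch of the two readings of one and the same string. It does not
call any system resolver; the function names (octet_as_decimal,
octet_as_c_literal, render) are hypothetical, and the C-style branch
only approximates the per-part radix prefixes described in the draft
quoted above.

```perl
use strict;
use warnings;

# Strict decimal reading: leading zeros carry no meaning.
# Only intended for all-digit input in this sketch.
sub octet_as_decimal {
    my ($text) = @_;
    return $text + 0;
}

# C-literal reading: "0x" prefix means hex, leading "0" means octal.
sub octet_as_c_literal {
    my ($text) = @_;
    return hex($text) if $text =~ /\A0[xX][0-9a-fA-F]+\z/;
    return oct($text) if $text =~ /\A0[0-7]+\z/;
    return $text + 0;
}

# Re-render an address after interpreting each part with $reader.
sub render {
    my ($addr, $reader) = @_;
    return join '.', map { $reader->($_) } split /\./, $addr;
}

my $addr = '192.168.010.1';   # sample input, chosen for illustration
print "decimal reading: ", render($addr, \&octet_as_decimal),   "\n";
print "C-style reading: ", render($addr, \&octet_as_c_literal), "\n";
# prints 192.168.10.1 for the decimal reading
# prints 192.168.8.1  for the C-style reading
```

The same text thus denotes two different hosts depending on the
parser, which is why the RFC 3986 grammar excludes leading zeroes.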

So, in the interest of self-consistency, the parsing regexp should
either be fixed as proposed above or relaxed to accept any octet of
1..3 decimal digits.
