On Thu, 04 Apr 2019 18:13:43 +0200 Jonas Smedegaard <jo...@jones.dk> wrote:
> control: tag -1 confirmed
> 
> Quoting Sandro Mani (2019-04-04 13:36:28)
> > $ wget 
> > https://files.pythonhosted.org/packages/source/x/xonsh/xonsh-0.8.12.tar.gz
> > $ tar xf xonsh-0.8.12.tar.gz
> > $ licensecheck xonsh-0.8.12/xonsh/parser_table.py
> > 
> > => Licensecheck hangs eating cpu cycles (the file has lines with 33k and 
> > 71k characters).
> 
> Indeed. Thanks for reporting!
> 
>  - Jonas
> 
> -- 
>  * Jonas Smedegaard - idealist & Internet-arkitekt
>  * Tlf.: +45 40843136  Website: http://dr.jones.dk/
> 
>  [x] quote me freely  [ ] ask before reusing  [ ] keep private

Hi,

I have been digging in the code (admittedly using the master branch of
the libregexp-pattern-license-perl and licensecheck rather than the
packages) and basically, it is a DOS from suboptimal regex.

I traced it down to getting stuck on the python_2 "grant_license".  This
regex expands to (manually reformatted with /x for readability):

"""
m!
(?^:
    (?:
        (?: (?:[Ll]icensed|[Rr]eleased) [ ] under|(?:according [ ] to|as
            [ ] governed [ ] by|under) [ ] the [ ] (?:conditions|terms)
            [ ] of)(?:(?:[Tt]he [ ] )?Python-2.0

      | (?:[Tt]he [ ])?Python(?: [ ] [Ll]icense)? [ ] 2.0
      | (?:[Tt]he [ ])?Python-2.0
      | (?:[Tt]he [ ])?Python [ ] Software [ ]
            Foundation(?: [ ] [Ll]icense)? [ ] version [ ] 2
      | (?:[Tt]he [ ])?python2
      | (?:[Tt]he [ ])?Python-2
      | (?:[Tt]he [ ])?PSF-2
      | (?:[Tt]he [ ])?Python(?: [ ] [Ll]icense)? [ ] Version [ ] 2
      | (?:[Tt]he [ ])?PYTHON [ ] SOFTWARE [ ] FOUNDATION [ ] LICENSE [
] VERSION [ ] 2
      | (?:[Tt]he [ ])?python-license-2.0)
      | (?:\W*\S+\W*)PSF [ ] is [ ] making [ ] Python [ ] available [ ]
to [ ] Licensee

    )

)
!x
"""

The problem is the *last* alternative, namely:

"""
  (?:\W*\S+\W*)PSF [ ] is [ ] making [ ] [...]
"""


That \W*\S+\W* (known as ${BB} in the libregexp-pattern-license-perl
code) is stirring up hell. Basically, perl wants to find the *longest*
match and will spent stupid amount of time in this "trivial" regex
enumerating exponentially many "non-matches" ([1] strikes again).

Simply removing ${BB} will make the code continue past the python_2 test
relatively fast.   For the python_2 case, I think that the phrase "PSF
is making Python available to Licensee" would be sufficient enough to
consider it a match (i.e. ${BB} is redundant) - though it will change
behaviour on an anchored match (I hope this is not a problem).


Though it then gets stuck in the next regex "cube" (from
@L_type_unversioned) and that is as far down the rabbit hole I ventured
in terms of regex getting stuck (note that "cube" indirectly uses the
$BB regex too).

Thanks,
~Niels

[1] https://swtch.com/~rsc/regexp/regexp1.html

Reply via email to