On Thu, 04 Apr 2019 18:13:43 +0200 Jonas Smedegaard <jo...@jones.dk> wrote: > control: tag -1 confirmed > > Quoting Sandro Mani (2019-04-04 13:36:28) > > $ wget > > https://files.pythonhosted.org/packages/source/x/xonsh/xonsh-0.8.12.tar.gz > > $ tar xf xonsh-0.8.12.tar.gz > > $ licensecheck xonsh-0.8.12/xonsh/parser_table.py > > > > => Licensecheck hangs eating cpu cycles (the file has lines with 33k and > > 71k characters). > > Indeed. Thanks for reporting! > > - Jonas > > -- > * Jonas Smedegaard - idealist & Internet-arkitekt > * Tlf.: +45 40843136 Website: http://dr.jones.dk/ > > [x] quote me freely [ ] ask before reusing [ ] keep private
Hi, I have been digging in the code (admittedly using the master branch of the libregexp-pattern-license-perl and licensecheck rather than the packages) and basically, it is a DOS from suboptimal regex. I traced it down to getting stuck on the python_2 "grant_license". This regex expands to (manually reformatted with /x for readability): """ m! (?^: (?: (?: (?:[Ll]icensed|[Rr]eleased) [ ] under|(?:according [ ] to|as [ ] governed [ ] by|under) [ ] the [ ] (?:conditions|terms) [ ] of)(?:(?:[Tt]he [ ] )?Python-2.0 | (?:[Tt]he [ ])?Python(?: [ ] [Ll]icense)? [ ] 2.0 | (?:[Tt]he [ ])?Python-2.0 | (?:[Tt]he [ ])?Python [ ] Software [ ] Foundation(?: [ ] [Ll]icense)? [ ] version [ ] 2 | (?:[Tt]he [ ])?python2 | (?:[Tt]he [ ])?Python-2 | (?:[Tt]he [ ])?PSF-2 | (?:[Tt]he [ ])?Python(?: [ ] [Ll]icense)? [ ] Version [ ] 2 | (?:[Tt]he [ ])?PYTHON [ ] SOFTWARE [ ] FOUNDATION [ ] LICENSE [ ] VERSION [ ] 2 | (?:[Tt]he [ ])?python-license-2.0) | (?:\W*\S+\W*)PSF [ ] is [ ] making [ ] Python [ ] available [ ] to [ ] Licensee ) ) !x """ The problem is the *last* alternative, namely: """ (?:\W*\S+\W*)PSF [ ] is [ ] making [ ] [...] """ That \W*\S+\W* (known as ${BB} in the libregexp-pattern-license-perl code) is stirring up hell. Basically, perl wants to find the *longest* match and will spent stupid amount of time in this "trivial" regex enumerating exponentially many "non-matches" ([1] strikes again). Simply removing ${BB} will make the code continue past the python_2 test relatively fast. For the python_2 case, I think that the phrase "PSF is making Python available to Licensee" would be sufficient enough to consider it a match (i.e. ${BB} is redundant) - though it will change behaviour on an anchored match (I hope this is not a problem). Though it then gets stuck in the next regex "cube" (from @L_type_unversioned) and that is as far down the rabbit hole I ventured in terms of regex getting stuck (note that "cube" indirectly uses the $BB regex too). Thanks, ~Niels [1] https://swtch.com/~rsc/regexp/regexp1.html