[issue41748] HTMLParser: parsing error

STINNER Victor Wed, 09 Sep 2020 07:14:12 -0700


STINNER Victor <vstin...@python.org> added the comment:


HTMLParser.check_for_whole_start_tag() uses the locatestarttagend_tolerant 
regular expression to find the end of the start tag. When a comma follows an 
attribute value, the regex stops matching right after the comma if the next 
attribute name follows it without whitespace:

* '<div id="test" , color="blue">' => '<div id="test" , color="blue"': OK!
* '<div id="test" ,color="blue">' => '<div id="test" ,' => BUG

The regex is quite complex:

locatestarttagend_tolerant = re.compile(r"""
  <[a-zA-Z][^\t\n\r\f />\x00]*       # tag name
  (?:[\s/]*                          # optional whitespace before attribute name
    (?:(?<=['"\s/])[^\s/>][^\s/=>]*  # attribute name
      (?:\s*=+\s*                    # value indicator
        (?:'[^']*'                   # LITA-enclosed value
          |"[^"]*"                   # LIT-enclosed value
          |(?!['"])[^>\s]*           # bare value
         )
         (?:\s*,)*                   # possibly followed by a comma
       )?(?:\s|/(?!>))*
     )*
   )?
  \s*                                # trailing whitespace
""", re.VERBOSE)
endendtag = re.compile('>')
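
The two cases listed earlier can be reproduced with the regex alone, without 
going through HTMLParser. This is a minimal sketch: the pattern is copied from 
this comment, and matched_prefix() is a hypothetical helper added only for 
illustration.

```python
import re

# Pattern copied verbatim from the comment (the version with the comma rule).
locatestarttagend_tolerant = re.compile(r"""
  <[a-zA-Z][^\t\n\r\f />\x00]*       # tag name
  (?:[\s/]*                          # optional whitespace before attribute name
    (?:(?<=['"\s/])[^\s/>][^\s/=>]*  # attribute name
      (?:\s*=+\s*                    # value indicator
        (?:'[^']*'                   # LITA-enclosed value
          |"[^"]*"                   # LIT-enclosed value
          |(?!['"])[^>\s]*           # bare value
         )
         (?:\s*,)*                   # possibly followed by a comma
       )?(?:\s|/(?!>))*
     )*
   )?
  \s*                                # trailing whitespace
""", re.VERBOSE)


def matched_prefix(tag):
    """Return the part of `tag` consumed by the regex (hypothetical helper)."""
    m = locatestarttagend_tolerant.match(tag)
    return m.group() if m else None


# Comma followed by whitespace: the whole start tag up to '>' is matched.
print(matched_prefix('<div id="test" , color="blue">'))
# → <div id="test" , color="blue"

# Comma directly followed by the attribute name: the match stops at the comma.
print(matched_prefix('<div id="test" ,color="blue">'))
# → <div id="test" ,
```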

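The problem is this part of the regex:

  (?:\s*,)*                   # possibly followed by a comma

It consumes the comma, so the comma is not seen as part of the attribute name. 
The next attribute name then has to start right after the comma, where the 
(?<=['"\s/]) lookbehind fails.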
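
To see why the position of the comma matters, the attribute-name fragment of 
the regex can be tried in isolation. This is only a sketch: ATTR_NAME is a 
hypothetical name for that fragment, extracted here for illustration, and it 
is not how html.parser itself is structured.

```python
import re

# Attribute-name fragment of locatestarttagend_tolerant, taken on its own.
# The lookbehind requires a quote, whitespace or '/' just before the name.
ATTR_NAME = re.compile(r"""(?<=['"\s/])[^\s/>][^\s/=>]*""")

tag = '<div id="test" ,color="blue">'
comma_pos = tag.index(',')

# Starting at the comma: the previous character is a space, so the lookbehind
# succeeds and the comma becomes the first character of the name.
print(ATTR_NAME.match(tag, comma_pos).group())
# → ,color

# Starting just after the comma (where the regex ends up once the comma has
# been consumed): the previous character is ',', so the lookbehind fails.
print(ATTR_NAME.match(tag, comma_pos + 1))
# → None
```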

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue41748>
_______________________________________