Bugs item #745002, was opened at 2003-05-28 12:30
Message generated for change (Comment added) made by fdrake
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=745002&group_id=5470

Please note that this message will contain a full copy of the comment
thread, including the initial issue submission, for this request,
not just the latest update.

Category: Python Library
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Samuel Bayer (sambayer)
Assigned to: Nobody/Anonymous (nobody)
Summary: <> in attrs in sgmllib not handled

Initial Comment:
Hi folks -

This bug is noted in the source code for sgmllib.py, and it finally
bit me. If you feed the SGMLParser class text such as

<tag attr = "<attrtag> bar </attrtag>">foo</tag>

the <attrtag> will be processed as a tag, as well as being recognized
as part of the attribute. This is because of the way the end index for
the opening tag is computed.

As far as I can tell from the HTML 4.01 specification, this is legal.
The case I encountered was in a value of an "onmouseover" attribute,
which was a Javascript call which contained HTML text as one of its
arguments.

The problem is in SGMLParser.parse_starttag, which attempts to compute
the end of the opening tag with a simple regexp [<>], and uses this
index even when the attributes have passed it. There's no real need to
check this regexp in advance, as far as I can tell.

I've attached my proposed modification of SGMLParser.parse_starttag;
I've tested this change in 2.2.1, but there are no relevant differences
between 2.2.1 and the head of the CVS tree for this method. No
guarantees of correctness, but it works on the examples I've tested it
on.

Cheers -
Sam Bayer

================================

w_endbracket = re.compile("\s*[<>]")

class SGMLParser:

    # Internal -- handle starttag, return length or -1 if not terminated
    def parse_starttag(self, i):
        self.__starttag_text = None
        start_pos = i
        rawdata = self.rawdata
        if shorttagopen.match(rawdata, i):
            # SGML shorthand: <tag/data/ == <tag>data</tag>
            # XXX Can data contain &... (entity or char refs)?
            # XXX Can data contain < or > (tag characters)?
            # XXX Can there be whitespace before the first /?
            match = shorttag.match(rawdata, i)
            if not match:
                return -1
            tag, data = match.group(1, 2)
            self.__starttag_text = '<%s/' % tag
            tag = tag.lower()
            k = match.end(0)
            self.finish_shorttag(tag, data)
            self.__starttag_text = rawdata[start_pos:match.end(1) + 1]
            return k
        # Now parse the data between i+1 and the end of the tag
        # into a tag and attrs
        attrs = []
        if rawdata[i:i+2] == '<>':
            # SGML shorthand: <> == <last open tag seen>
            k = i + 1
            tag = self.lasttag
        else:
            match = tagfind.match(rawdata, i+1)
            if not match:
                self.error('unexpected call to parse_starttag')
            k = match.end(0)
            tag = rawdata[i+1:k].lower()
            self.lasttag = tag
        while w_endbracket.match(rawdata, k) is None:
            match = attrfind.match(rawdata, k)
            if not match: break
            attrname, rest, attrvalue = match.group(1, 2, 3)
            if not rest:
                attrvalue = attrname
            elif attrvalue[:1] == '\'' == attrvalue[-1:] or \
                 attrvalue[:1] == '"' == attrvalue[-1:]:
                attrvalue = attrvalue[1:-1]
            attrs.append((attrname.lower(), attrvalue))
            k = match.end(0)
        match = endbracket.search(rawdata, k)
        if not match:
            return -1
        j = match.start(0)
        if rawdata[j] == '>':
            j = j+1
        self.__starttag_text = rawdata[start_pos:j]
        self.finish_starttag(tag, attrs)
        return j

----------------------------------------------------------------------

>Comment By: Fred L. Drake, Jr. (fdrake)
Date: 2006-06-23 02:16

Message:
Logged In: YES
user_id=3066

See also: http://www.python.org/sf/803422

----------------------------------------------------------------------

Comment By: Martin v. Löwis (loewis)
Date: 2003-06-14 11:23

Message:
Logged In: YES
user_id=21627

I see. Can you please attach the fix as a context or unified
diff to this report? I can't follow your changes above at all.

----------------------------------------------------------------------

Comment By: Samuel Bayer (sambayer)
Date: 2003-06-14 09:35

Message:
Logged In: YES
user_id=40146

I'm reporting it because

(a) it's not in the bug queue, and
(b) it's broken

The fact that it's noted as a bug in the source code doesn't
strike me as relevant. Especially since I attached a fix.

----------------------------------------------------------------------

Comment By: Martin v. Löwis (loewis)
Date: 2003-06-14 03:58

Message:
Logged In: YES
user_id=21627

If this is a known bug, why are you reporting it?

----------------------------------------------------------------------

_______________________________________________
Python-bugs-list mailing list
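
Illustration of the behaviour described in the initial comment. This is
a minimal standalone sketch, not the sgmllib code itself: sgmllib is not
available in Python 3, so the patterns below (ENDBRACKET, W_ENDBRACKET,
TAGFIND, ATTRFIND) are simplified stand-ins assumed for this example,
and the two helper functions are hypothetical names. The first helper
takes the first '<' or '>' after the tag start as the end of the tag,
which is the approach the report objects to. The second consumes
attributes before searching for the bracket, in the spirit of the
proposed parse_starttag change.

```python
import re

# Simplified stand-ins for sgmllib's module-level patterns.
# These are assumptions for this sketch, not the library's real regexes.
ENDBRACKET = re.compile(r"[<>]")
W_ENDBRACKET = re.compile(r"\s*[<>]")
TAGFIND = re.compile(r"[a-zA-Z][-_.a-zA-Z0-9]*")
ATTRFIND = re.compile(
    r"\s*([a-zA-Z_][-_.:a-zA-Z0-9]*)"      # attribute name
    r"(\s*=\s*"                             # optional "= value" part
    r"('[^']*'|\"[^\"]*\"|[^\s<>]*))?"      # quoted or bare value
)

SAMPLE = '<tag attr = "<attrtag> bar </attrtag>">foo</tag>'


def naive_tag_end(rawdata, i):
    """Treat the first '<' or '>' after position i as the end of the tag."""
    match = ENDBRACKET.search(rawdata, i + 1)
    if match is None:
        return -1
    j = match.start()
    return j + 1 if rawdata[j] == ">" else j


def attr_aware_tag_end(rawdata, i):
    """Consume attributes first, then search for the closing bracket.

    Returns (end_index, attrs); end_index is -1 if the tag is unterminated.
    """
    match = TAGFIND.match(rawdata, i + 1)
    if match is None:
        return -1, []
    k = match.end()
    attrs = []
    # Keep reading attributes until only whitespace separates us
    # from a bracket.
    while W_ENDBRACKET.match(rawdata, k) is None:
        match = ATTRFIND.match(rawdata, k)
        if match is None:
            break
        name, rest, value = match.group(1, 2, 3)
        if not rest:
            value = name            # bare attribute: value defaults to name
        elif value[:1] == value[-1:] and value[:1] in ("'", '"'):
            value = value[1:-1]     # strip matching quotes
        attrs.append((name.lower(), value))
        k = match.end()
    match = ENDBRACKET.search(rawdata, k)
    if match is None:
        return -1, attrs
    j = match.start()
    return (j + 1 if rawdata[j] == ">" else j), attrs


if __name__ == "__main__":
    print(repr(SAMPLE[:naive_tag_end(SAMPLE, 0)]))
    end, attrs = attr_aware_tag_end(SAMPLE, 0)
    print(repr(SAMPLE[:end]), attrs)
```

On the sample from the report, the naive helper stops at the '<' inside
the quoted attribute value, while the attribute-aware helper returns the
whole opening tag together with the attribute value intact.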