Bugs item #1504333, was opened at 2006-06-11 08:58
Message generated for change (Comment added) made by haepal
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1504333&group_id=5470

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: Python Library
Group: Python 2.4
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: Sam Ruby (rubys)
Assigned to: Nobody/Anonymous (nobody)
Summary: sgmllib should allow angle brackets in quoted values

Initial Comment:
Real live example (search for "other<br />corrections")

http://latticeqcd.blogspot.com/2006/05/non-relativistic-qcd.html

This addresses the following (included in the file):

# XXX The following should skip matching quotes (' or ")

----------------------------------------------------------------------

Comment By: Haejoong Lee (haepal)
Date: 2007-01-11 13:01

Message:
Logged In: YES
user_id=135609
Originator: NO

Could someone check if the following patch fixes the problem? This patch
was made against revision 51854.

--- sgmllib.py.org      2006-11-06 02:31:12.000000000 -0500
+++ sgmllib.py  2007-01-11 12:39:30.000000000 -0500
@@ -16,6 +16,35 @@
 
 # Regular expressions used for parsing
 
+class MyMatch:
+    def __init__(self, i):
+        self._i = i
+    def start(self, i):
+        return self._i
+
+class EndBracket:
+    def search(self, data, index):
+        s = data[index:]
+        bs = None
+        quote = None
+        for i,c in enumerate(s):
+            if bs:
+                bs = False
+            else:
+                if c == '<' or c == '>':
+                    if quote is None:
+                        break
+                elif c == "'" or c == '"':
+                    if c == quote:
+                        quote = None
+                    else:
+                        quote = c
+                elif c == '\\':
+                    bs = True
+        else:
+            return None
+        return MyMatch(i+index)
+
 interesting = re.compile('[&<]')
 incomplete = re.compile('&([a-zA-Z][a-zA-Z0-9]*|#[0-9]*)?|'
                            '<([a-zA-Z][^<>]*|'
@@ -29,7 +58,8 @@
 shorttagopen = re.compile('<[a-zA-Z][-.a-zA-Z0-9]*/')
 shorttag = re.compile('<([a-zA-Z][-.a-zA-Z0-9]*)/([^/]*)/')
 piclose = re.compile('>')
-endbracket = re.compile('[<>]')
+#endbracket = re.compile('[<>]')
+endbracket = EndBracket()
 tagfind = re.compile('[a-zA-Z][-_.a-zA-Z0-9]*')
 attrfind = re.compile(
     r'\s*([a-zA-Z_][-:.a-zA-Z_0-9]*)(\s*=\s*'

----------------------------------------------------------------------

Comment By: Neal Norwitz (nnorwitz)
Date: 2006-09-11 00:26

Message:
Logged In: YES
user_id=33168

I reverted the patch and added the test case for sgml so the
infinite loop doesn't recur. This was mentioned several times on
python-dev.

Committed revision 51854. (head)
Committed revision 51850. (2.5)
Committed revision 51853. (2.4)

----------------------------------------------------------------------

Comment By: Fred L. Drake, Jr. (fdrake)
Date: 2006-06-29 13:17

Message:
Logged In: YES
user_id=3066

I checked in a modified version of this patch: changed to
use separate REs for start and end tags to reduce matching
cost for end tags; extended tests; updated to avoid
breaking previous changes to support IPv6 addresses in
unquoted attribute values.

Committed as revisions 47154 (trunk) and 47155
(release24-maint).

----------------------------------------------------------------------
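
For readers who want to reproduce the behaviour discussed in this thread without sgmllib itself, here is a minimal sketch. It takes only the endbracket pattern shown in the diff and compares it with a quote-aware scan. The helper find_tag_end and the sample tag are hypothetical illustrations (the tag is modelled on the "other<br />corrections" string from the initial comment). This is not the code from any of the patches above: it ignores backslash escapes and returns -1 for an unterminated quote.

```python
import re

# Pattern sgmllib uses to locate the end of a start tag (copied from the diff).
endbracket = re.compile('[<>]')


def find_tag_end(data, index=0):
    """Hypothetical helper, not part of sgmllib.

    Return the offset of the first '<' or '>' at or after `index` that is
    not inside a quoted attribute value, or -1 if there is none.
    """
    quote = None  # the quote character we are currently inside, if any
    for i in range(index, len(data)):
        c = data[i]
        if quote is not None:
            # Inside a quoted value: only the same quote character closes it.
            if c == quote:
                quote = None
        elif c in '"\'':
            quote = c
        elif c in '<>':
            return i
    return -1


# Made-up tag with angle brackets inside a quoted attribute value.
tag = '<a title="other<br />corrections" href="x">'

naive = endbracket.search(tag, 1).start()  # stops at the '<' of '<br />'
aware = find_tag_end(tag, 1)               # stops at the real closing '>'

print(tag[:naive + 1])
print(tag[:aware + 1])
```

The naive search ends the tag in the middle of the title value, which is the symptom reported in the summary. The quote-aware scan reaches the final '>'.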