[issue1486713] HTMLParser : A auto-tolerant parsing mode

Ezio Melotti Wed, 16 Nov 2011 05:17:00 -0800

Ezio Melotti <ezio.melo...@gmail.com> added the comment:

> 16ed15ff0d7c was not in current stable py3.2 so I missed it..


It's also in 3.2 and 2.7 (but it's quite recent, so if you didn't pull recently 
you might have missed it).

> When the comma is now raised as attribute name, then the problem is 
> anyway moved to the higher level anyway - and is/can be handled easily 
> there by usual methods.

The next level could/should validate the name of the attribute and determine 
that ',' is not a valid attribute name, so in this case there's no warning to 
raise here.  (You could detect at this level that the name is not a-zA-Z, or 
whatever the specs say, and raise a more general warning, but no information 
is lost by leaving that check to the next level.)
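
As a rough illustration of such a check at the next level, here is a minimal 
sketch of an HTMLParser subclass that inspects the attribute names it is 
handed.  The class name, the helper, and the name pattern are made up for this 
example; the pattern is a deliberately narrow assumption and not what the HTML 
specs define as a valid attribute name.

```python
import re
from html.parser import HTMLParser

# Assumption: a deliberately simple notion of a "valid" attribute name
# (an ASCII letter, then letters, digits, '_', ':', '.' or '-').
# The HTML specs allow more than this.
VALID_ATTR_NAME = re.compile(r'[A-Za-z][A-Za-z0-9_:.-]*\Z')


def is_valid_attr_name(name):
    """Return True if name matches the simplified pattern above."""
    return VALID_ATTR_NAME.match(name) is not None


class AttrCheckingParser(HTMLParser):
    """Hypothetical upper layer that records odd attribute names."""

    def __init__(self):
        super().__init__()
        self.suspicious = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs supplied by HTMLParser.
        for name, _value in attrs:
            if not is_valid_attr_name(name):
                self.suspicious.append((tag, name))


parser = AttrCheckingParser()
parser.feed('<form action="/x.php", method="post">')
parser.close()
for tag, name in parser.suspicious:
    print('suspicious attribute name %r in <%s>' % (name, tag))
```

The parser itself stays unaware of what counts as valid; whatever it reports 
as an attribute name is judged by the code that consumes the events.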

> 100% is not the point unless it shall drive the official W3C checker.

I'm still not sure that detecting only 70-80% of the errors is useful (unless 
we can achieve 100% on this level and leave the rest to an upper layer).  If 
you think this is doable you could first try to identify which errors should 
be detected by this layer, see if they are all detectable, and then propose a 
patch.

> The call of self.warning, as in old patch, doesn't cost otherwise and
> I see no real increase of complexity/cpu-time.

The extra complexity is mainly in the already complex regular expressions, and 
also in the list of 'if' that will have to check the content of the groups to 
report the warnings.  These changes are indeed not too invasive, but they still 
make the code more complicated.
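
To make the kind of extra complexity concrete, here is a small sketch.  It 
does not use the regular expressions from HTMLParser or from the proposed 
patch: the pattern, the group names, and the function below are simplified 
and hypothetical.  It only shows how an extra group added to report a problem 
brings a matching 'if' with it.

```python
import re

# Hypothetical, simplified attribute pattern; NOT the one used by HTMLParser.
ATTR = re.compile(r"""
    \s*(?P<junk>[^\sA-Za-z/>=]+)?      # extra group: stray characters like ','
    \s*(?P<name>[A-Za-z][^\s/=>]*)     # attribute name
    (?:\s*=\s*(?P<value>"[^"]*"|'[^']*'|[^\s>]+))?   # optional value
""", re.VERBOSE)


def scan_attributes(text):
    """Return (attrs, warnings) for the attribute part of a start tag."""
    attrs, warnings = [], []
    pos = 0
    while pos < len(text):
        m = ATTR.match(text, pos)
        if m is None or m.end() == pos:
            break
        # Each extra group needs its own check to turn it into a warning.
        if m.group('junk'):
            warnings.append('stray %r before %r' % (m.group('junk'),
                                                    m.group('name')))
        value = m.group('value')
        if value is not None:
            if value[0] in '"\'':
                value = value[1:-1]
            else:
                warnings.append('unquoted value for %r' % m.group('name'))
        attrs.append((m.group('name'), value))
        pos = m.end()
    return attrs, warnings


attrs, warnings = scan_attributes('href="x", class=y')
print(attrs)
print(warnings)
```

Even in this toy version, every condition worth reporting costs either a new 
group in the pattern or a new branch in the loop.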

> Almost any app which parses HTML (self authored or remote) can have 
> (should have?) a no-fuzz/collateral warn log option. (->no need to 
> make a expensive W3C checker session).

I think the original goal of HTMLParser was parsing mostly-valid HTML.  People 
started reporting issues with less-valid HTML, and these issues got fixed to 
make it able to parse non-valid HTML.  AFAIK it never strictly followed any 
HTML standard, and it just provided a best-effort way to get data out of an 
HTML page.  So I would consider doing validation, or even being a building 
block for a conforming parser, out of the scope of the module.

> I mostly have this in use as said, as it was anyway there.

If 'this' refers to some kind of warning system, what do you do with these 
warnings?  Do you fix them, avoid using the W3C validator (or any other 
conforming validator) and consider a mostly-valid page good enough?  Or do you 
fix them, and then also check with the W3C validator?

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue1486713>
_______________________________________