Re: [uf-discuss] stats on well formed XHTML

Derrick Lyndon Pallas Wed, 16 Jan 2008 19:42:01 -0800

Kevin Burton wrote:

I'm not sure what you mean here, but I'd recommend against using an
XML parser against web content and instead use something like the
HTML5 parsing algorithm [#html5-parsing].


Yes... I'm just trying to avoid using a full HTML parser (DOM or not)
to avoid garbage generation and processor overhead.

I use a streaming (SAX-like) HTML5 parser every day; because it's
defined in terms of the underlying state machine, it's actually quite a
bit faster than what I had been using. Furthermore, many edge cases that
might otherwise have gone unnoticed are dealt with cleanly.
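
A minimal sketch of the streaming style described above, assuming
Python's standard-library html.parser.HTMLParser as a stand-in. It is
event-driven and tolerant of malformed markup, but it does not implement
the HTML5 parsing algorithm, and it is not the parser referred to in
this message. LinkCollector and collect_links are hypothetical names
chosen for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical handler: records href values as start tags stream past."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased; attrs is a list of
        # (name, value) pairs. No tree is built, so nothing is retained
        # beyond what this handler chooses to keep.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def collect_links(chunks):
    """Feed markup incrementally, as it might arrive from a socket."""
    parser = LinkCollector()
    for chunk in chunks:
        parser.feed(chunk)  # incomplete tags are buffered until more data arrives
    parser.close()
    return parser.links
```

The handler sees each tag once and keeps only the strings it needs,
which is the property that keeps garbage generation low compared with
building a DOM.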

There are bigger problems that you'll face if you're indexing content,
e.g. encoding issues. Tokenizing HTML shouldn't be one of them.
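
To illustrate the kind of encoding issue meant here, a rough sketch in
Python of decoding a fetched page with fallbacks. The function name
decode_body, the candidate order, and the windows-1252 fallback are all
assumptions for the example, not something prescribed in this thread.

```python
def decode_body(raw, declared=None):
    """Return (text, encoding_used) for a page body given as bytes."""
    # Try the charset the server or document declared, then UTF-8.
    for enc in (declared, "utf-8"):
        if not enc:
            continue
        try:
            return raw.decode(enc), enc
        except (UnicodeDecodeError, LookupError):
            # Wrong or unknown charset label: fall through to the next guess.
            continue
    # Last resort: a single-byte encoding, replacing undefined bytes.
    return raw.decode("windows-1252", errors="replace"), "windows-1252"
```

A page labelled with a bogus charset still yields text this way, e.g.
decode_body(b"caf\xe9", "no-such-charset") falls back to windows-1252.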


~Derrick

_______________________________________________
microformats-discuss mailing list
[email protected]
http://microformats.org/mailman/listinfo/microformats-discuss
