Kevin Burton wrote:
I'm not sure what you mean here, but I'd recommend against using an
XML parser against web content and instead use something like the
HTML5 parsing algorithm [#html5-parsing].
Yes... I'm just trying to avoid using a full HTML parser (DOM or not)
to avoid garbage generation and processor overhead.
I use a streaming (SAX-like) HTML5 parser every day; because it's
defined in terms of the underlying state-machine, it's actually quite a
bit faster than what I had been using. Furthermore, many edge cases that
might otherwise have gone unnoticed are dealt with cleanly.
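
To illustrate the streaming style, here is a minimal sketch in Python. It
uses the standard library's html.parser, which is event-driven but is NOT
an implementation of the HTML5 parsing algorithm, so treat it only as a
picture of the callback shape and not as the parser I use. The names
LinkCollector and collect_links are made up for this example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values as start-tag events arrive; no tree is built."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased; attrs is a list of
        # (name, value) pairs, with value None for valueless attributes.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def collect_links(chunks):
    """Feed the parser incrementally, e.g. as data comes off a socket."""
    parser = LinkCollector()
    for chunk in chunks:
        parser.feed(chunk)  # incomplete tags are buffered until the next feed
    parser.close()          # flush whatever is still buffered
    return parser.links
```

Because the handler only keeps the strings it cares about, memory use stays
roughly proportional to the output rather than to the size of the page,
which is the point if you are worried about garbage generation.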
There are bigger problems that you'll face if you're indexing content,
e.g. encoding issues. Tokenizing HTML shouldn't be one of them.
~Derrick
_______________________________________________
microformats-discuss mailing list
http://microformats.org/mailman/listinfo/microformats-discuss