Re: How to not tokenize HTML tag from input string

Yonik Seeley Thu, 08 Feb 2007 13:12:33 -0800

On 2/8/07, Peter W. <[EMAIL PROTECTED]> wrote:

Using a parser to get text out of HTML, XML (including RSS, ATOM) is
only
easy if you control the source documents.


HTML pages in the wild are much different, generating exceptions you
must
catch and deal with.


Yes, that's why the Solr version isn't an HTML parser and assumes a mixed bag
of HTML, javascript, comments, processing instructions, and non-escaped text.
It doesn't throw exceptions because there are no "correctness" assumptions.

If you *know* you have well formed HTML documents, then you definitely
should parse (and extract important entities out to separate fields)
before sending to Lucene/Solr .

-Yonik

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: How to not tokenize HTML tag from input string

Reply via email to