Re: Which is the +best +fast HTML parser/tokenizer that I can use with Lucene for indexing HTML content today ?

Erick Erickson Fri, 11 Mar 2011 04:57:10 -0800

Solr doesn't do it. There are various tokenizers/filters that just strip
the HTML tags, but there's nothing built into Solr that I know of that
understands HTML. HTML-aware operations are outside Solr's purview.
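
To make "just strip the HTML tags" concrete, here is a minimal sketch in
plain Java of what such a step roughly amounts to. It is not Solr's or
Lucene's implementation: the class name HtmlTagStripper and the method
strip are made up for illustration. It assumes reasonably well-formed
markup, and it does not decode entities or skip comments or script/style
content.

```java
// Hypothetical helper for illustration only; not part of Solr or Lucene.
final class HtmlTagStripper {
    private HtmlTagStripper() {}

    /**
     * Replaces everything between '<' and '>' with a space, collapses runs
     * of whitespace, and trims the result. Entities such as &amp; are left
     * as-is, and script/style bodies are NOT removed.
     */
    static String strip(String html) {
        StringBuilder out = new StringBuilder(html.length());
        boolean inTag = false;
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (c == '<') {
                inTag = true;
                out.append(' ');      // keep words on either side of a tag apart
            } else if (c == '>' && inTag) {
                inTag = false;        // end of the tag; resume copying text
            } else if (!inTag) {
                out.append(c);        // ordinary text (including a stray '>')
            }
        }
        return out.toString().replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String page = "<html><body><h1>Title</h1><p>Some <b>bold</b> text.</p></body></html>";
        System.out.println(strip(page)); // prints: Title Some bold text.
    }
}
```

The resulting string can then be handed to whatever analyzer you already
use. Note that replacing each tag with a space also splits a word that has
inline markup in the middle of it, which may or may not be what you want
for indexing.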


Best
Erick

On Fri, Mar 11, 2011 at 6:50 AM, shrinath.m <[email protected]> wrote:
> On Fri, Mar 11, 2011 at 5:06 PM, Li Li [via Lucene] <
> [email protected]> wrote:
>
>>   But I think the parser will mostly be used when crawling. So you can
>> use these parsers when crawling and save only the parsed result.
>>
>
> Suppose we have offline HTML pages, with no parsing done while crawling.
> Now what? Has anyone built a tokenizer for this?
>
>
> How does Solr do it?
>
>
> --
> Regards
> Shrinath.M
