Re: Which is the +best +fast HTML parser/tokenizer that I can use with Lucene for indexing HTML content today ?

shrinath.m Fri, 11 Mar 2011 05:33:06 -0800

On Fri, Mar 11, 2011 at 6:27 PM, Erick Erickson [via Lucene] <
[email protected]> wrote:


> Solr doesn't do it. There exist various tokenizers/filters that just strip
> the HTML tags, but there's nothing built into Solr that I know of that
> understands HTML, HTML-aware operations are outside Solr's purview.
>
>
This is how Solr achieves it:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripStandardTokenizerFactory
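
To make the idea concrete, here is a rough sketch of what "strip the HTML, then tokenize" amounts to. This is not Solr's code and does not use the Lucene API; the class and method names (HtmlStripSketch, stripTags, tokenize) are made up for illustration, and it only assumes plain Java. It ignores entities, comments, and script/style blocks, which a real stripper would have to handle.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only: not Solr's HTML-stripping implementation.
public class HtmlStripSketch {

    // Drops everything between '<' and '>' and puts a space where each tag
    // started, so words on either side of a tag do not run together.
    public static String stripTags(String html) {
        StringBuilder out = new StringBuilder(html.length());
        boolean inTag = false;
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (c == '<') {
                inTag = true;
                out.append(' ');
            } else if (c == '>' && inTag) {
                inTag = false;
            } else if (!inTag) {
                out.append(c);
            }
        }
        return out.toString();
    }

    // Lowercases the text and splits on anything that is not a letter or digit.
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String html = "<html><body><h1>Lucene</h1><p>Fast <b>HTML</b> indexing</p></body></html>";
        // prints [lucene, fast, html, indexing]
        System.out.println(tokenize(stripTags(html)));
    }
}
```

The point is only that the markup is discarded before the text reaches the tokenizer, so nothing downstream knows which words came from a heading or a link. That matches Erick's remark that the tags are stripped rather than understood.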


-- 
Regards
Shrinath.M


--
View this message in context: 
http://lucene.472066.n3.nabble.com/Which-is-the-best-fast-HTML-parser-tokenizer-that-I-can-use-with-Lucene-for-indexing-HTML-content-to-tp2664316p2664717.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.
