Solr doesn't do it. There exist various tokenizers/filters that just strip the HTML tags, but there's nothing built into Solr that I know of that understands HTML, HTML-aware operations are outside Solr's purview.
Best Erick On Fri, Mar 11, 2011 at 6:50 AM, shrinath.m <[email protected]> wrote: > On Fri, Mar 11, 2011 at 5:06 PM, Li Li [via Lucene] < > [email protected]> wrote: > >> But I think the parser will most be used when crawling. So you can use >> these parsers when crawling and save parsed result only. >> > > Consider we've offline HTML pages, no parsing while crawling, now what ? > Any tokenizer someone has built for this ? > > > How does Solr do it ? > > > -- > Regards > Shrinath.M > > > -- > View this message in context: > http://lucene.472066.n3.nabble.com/Which-is-the-best-fast-HTML-parser-tokenizer-that-I-can-use-with-Lucene-for-indexing-HTML-content-to-tp2664316p2664411.html > Sent from the Lucene - Java Users mailing list archive at Nabble.com. --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
