Hi Scott,

HTMLStripCharFilter doesn't require that its input be valid HTML - there is no assumption of balanced tags.

Also, highlighted sections could span tags, e.g. if you highlight "this phrase", and the original HTML looks like:

   … this<span>phrase</span> …

the highlighting code would have to know to emit multiple tags to avoid producing markup that is not well-formed, maybe something like:

   … <b>this</b><span><b>phrase</b></span> …

If you do develop a solution here, it would be great if you could share it with the community.

Also, I think it would be useful to have an XML-specific stripping char filter - it's on my long-term to-do list :).

Steve

On Nov 6, 2012, at 1:06 AM, Scott Smith <ssm...@mainstreamdata.com> wrote:

> Since no one answered this, I decided I'd answer it myself (in case anyone
> else wanted the answer).
>
> First, there are two types of filters you can use in an Analyzer -- character
> filters and token filters. Character filters get applied before tokenization
> and token filters get applied after tokenization.
>
> So, my question was really nonsensical. The HTMLStripCharFilter is a
> character filter and therefore gets applied to the html data before it goes
> to the tokenizer. You can then apply any tokenizer you wish (including
> StandardTokenizer).
>
> There is one caveat you might want to be aware of when using the
> HTMLStripCharFilter and then highlighting search terms. Assume you strip the
> html characters with the HTMLStripCharFilter and then use the standard
> tokenizer. Now you run it through the highlighter. If there were other html
> tags (besides whatever you are using for highlighting - <b> by default), then
> you can have cases where your tags won't be properly nested.
>
> For example you could end up with:
>
>    Now is <span class="underline">the <b>time</span></b> for all good men
>    to come...
>
> Note that the <b> isn't properly nested between the beginning and ending
> span. For straight html, I would assume the browser will work it out.
> However, if you are using xml, the document will become invalid.
> The problem is that the html highlight code appears to place the ending tag
> (the </b>) before the next word after the highlight term instead of
> immediately after the marked word ("time"). This means that if there are any
> html tags that the HTMLStripCharFilter eliminated, the closing </b> will come
> after those characters instead of before.
>
> Admittedly, you can make up cases where the highlighter will get it right,
> but it appears to me that that only happens with phrases. For single words
> (the more likely case), the closing highlighting sequence (</b>) should come
> immediately after the highlighted word. Regardless, it's impossible for the
> highlighter to get it right all the time and you may have to write code that
> goes in and fixes stuff up if you're using xml or you're really anal about
> tags being properly nested.
>
> Cheers
>
> Scott
>
> -----Original Message-----
> From: Scott Smith [mailto:ssm...@mainstreamdata.com]
> Sent: Thursday, November 01, 2012 7:16 PM
> To: Michael Sokolov; java-user@lucene.apache.org
> Subject: RE: Highlighting html pages
>
> I was trying to play with this. Am I correct in assuming that this isn't
> going to work with the StandardTokenizer (since it appears to strip angle
> brackets among other things)? Does HTMLStripCharFilter expect a
> WhitespaceTokenizer or a CharacterTokenizer or ??
>
> If I want to get rid of punctuation (commas, periods, semicolons, etc.) after
> the HTML stripping, is there a filter? Essentially, I want to get it back to
> what StandardTokenizer would give me after I've stripped the HTML.
>
> Suggestions?
>
> Scott
>
> -----Original Message-----
> From: Michael Sokolov [mailto:soko...@ifactory.com]
> Sent: Tuesday, October 23, 2012 9:04 PM
> To: java-user@lucene.apache.org
> Cc: Scott Smith
> Subject: Re: Highlighting html pages
>
> If you use HTMLStripCharFilter, it extracts the text only, leaving tags out,
> and remembering the word positions so that highlighting works properly.
> Should do exactly what you want out of the box...
>
>
> On 10/23/2012 8:00 PM, Scott Smith wrote:
>> I need to take an html page that I retrieve from my lucene search and
>> highlight all of the terms that are part of the search. I need to skip over
>> any html tags since I don't want any words in tags which happen to match the
>> search to be highlighted.
>>
>> Note that I don't want sections of the document. I need to highlight all
>> terms in the document (with a <span> or something similar) and get back the
>> entire document (with the new <span>s) so it can be displayed in its
>> entirety with the search terms highlighted.
>>
>> Last time I did this (in the days of 1.4.2 - so a while ago), I had to write
>> a custom tokenizer that skipped over the html tokens so that I didn't
>> accidentally highlight them. I'm hoping that there is an easier way to do
>> this now.
>>
>> Suggestions?
>>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
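
For anyone who wants to experiment with the "multiple tags" idea from the top of this thread, here is a minimal sketch in plain Java. It is not Lucene code and not how the Lucene highlighter works: SplitHighlighter and highlight() are hypothetical names, and the sketch assumes you already have the start and end character offsets of the hit in the original HTML. It also assumes simple markup where every '<' starts a tag and the next '>' ends it (no comments, scripts, or '>' inside attribute values). The <b> is closed before each tag inside the hit and reopened after it, so the highlight never straddles an existing tag.

```java
// Hypothetical helper, not part of Lucene: wraps a character range of the
// original HTML in <b>...</b>, splitting the highlight around any tags.
final class SplitHighlighter {

    /**
     * Highlights the text characters in [start, end) of html.
     * Assumption: every '<' opens a tag and the next '>' closes it.
     */
    static String highlight(String html, int start, int end) {
        StringBuilder out = new StringBuilder();
        boolean inTag = false; // currently inside <...> markup
        boolean open = false;  // a <b> is currently open
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            boolean inRange = i >= start && i < end;
            if (c == '<') {
                // Close the highlight before any existing tag begins.
                if (open) { out.append("</b>"); open = false; }
                inTag = true;
            }
            if (!inTag && !inRange && open) {
                // Text after the end of the hit: close the highlight.
                out.append("</b>");
                open = false;
            }
            if (!inTag && inRange && !open) {
                // First text character of a run inside the hit.
                out.append("<b>");
                open = true;
            }
            out.append(c);
            if (c == '>') inTag = false;
        }
        if (open) out.append("</b>");
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "say this<span>phrase</span> now";
        int start = html.indexOf("this");
        int end = html.indexOf("phrase") + "phrase".length();
        // prints: say <b>this</b><span><b>phrase</b></span> now
        System.out.println(highlight(html, start, end));
    }
}
```

Run on the single-word case from Scott's message (highlighting "time" inside the underline span), the same function puts the closing </b> right after the word and before the </span>, which is the nesting an XML consumer would need.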