Hi Scott,

HTMLStripCharFilter doesn't require that its input be valid HTML - there is no 
assumption of balanced tags.  
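
To make the "no balanced tags" point concrete, here is a minimal sketch in plain Java. It is not Lucene's HTMLStripCharFilter and does not use any Lucene API; the class name TagStripSketch and its fields are made up for illustration. It drops anything that looks like a tag without ever checking that opens and closes match, and it records where each kept character came from, which is the kind of bookkeeping a highlighter needs to map matches back onto the original markup. Entities, comments, and scripts are deliberately ignored.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; NOT Lucene's HTMLStripCharFilter.
final class TagStripSketch {
    final String text;        // the input with tag-like runs removed
    final int[] origOffsets;  // origOffsets[i] = index in the input of text.charAt(i)

    private TagStripSketch(String text, int[] origOffsets) {
        this.text = text;
        this.origOffsets = origOffsets;
    }

    static TagStripSketch strip(String html) {
        StringBuilder out = new StringBuilder();
        List<Integer> offsets = new ArrayList<>();
        int i = 0;
        while (i < html.length()) {
            char c = html.charAt(i);
            if (c == '<') {
                int close = html.indexOf('>', i);
                if (close >= 0) {
                    // Skip the tag. Nothing here checks for a matching
                    // open or close tag, so unbalanced input is fine.
                    i = close + 1;
                    continue;
                }
                // A '<' with no later '>' is kept as literal text.
            }
            out.append(c);
            offsets.add(i);
            i++;
        }
        int[] map = new int[offsets.size()];
        for (int k = 0; k < map.length; k++) {
            map[k] = offsets.get(k);
        }
        return new TagStripSketch(out.toString(), map);
    }

    public static void main(String[] args) {
        // The stray </b> has no opening tag; the sketch strips it anyway.
        TagStripSketch r = strip("Now is <span class=\"underline\">the time</span></b> for");
        System.out.println(r.text);
        System.out.println(r.origOffsets[7]); // input index of the 't' in "the"
    }
}
```

The offset array is the important part: a match found in the stripped text can be translated back to positions in the original HTML.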

Also, highlighted sections could span tags, e.g. if you highlight "this 
phrase", and the original HTML looks like:

        … this<span>phrase</span> …

the highlighting code would have to know to emit multiple highlight tags to 
keep the markup well-formed, maybe something like:

        … <b>this</b><span><b>phrase</b></span> … 
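
One way to get that kind of output, sketched in plain Java under some assumptions: the highlight range is given as start/end character offsets into the original HTML (for example, offsets already mapped back from the stripped text), tags are anything between '<' and the next '>', and the highlight tag is <b>. The class and method names are hypothetical, and this is not how Lucene's highlighter works; it only illustrates closing the highlight before each tag and reopening it afterwards.

```java
// Illustrative sketch only; not part of Lucene.
final class SplitHighlightSketch {

    // Wraps the text characters in [start, end) of html in <b>...</b>,
    // closing the highlight before every tag and reopening it on the
    // next highlighted text character, so <b> never straddles a tag.
    static String highlight(String html, int start, int end) {
        StringBuilder out = new StringBuilder();
        boolean open = false; // true while a <b> is currently open
        int i = 0;
        while (i < html.length()) {
            if (html.charAt(i) == '<') {
                int close = html.indexOf('>', i);
                if (close >= 0) {
                    if (open) {
                        out.append("</b>");
                        open = false;
                    }
                    out.append(html, i, close + 1); // copy the tag unchanged
                    i = close + 1;
                    continue;
                }
            }
            boolean inRange = i >= start && i < end;
            if (inRange && !open) {
                out.append("<b>");
                open = true;
            } else if (!inRange && open) {
                out.append("</b>");
                open = false;
            }
            out.append(html.charAt(i));
            i++;
        }
        if (open) {
            out.append("</b>");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "say this<span>phrase</span> now";
        // Offsets 4..20 cover "this", the <span> tag, and "phrase".
        System.out.println(highlight(html, 4, 20));
        // prints: say <b>this</b><span><b>phrase</b></span> now
    }
}
```

Because every <b> is opened and closed within a single run of text, the result stays well-formed whenever the input was, at the cost of emitting more highlight tags than a single wrapping pair.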

If you do develop a solution here, it would be great if you could share it with 
the community.

Also, I think it would be useful to have an XML-specific stripping char filter 
- it's on my long term to-do list :).

Steve

On Nov 6, 2012, at 1:06 AM, Scott Smith <ssm...@mainstreamdata.com> wrote:

> Since no one answered this, I decided I'd answer it myself (in case anyone 
> else wanted the answer).
> 
> First, there are two types of filters you can use in an Analyzer -- Character 
> filters and token filters.  Character filters get applied before tokenization 
> and token filters get applied after tokenization.  
> 
> So, my question was really nonsensical.  The HTMLStripCharFilter is a 
> character filter and therefore gets applied to the html data before it goes 
> to the tokenizer.  You can then apply any tokenizer you wish (including 
> StandardTokenizer).
> 
> There is one caveat you might want to be aware of when using the 
> HTMLStripCharFilter and then highlighting search terms.  Assume you strip the 
> html characters with the HTMLStripCharFilter and then use the standard 
> tokenizer.  Now you run it through the highlighter.  If there were other html 
> tags (besides whatever you are using for highlighting - <b> by default), then 
> you can have cases where your tags won't be properly nested. 
> 
> For example you could end up with:
> 
>       Now is <span class="underline">the <b>time</span></b> for all good men 
> to come... 
> 
> Note that the <b> isn't properly nested between the beginning and ending 
> span.  For straight html, I would assume the browser will work it out.  
> However, if you are using xml, the document will become invalid.  The problem 
> is that the html highlight code appears to place the ending tag (the </b>) 
> before the next word after the highlight term instead of after the marked 
> word ("time").  This means that if there are any html tags that the 
> HTMLStripCharFilter eliminated, the closing </b> will come after those 
> characters instead of before.
> 
> Admittedly, you can make up cases where the highlighter will get it right, 
> but it appears to me that that only happens with phrases.  For single words 
> (the more likely case), the closing highlighting sequence (</b>) should be 
> after the highlighted word.  Regardless, it's impossible for the highlighter 
> to get it right all the time and you may have to write code that goes in and 
> fixes stuff up if you're using xml or you're really anal about tags being 
> properly nested.
> 
> Cheers
> 
> Scott
> 
> -----Original Message-----
> From: Scott Smith [mailto:ssm...@mainstreamdata.com] 
> Sent: Thursday, November 01, 2012 7:16 PM
> To: Michael Sokolov; java-user@lucene.apache.org
> Subject: RE: Highlighting html pages
> 
> I was trying to play with this.  Am I correct in assuming that this isn't 
> going to work with the StandardTokenizer (since it appears to strip angle 
> brackets among other things)?  Does HTMLStripCharFilter expect a 
> WhitespaceTokenizer or a CharTokenizer or ??  
> 
> If I want to get rid of punctuation (commas, periods, semicolons, etc.) after 
> the HTML stripping, is there a filter?  Essentially, I want to get it back to 
> what StandardTokenizer would give me after I've stripped the HTML.
> 
> Suggestions?
> 
> Scott
> 
> -----Original Message-----
> From: Michael Sokolov [mailto:soko...@ifactory.com] 
> Sent: Tuesday, October 23, 2012 9:04 PM
> To: java-user@lucene.apache.org
> Cc: Scott Smith
> Subject: Re: Highlighting html pages
> 
> If you use HTMLStripCharFilter, it extracts the text only, leaving tags out, 
> and remembering the word positions so that highlighting works properly.  
> Should do exactly what you want out of the box...
> 
> 
> On 10/23/2012 8:00 PM, Scott Smith wrote:
>> I need to take an html page that I retrieve from my lucene search and 
>> highlight all of the terms that are part of the search.  I need to skip over 
>> any html tags since I don't want any words in tags which happen to match the 
>> search to be highlighted.
>> 
>> Note that I don't want sections of the document.  I need to highlight all 
>> terms in the document (with a <span> or something similar) and get back the 
>> entire document (with the new <span>s) so it can be displayed in its 
>> entirety with the search terms highlighted.
>> 
>> Last time I did this (in the days of 1.4.2 - so a while ago), I had to write 
>> a custom tokenizer that skipped over the html tokens so that I didn't 
>> accidentally highlight them.  I'm hoping that there is an easier way to do 
>> this now.
>> 
>> Suggestions?
>> 
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
> 
> 
> 


