Hi Scott,
HTMLStripCharFilter doesn't require that its input be valid HTML - there is no
assumption of balanced tags.
Also, highlighted sections could span tags, e.g. if you highlight "this
phrase", and the original HTML looks like:
… this<span>phrase</span> …
the highlighting code would have to know to emit multiple highlight tags to
keep the output well-formed, maybe something like:
… <b>this</b><span><b>phrase</b></span> …
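A minimal sketch of that idea, in plain Java with no Lucene dependency. The
class and method names are hypothetical, the tag matching is a naive regex
rather than a real HTML parser, and it assumes you already have the start and
end character offsets of the match in the original markup and that both offsets
fall on text, not inside a tag. It wraps each run of text between tags
separately, so a highlight tag never straddles another tag:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SplitHighlighter {
    // Naive tag matcher, for illustration only; real HTML needs a real parser.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    /**
     * Highlights html[start, end) with <b>...</b>, closing the highlight
     * before every tag inside the range and reopening it afterwards.
     */
    static String highlight(String html, int start, int end) {
        StringBuilder out = new StringBuilder(html.substring(0, start));
        // Restrict tag search to the highlighted range; offsets stay absolute.
        Matcher m = TAG.matcher(html).region(start, end);
        int pos = start;
        while (m.find()) {
            appendWrapped(out, html.substring(pos, m.start()));
            out.append(m.group()); // copy the tag itself unhighlighted
            pos = m.end();
        }
        appendWrapped(out, html.substring(pos, end));
        out.append(html.substring(end));
        return out.toString();
    }

    // Wraps a text run in <b>...</b>; whitespace-only runs are copied as-is.
    private static void appendWrapped(StringBuilder out, String text) {
        if (text.trim().isEmpty()) {
            out.append(text);
        } else {
            out.append("<b>").append(text).append("</b>");
        }
    }
}
```

With the example above, SplitHighlighter.highlight("this<span>phrase</span>", 0, 16)
should give <b>this</b><span><b>phrase</b></span>.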
If you do develop a solution here, it would be great if you could share it with
the community.
Also, I think it would be useful to have an XML-specific stripping char filter
- it's on my long-term to-do list :).
Steve
On Nov 6, 2012, at 1:06 AM, Scott Smith <[email protected]> wrote:
> Since no one answered this, I decided I'd answer it myself (in case anyone
> else wanted the answer).
>
> First, there are two types of filters you can use in an Analyzer -- Character
> filters and token filters. Character filters get applied before tokenization
> and token filters get applied after tokenization.
>
> So, my question was really nonsensical. The HTMLStripCharFilter is a
> character filter and therefore gets applied to the html data before it goes
> to the tokenizer. You can then apply any tokenizer you wish (including
> StandardTokenizer).
>
> There is one caveat you might want to be aware of when using the
> HTMLStripCharFilter and then highlighting search terms. Assume you strip the
> html characters with the HTMLStripCharFilter and then use the standard
> tokenizer. Now you run it through the highlighter. If there were other html
> tags (besides whatever you are using for highlighting - <b> by default), then
> you can have cases where your tags won't be properly nested.
>
> For example you could end up with:
>
> Now is <span class="underline">the <b>time</span></b> for all good men
> to come...
>
> Note that the <b> isn't properly nested between the beginning and ending
> span. For straight html, I would assume the browser will work it out.
> However, if you are using xml, the document will become invalid. The problem
> is that the html highlight code appears to place the ending tag (the </b>)
> before the next word after the highlight term instead of after the marked
> word ("time"). This means that if there are any html tags that the
> HTMLStripCharFilter eliminated, the closing </b> will come after those
> characters instead of before.
>
> Admittedly, you can make up cases where the highlighter will get it right,
> but it appears to me that that only happens with phrases. For single words
> (the more likely case), the closing highlighting sequence (</b>) should be
> after the highlighted word. Regardless, it's impossible for the highlighter
> to get it right all the time and you may have to write code that goes in and
> fixes stuff up if you're using xml or you're really anal about tags being
> properly nested.
>
> Cheers
>
> Scott
>
> -----Original Message-----
> From: Scott Smith [mailto:[email protected]]
> Sent: Thursday, November 01, 2012 7:16 PM
> To: Michael Sokolov; [email protected]
> Subject: RE: Highlighting html pages
>
> I was trying to play with this. Am I correct in assuming that this isn't
> going to work with the StandardTokenizer (since it appears to strip angle
> brackets among other things)? Does HTMLStripCharFilter expect a
> WhitespaceTokenizer or a CharacterTokenizer or ??
>
> If I want to get rid of punctuation (commas, periods, semicolons, etc.) after
> the HTML stripping, is there a filter? Essentially, I want to get it back to
> what StandardTokenizer would give me after I've stripped the HTML.
>
> Suggestions?
>
> Scott
>
> -----Original Message-----
> From: Michael Sokolov [mailto:[email protected]]
> Sent: Tuesday, October 23, 2012 9:04 PM
> To: [email protected]
> Cc: Scott Smith
> Subject: Re: Highlighting html pages
>
> If you use HTMLStripCharFilter, it extracts the text only, leaving tags out,
> and remembering the word positions so that highlighting works properly.
> Should do exactly what you want out of the box...
>
>
> On 10/23/2012 8:00 PM, Scott Smith wrote:
>> I need to take an html page that I retrieve from my lucene search and
>> highlight all of the terms that are part of the search. I need to skip over
>> any html tags since I don't want any words in tags which happen to match the
>> search to be highlighted.
>>
>> Note that I don't want sections of the document. I need to highlight all
>> terms in the document (with a <span> or something similar) and get back the
>> entire document (with the new <span>s) so it can be displayed in its
>> entirety with the search terms highlighted.
>>
>> Last time I did this (in the days of 1.4.2 - so a while ago), I had to write
>> a custom tokenizer that skipped over the html tokens so that I didn't
>> accidentally highlight them. I'm hoping that there is an easier way to do
>> this now.
>>
>> Suggestions?
>>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]