[jira] [Resolved] (SOLR-42) Highlighting problems with HTMLStripWhitespaceTokenizerFactory

Steven Rowe (Resolved) (JIRA) Tue, 24 Jan 2012 07:57:05 -0800

     [ 
https://issues.apache.org/jira/browse/SOLR-42?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Steven Rowe resolved SOLR-42.
-----------------------------

       Resolution: Fixed
    Fix Version/s: 4.0
                   3.6
         Assignee: Steven Rowe

Fixed by LUCENE-3690.
                
> Highlighting problems with HTMLStripWhitespaceTokenizerFactory
> --------------------------------------------------------------
>
>                 Key: SOLR-42
>                 URL: https://issues.apache.org/jira/browse/SOLR-42
>             Project: Solr
>          Issue Type: Bug
>          Components: highlighter
>            Reporter: Andrew May
>            Assignee: Steven Rowe
>            Priority: Minor
>             Fix For: 3.6, 4.0
>
>         Attachments: HTMLStripReaderTest.java, 
> HtmlStripReaderTestXmlProcessing.patch, 
> HtmlStripReaderTestXmlProcessing.patch, SOLR-42.patch, SOLR-42.patch, 
> SOLR-42.patch, SOLR-42.patch, TokenPrinter.java, htmlStripReaderTest.html
>
>
> Indexing content that contains HTML markup causes problems with highlighting 
> if the HTMLStripWhitespaceTokenizerFactory is used (to prevent the tag names 
> from being searchable).
> Example title field:
> <SUP>40</SUP>Ar/<SUP>39</SUP>Ar laserprobe dating of mylonitic fabrics in a 
> polyorogenic terrane of NW Iberia
> When searching for title:fabrics with highlighting on, the highlighted version 
> has the <em> tags in the wrong place: 22 characters to the left of where they 
> should be (i.e. the sum of the lengths of the tags).
> Response from Yonik on the solr-user mailing-list:
> HTMLStripWhitespaceTokenizerFactory works in two phases...
> HTMLStripReader removes the HTML and passes the result to
> WhitespaceTokenizer... at that point, Tokens are generated, but the
> offsets will correspond to the text after HTML removal, not before.
> I did it this way so that HTMLStripReader could go before any
> tokenizer (like StandardTokenizer).
> Can you open a JIRA bug for this?  The fix would be a special version
> of HTMLStripReader integrated with a WhitespaceTokenizer to keep
> offsets correct. 
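
To illustrate the offset shift described in the issue, here is a minimal
sketch in plain Java. It is not the Solr/Lucene code and not the fix from
LUCENE-3690: the class OffsetShiftDemo, the naive strip() method and the
originalIndex array are hypothetical names made up for this example, and the
tag stripping only handles simple <...> tags. It shows that an offset found in
the stripped text points 22 characters too far left in the original title, and
that recording the original index of every kept character is one way to map the
offset back.

```java
import java.util.Arrays;

// Hypothetical demo class; not part of Solr or Lucene.
final class OffsetShiftDemo {

    // Title field from the issue description (wrapped lines joined with a space).
    static final String RAW_TITLE =
        "<SUP>40</SUP>Ar/<SUP>39</SUP>Ar laserprobe dating of mylonitic fabrics"
        + " in a polyorogenic terrane of NW Iberia";

    // Stripped text plus, for each kept character, its index in the original input.
    static final class Stripped {
        final String text;
        final int[] originalIndex;

        Stripped(String text, int[] originalIndex) {
            this.text = text;
            this.originalIndex = originalIndex;
        }
    }

    // Naive tag removal (assumption: only simple <...> tags, no entities or comments).
    static Stripped strip(String raw) {
        StringBuilder out = new StringBuilder();
        int[] map = new int[raw.length()];
        boolean inTag = false;
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            if (c == '<') {
                inTag = true;
            } else if (c == '>' && inTag) {
                inTag = false;
            } else if (!inTag) {
                map[out.length()] = i; // remember where this character came from
                out.append(c);
            }
        }
        return new Stripped(out.toString(), Arrays.copyOf(map, out.length()));
    }

    public static void main(String[] args) {
        Stripped s = strip(RAW_TITLE);
        int strippedOffset = s.text.indexOf("fabrics");
        int correctedOffset = s.originalIndex[strippedOffset];

        System.out.println("offset in stripped text: " + strippedOffset);
        System.out.println("offset in original text: " + correctedOffset);
        System.out.println("shift: " + (correctedOffset - strippedOffset));
    }
}
```

A highlighter that applies the stripped-text offset to the original field value
inserts its <em> tags too early by exactly the length of the removed markup;
using the corrected offset places them around "fabrics".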
