[jira] [Comment Edited] (SOLR-4679) HTML line breaks (<br>) are removed during indexing; causes wrong search results

Uwe Schindler (JIRA) Thu, 08 Aug 2013 03:14:28 -0700

    [ 
https://issues.apache.org/jira/browse/SOLR-4679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13733328#comment-13733328
 ]


Uwe Schindler edited comment on SOLR-4679 at 8/8/13 10:12 AM:
--------------------------------------------------------------

There is another occurrence of this bug with PDF files (SOLR-5124). I think we
should apply the workaround and make the ignorable whitespace significant. In
my opinion this is not a problem at all, because the Analyzer will remove this
whitespace in any case, so any additional whitespace would disappear.

bq. i did some experimenting and confirmed that the SolrContentHandler is 
getting ignorable whitespace SAX events for <br> tags in HTML – which makes no 
sense to me, so i've opened TIKA-1134 to try and get to the bottom of it.

I know this behavior; I have been discussing it in TIKA since the early days of
the project, and I don't think it will change! TIKA uses ignorable whitespace
for all text-only glue, which was decided at the beginning of the project. I can
find the mail from their lists; I was involved in that, too (because I applied
some fixes to "correctly produce" ignorable whitespace in some parsers that
were failing to do so. I also added the XHTMLContentHandler code that makes
"block" XHTML elements like <p/>, <div/> also emit a newline as ignorable
whitespace on the closing element).

FYI: "ignorable whitespace" is XML semantics only; in (X)HTML this does not
exist (it is handled differently, but is never reported by HTML parsers), so
the idea in TIKA is to "reuse" (it's a bit "incorrect") the ignorableWhitespace
SAX event to report this "added whitespace". The rule that was chosen in TIKA
is:
- If you ignore all elements of HTML and only extract plain text, use the
ignorable whitespace. This is e.g. done by TIKA's internal wrappers that
produce plain text (TextOnlyContentHandler). They treat all ignorable
whitespace as significant. Ignorable whitespace is *only* produced by TIKA, so
if it exists, you know that it is coming from TIKA.
- If you want to keep the XHTML structure and you "understand" block tags and
<br/>, then you can ignore the ignorable whitespace.
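
To illustrate the first rule, here is a minimal sketch of a plain-text SAX
handler that treats ignorable whitespace as significant. It uses only the JDK's
org.xml.sax API. The class name PlainTextCollector and the demo() method are
hypothetical and made up for this example; this is not the code of
SolrContentHandler, of the attached patch, or of any TIKA class. The event
sequence in demo() is hand-written to mimic what was described above for
"linz<br>and", and is only an assumption about what a parser would emit.

```java
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical plain-text collector; illustrative only, not Solr or TIKA code. */
class PlainTextCollector extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        // Regular character data is always kept.
        text.append(ch, start, length);
    }

    @Override
    public void ignorableWhitespace(char[] ch, int start, int length) {
        // Plain-text extraction: treat "ignorable" whitespace as significant
        // by forwarding it to characters() instead of dropping it.
        characters(ch, start, length);
    }

    String getText() {
        return text.toString();
    }

    /** Replays a hand-written event sequence resembling "linz<br>and". */
    static String demo() {
        PlainTextCollector handler = new PlainTextCollector();
        char[] before = "linz".toCharArray();
        char[] glue = "\n".toCharArray(); // assumed whitespace reported for <br>
        char[] after = "and".toCharArray();
        handler.characters(before, 0, before.length);
        handler.ignorableWhitespace(glue, 0, glue.length);
        handler.characters(after, 0, after.length);
        return handler.getText();
    }
}
```

With the override in place, demo() returns "linz" and "and" separated by a
newline, so a later analysis step can split them into two tokens. Without the
override, DefaultHandler's ignorableWhitespace() does nothing, and the result
would be "linzand", which is the symptom in the issue description.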

Given this guideline, your patch is correct and should be applied to Solr.
                
> HTML line breaks (<br>) are removed during indexing; causes wrong search 
> results
> --------------------------------------------------------------------------------
>
>                 Key: SOLR-4679
>                 URL: https://issues.apache.org/jira/browse/SOLR-4679
>             Project: Solr
>          Issue Type: Bug
>          Components: update
>    Affects Versions: 4.2
>         Environment: Windows Server 2008 R2, Java 6, Tomcat 7
>            Reporter: Christoph Straßer
>         Attachments: external.htm, SOLR-4679__weird_TIKA-1134.patch, 
> Solr_HtmlLineBreak_Linz_NotFound.png, Solr_HtmlLineBreak_Vienna.png
>
>
> HTML line breaks (<br>, <BR>, <br/>, ...) seem to be removed during
> extraction of content from HTML files. They need to be replaced with a
> space.
> Test-File:
> <html>
> <head>
> <title>Test mit HTML-Zeilenschaltungen</title>
> </head>
> <p>
> word1<br>word2<br/>
> Some other words, a special name like linz<br>and another special name - 
> vienna
> </p>
> </html>
> The Solr-content-attribute contains the following text:
> Test mit HTML-Zeilenschaltungen    
> word1word2
> Some other words, a special name like linzand another special name - vienna
> So we are not able to find the word "linz".
> We use the ExtractingRequestHandler to put content into Solr. 
> (wiki.apache.org/solr/ExtractingRequestHandler)
