[jira] [Commented] (SOLR-4679) HTML line breaks (<br>) are removed during indexing; causes wrong search results

Hoss Man (JIRA) Thu, 08 Aug 2013 11:15:03 -0700

    [ 
https://issues.apache.org/jira/browse/SOLR-4679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13733791#comment-13733791
 ]


Hoss Man commented on SOLR-4679:
--------------------------------

bq. Because you are still not convinced with my argumentation, let me 
recapitulate TIKA's problems:

I never said that ... you said "I can take the issue if you like." and you 
explained why the existing patch should be committed -- i'm totally willing to 
go along with that, so have at it.  it seems sketchy to me, but if that's the 
way Tika works that's the way Tika works, you certainly understand it better 
than me, so i defer to your assessment.

(as mentioned in TIKA-1134 it would be nice if this type of behavior was better 
documented for people implementing their own ContentHandlers, but that's a Tika 
issue not a Solr issue.)
                
> HTML line breaks (<br>) are removed during indexing; causes wrong search 
> results
> --------------------------------------------------------------------------------
>
>                 Key: SOLR-4679
>                 URL: https://issues.apache.org/jira/browse/SOLR-4679
>             Project: Solr
>          Issue Type: Bug
>          Components: update
>    Affects Versions: 4.2
>         Environment: Windows Server 2008 R2, Java 6, Tomcat 7
>            Reporter: Christoph Straßer
>            Assignee: Uwe Schindler
>         Attachments: external.htm, SOLR-4679__weird_TIKA-1134.patch, 
> Solr_HtmlLineBreak_Linz_NotFound.png, Solr_HtmlLineBreak_Vienna.png
>
>
> HTML line breaks (<br>, <BR>, <br/>, ...) seem to be removed during 
> extraction of content from HTML files. They need to be replaced with a 
> space.
> Test-File:
> <html>
> <head>
> <title>Test mit HTML-Zeilenschaltungen</title>
> </head>
> <p>
> word1<br>word2<br/>
> Some other words, a special name like linz<br>and another special name - 
> vienna
> </p>
> </html>
> The Solr-content-attribute contains the following text:
> Test mit HTML-Zeilenschaltungen    
> word1word2
> Some other words, a special name like linzand another special name - vienna
> So we are not able to find the word "linz".
> We use the ExtractingRequestHandler to put content into Solr. 
> (wiki.apache.org/solr/ExtractingRequestHandler)
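
To make the reported behavior concrete, here is a minimal sketch of the idea of emitting a space for each line break while collecting text. It is not the attached patch and not how Tika or Solr actually implement extraction: it uses only the JDK's SAX parser, the class name BrAwareTextHandler and its extractText helper are hypothetical, and the input is a well-formed XHTML variant of the test file (a plain XML parser would reject an unclosed <br>).

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler (not part of Solr or Tika): collects character data
// and appends a single space whenever a <br> element starts, so that the
// words on either side of the break stay separate tokens.
public class BrAwareTextHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // The default SAXParserFactory is not namespace-aware, so the element
        // name arrives in qName.
        if ("br".equalsIgnoreCase(qName)) {
            text.append(' ');
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    public String getText() {
        return text.toString();
    }

    // Assumed convenience method for this sketch: parse well-formed XHTML
    // from a string and return the collected text.
    public static String extractText(String xhtml) throws Exception {
        BrAwareTextHandler handler = new BrAwareTextHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.getText();
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html><body><p>word1<br/>word2 linz<br/>and vienna</p></body></html>";
        // Prints: word1 word2 linz and vienna
        System.out.println(extractText(xhtml));
    }
}
```

Without the startElement override, the same input would yield "word1word2 linzand vienna", which matches the symptom described in the report: a search for "linz" finds nothing because the token is "linzand".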

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
