[ 
https://issues.apache.org/jira/browse/SOLR-4908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13734760#comment-13734760
 ] 

ASF subversion and git services commented on SOLR-4908:
-------------------------------------------------------

Commit 1512296 from [~thetaphi] in branch 'dev/trunk'
[ https://svn.apache.org/r1512296 ]

SOLR-4679, SOLR-4908, SOLR-5124: Text extracted from HTML or PDF files using 
Solr Cell was missing ignorable whitespace, which is inserted by TIKA for 
convenience to support plain text extraction without using the HTML elements. 
This bug resulted in glued words.
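
As a rough illustration of the behaviour this commit message describes, the sketch below replays the SAX events the issue reports for "glued<br/>words" against a small content handler. TextCollectingHandler, its keepIgnorableWhitespace flag, and replay are hypothetical names made up for this example; this is not the actual SolrContentHandler code or the committed patch, only a minimal model of dropping versus keeping ignorable whitespace.

```java
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stand-in for a text-collecting SAX handler; NOT the real SolrContentHandler.
class TextCollectingHandler extends DefaultHandler {
    private final StringBuilder content = new StringBuilder();
    private final boolean keepIgnorableWhitespace;

    TextCollectingHandler(boolean keepIgnorableWhitespace) {
        this.keepIgnorableWhitespace = keepIgnorableWhitespace;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Regular text is always appended to the extracted content.
        content.append(ch, start, length);
    }

    @Override
    public void ignorableWhitespace(char[] ch, int start, int length) {
        // Dropping these events models the reported bug; appending them models the fix.
        if (keepIgnorableWhitespace) {
            content.append(ch, start, length);
        }
    }

    String text() {
        return content.toString();
    }

    // Replays the event sequence described in the issue: text, a newline
    // reported as ignorable whitespace for <br/>, then more text.
    static String replay(boolean keepIgnorableWhitespace) {
        TextCollectingHandler handler = new TextCollectingHandler(keepIgnorableWhitespace);
        handler.characters("glued".toCharArray(), 0, 5);
        handler.ignorableWhitespace("\n".toCharArray(), 0, 1);
        handler.characters("words".toCharArray(), 0, 5);
        return handler.text();
    }
}
```

Under these assumptions, replay(false) returns the glued string "gluedwords", while replay(true) keeps the newline between the two words.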
                
> SolrContentHandler produces glued words when extracting html
> ------------------------------------------------------------
>
>                 Key: SOLR-4908
>                 URL: https://issues.apache.org/jira/browse/SOLR-4908
>             Project: Solr
>          Issue Type: Bug
>          Components: contrib - Solr Cell (Tika extraction)
>    Affects Versions: 4.3
>         Environment: Windows 7, 64bit, Solr 4.3 example
>            Reporter: Markus Schuch
>         Attachments: tika-test.html
>
>
> The SolrContentHandler produces glued words when extracting text
> from HTML documents like:
> {code}
> <html><head></head><body>glued<br/>words</body></html>
> {code}
> This was solved in Tika [TIKA-343], but the problem still occurs when using the 
> extraction handler because the SolrContentHandler discards 
> ignorableWhitespace events.
> The Tika XHTMLContentHandler issues ignorableWhitespace events with a newline 
> in the character stream when a <br> tag is encountered.
> The SolrContentHandler should be modified to add the ignorable whitespace to 
> the content.
> Reproduction Steps:
> # POST the HTML example file from the attachments to 
> http://localhost:8983/solr/update/extract?literal.id=html-test-1&commit=true 
> (e.g. with curl or HTTP Requester Plugin in Firefox)
> # Query for the document 
> http://localhost:8983/solr/collection1/select?q=id%3A%22html-test-1%22&fl=content&wt=xml&indent=true
> # Look for the field content, which contains the word "Shouldnotbeglued"
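
The reproduction steps above could be scripted roughly as follows. This is a sketch only: it assumes Java 11 or later (for java.net.http), a Solr 4.3 example instance listening on localhost:8983 as in the report, and a local copy of the attached tika-test.html. The class GluedWordsRepro and its method names are hypothetical and not part of Solr.

```java
import java.io.IOException;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;

// Hypothetical helper that mirrors the three reproduction steps from the issue.
class GluedWordsRepro {
    // Base URL taken from the issue text; adjust for a different installation.
    static final String BASE = "http://localhost:8983/solr";

    // Step 1 URL: extraction endpoint with a literal document id.
    static String extractUrl(String id) {
        return BASE + "/update/extract?literal.id=" + enc(id) + "&commit=true";
    }

    // Step 2 URL: select the document by id and return only the content field.
    static String selectUrl(String id) {
        return BASE + "/collection1/select?q=" + enc("id:\"" + id + "\"")
                + "&fl=content&wt=xml&indent=true";
    }

    private static String enc(String value) {
        return URLEncoder.encode(value, StandardCharsets.UTF_8);
    }

    // Runs steps 1 to 3 and reports whether the glued word appears in the response.
    static boolean reproduces(Path htmlFile, String id) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();

        HttpRequest post = HttpRequest.newBuilder(URI.create(extractUrl(id)))
                .header("Content-Type", "text/html")
                .POST(HttpRequest.BodyPublishers.ofFile(htmlFile))
                .build();
        client.send(post, HttpResponse.BodyHandlers.ofString());

        HttpRequest get = HttpRequest.newBuilder(URI.create(selectUrl(id))).GET().build();
        String body = client.send(get, HttpResponse.BodyHandlers.ofString()).body();

        // Step 3: the issue says the content field contains this glued word when the bug is present.
        return body.contains("Shouldnotbeglued");
    }
}
```

Calling GluedWordsRepro.reproduces(Path.of("tika-test.html"), "html-test-1") against an affected instance would be expected to return true; the file path here is an assumption about where the attachment was saved.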

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira