[ 
https://issues.apache.org/jira/browse/SOLR-5124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13734761#comment-13734761
 ] 

ASF subversion and git services commented on SOLR-5124:
-------------------------------------------------------

Commit 1512296 from [~thetaphi] in branch 'dev/trunk'
[ https://svn.apache.org/r1512296 ]

SOLR-4679, SOLR-4908, SOLR-5124: Text extracted from HTML or PDF files using 
Solr Cell was missing ignorable whitespace, which is inserted by TIKA for 
convenience to support plain text extraction without using the HTML elements. 
This bug resulted in glued words.
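
For illustration only (this is not the committed change): a minimal sketch of 
how dropping SAX ignorableWhitespace events glues words together. The names 
WhitespaceDemo, TextCollector and extract are hypothetical, and the event 
sequence is an assumed simplification of what Tika emits at a line break.

```java
import org.xml.sax.helpers.DefaultHandler;

public class WhitespaceDemo {

    /** Collects text from SAX events; optionally keeps ignorable whitespace. */
    static class TextCollector extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();
        private final boolean keepIgnorableWhitespace;

        TextCollector(boolean keepIgnorableWhitespace) {
            this.keepIgnorableWhitespace = keepIgnorableWhitespace;
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }

        @Override
        public void ignorableWhitespace(char[] ch, int start, int length) {
            // Dropping this event is what glues adjacent words together.
            if (keepIgnorableWhitespace) {
                text.append(ch, start, length);
            }
        }

        String text() {
            return text.toString();
        }
    }

    /** Replays an assumed event sequence: word, ignorable newline, word. */
    static String extract(boolean keepIgnorableWhitespace) {
        TextCollector handler = new TextCollector(keepIgnorableWhitespace);
        char[] last = "last".toCharArray();
        char[] newline = "\n".toCharArray();
        char[] first = "first".toCharArray();
        handler.characters(last, 0, last.length);
        handler.ignorableWhitespace(newline, 0, newline.length);
        handler.characters(first, 0, first.length);
        return handler.text();
    }

    public static void main(String[] args) {
        System.out.println(extract(false)); // words glued: "lastfirst"
        System.out.println(extract(true));  // newline kept between the words
    }
}
```

A handler that ignores the whitespace event yields "lastfirst", while one that 
forwards it keeps the two words apart.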
                
> Solr glues words when parsing PDFs under certain circumstances
> --------------------------------------------------------------
>
>                 Key: SOLR-5124
>                 URL: https://issues.apache.org/jira/browse/SOLR-5124
>             Project: Solr
>          Issue Type: Bug
>          Components: update
>    Affects Versions: 4.4
>         Environment: Windows 7 (don't think this is relevant)
>            Reporter: Christoph Straßer
>            Priority: Minor
>              Labels: tika,text-extraction
>         Attachments: 01_alz_2009_folge11_2009_05_28.pdf, 02_PDF.png, 
> 03_TikaOutput_GUI_MainContent.png, 03_TikaOutput_GUI_PlainText.png, 
> 03_TikaOutput_GUI_StructuredText.png, 03_TikaOutput.png, 04_Solr.png
>
>
> For some kinds of PDF documents, Solr glues words together at line breaks 
> under certain circumstances (e.g. the last word of line 1 and the first word 
> of line 2 are merged into one word).
> Stand-alone Tika extracts the text correctly. Attached you will find one 
> sample PDF and screenshots of the Tika output and of the corrupted content 
> indexed by Solr.
> (This issue does not occur with all PDF documents. I tried to recreate the 
> issue with new Word documents that I converted to PDF in multiple ways, 
> without success.) The attached PDF document has a really weird internal 
> structure, but Tika seems to do its work right even with this weird document.
> Our Solr indices contain a good number of these weird documents, which 
> results in worse suggestions from the Suggester.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
