[
https://issues.apache.org/jira/browse/SOLR-5124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13734463#comment-13734463
]
Christoph Straßer commented on SOLR-5124:
-----------------------------------------
@Jack: No issue with odd Unicode characters. (Fiddler raw view: screenshot of
extractOnly=true attached.)
@Uwe: Big thanks for taking care of this issue! :-)
> Solr glues words when parsing PDFs under certain circumstances
> --------------------------------------------------------------
>
> Key: SOLR-5124
> URL: https://issues.apache.org/jira/browse/SOLR-5124
> Project: Solr
> Issue Type: Bug
> Components: update
> Affects Versions: 4.4
> Environment: Windows 7 (I don't think this is relevant)
> Reporter: Christoph Straßer
> Priority: Minor
> Labels: tika,text-extraction
> Attachments: 01_alz_2009_folge11_2009_05_28.pdf, 02_PDF.png,
> 03_TikaOutput_GUI_MainContent.png, 03_TikaOutput_GUI_PlainText.png,
> 03_TikaOutput_GUI_StructuredText.png, 03_TikaOutput.png, 04_Solr.png
>
>
> For some kinds of PDF documents, Solr glues words together at line breaks
> under some circumstances (e.g. the last word of line 1 and the first word of
> line 2 are merged into one word).
> Stand-alone Tika extracts the text correctly. Attached you will find one
> sample PDF and screenshots of the Tika output and of the corrupted content
> indexed by Solr.
> (This issue does not occur with all PDF documents. I tried to recreate the
> issue with new Word documents that I converted to PDF in multiple ways,
> without success.) The attached PDF document has a really weird internal
> structure, but Tika seems to do its work right, even with this weird document.
> In our Solr indices we have a good number of these weird documents. This
> results in worse suggestions from the Suggester.
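
The symptom described above (the last word of one line fused with the first word of the next) can be spotted mechanically by comparing two extractions of the same document. The following is only a minimal sketch: the function name `find_glued_words` and the sample strings are hypothetical and are not taken from the attached PDF, and it assumes both texts are already available as plain strings (for example, one from stand-alone Tika and one from Solr).

```python
def find_glued_words(reference_text, indexed_text):
    """Return (left, right, glued) triples where two adjacent tokens of the
    reference text show up in the indexed text as one concatenated token."""
    ref_tokens = reference_text.split()      # split on any whitespace, incl. newlines
    ref_vocab = set(ref_tokens)
    indexed_tokens = set(indexed_text.split())

    glued = []
    for left, right in zip(ref_tokens, ref_tokens[1:]):
        candidate = left + right
        # Skip candidates that are genuine words in the reference text itself.
        if candidate in indexed_tokens and candidate not in ref_vocab:
            glued.append((left, right, candidate))
    return glued


# Made-up sample: the line break between "brown" and "fox" is lost.
tika_text = "the quick brown\nfox jumps over"
solr_text = "the quick brownfox jumps over"
print(find_glued_words(tika_text, solr_text))
```

A non-empty result points at word pairs worth checking by hand. The comparison is purely token-based, so it will miss cases where the two extractions differ in other ways (hyphenation, casing, punctuation).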