[jira] Commented: (PDFBOX-770) Greek text extraction

JIRA Tue, 08 Mar 2011 08:38:24 -0800

    [ https://issues.apache.org/jira/browse/PDFBOX-770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13004045#comment-13004045 ]


Andreas Lehmkühler commented on PDFBOX-770:
-------------------------------------------

You should upgrade to the newest version of PDFBox, 1.5.0 [1]; see the attached 
result.


[1] http://pdfbox.apache.org/download.html

> Greek text extraction
> ---------------------
>
>                 Key: PDFBOX-770
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-770
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.2.0, 1.2.1, 1.3.1
>         Environment: Ubuntu 10.04
>            Reporter: Manos Karampasis
>             Fix For: 1.5.0
>
>         Attachments: 3842.html, 3842.pdf, PDFBOX770-3842.txt
>
>
> Greek text extraction error
> I have a Greek PDF, but:
> a) After extraction, the Greek letter π is extracted as "pi".
> For example:
> original text in the PDF:
> "φυσικών προσώπων"
> extracted text:
> "φυσικών piροσώpiων"
> b) The Greek letter μ is extracted as µ.
> The two look the same on screen but are encoded differently, so searching 
> for a typed μ does not find it (it only matches the uppercase Μ).
> Copying the character as displayed and searching for that works fine.
> E.g. the word is displayed as "κλίµακας", but it differs from the typed 
> word κλίμακα because of the letter μ.
> Because of this problem, Solr does not index the documents correctly.
> Is there any configuration I can make?
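
For anyone who cannot upgrade right away, item b) in the report can be worked around after extraction. The character in the extracted text appears to be U+00B5 (MICRO SIGN), not U+03BC (GREEK SMALL LETTER MU), and Unicode NFKC normalization folds the former into the latter. The following is only a minimal sketch using the JDK's java.text.Normalizer; the class and method names are hypothetical, it is not part of PDFBox, and it is not the fix that went into 1.5.0. It does not help with item a), because a literal "pi" cannot be safely told apart from legitimate Latin text.

```java
import java.text.Normalizer;

// Hypothetical post-processing helper; not part of PDFBox.
public class MuNormalizer {

    // NFKC folds compatibility characters into their canonical
    // equivalents, including U+00B5 (MICRO SIGN) -> U+03BC (Greek mu).
    public static String normalizeExtracted(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFKC);
    }

    // Narrower alternative: touch only the micro sign and nothing else.
    public static String replaceMicroSign(String text) {
        return text.replace('\u00B5', '\u03BC');
    }

    public static void main(String[] args) {
        // The word from the report as extracted (contains U+00B5) ...
        String extracted = "\u03BA\u03BB\u03AF\u00B5\u03B1\u03BA\u03B1\u03C2";
        // ... and the same word as typed on a Greek keyboard (U+03BC).
        String typed = "\u03BA\u03BB\u03AF\u03BC\u03B1\u03BA\u03B1\u03C2";

        System.out.println(extracted.equals(typed));                     // false
        System.out.println(normalizeExtracted(extracted).equals(typed)); // true
        System.out.println(replaceMicroSign(extracted).equals(typed));   // true
    }
}
```

NFKC also rewrites other compatibility characters (ligatures, full-width forms, and so on), so the single-character replacement may be the safer choice if the rest of the text must stay byte-for-byte unchanged before it is sent to Solr.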

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
