[jira] Updated: (PDFBOX-770) Greek text extraction

Manos Karampasis (JIRA) Sun, 04 Jul 2010 11:42:46 -0700

     [ 
https://issues.apache.org/jira/browse/PDFBOX-770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Manos Karampasis updated PDFBOX-770:
------------------------------------

    Attachment: 3842.pdf
                3842.html

I have posted the original file and the extraction output in HTML format.

> Greek text extraction
> ---------------------
>
>                 Key: PDFBOX-770
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-770
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.2.0
>         Environment: Ubuntu 10.04
>            Reporter: Manos Karampasis
>         Attachments: 3842.html, 3842.pdf
>
>
> Greek text extraction error
> I have a Greek PDF, but after extraction the Greek letter π is extracted as "pi".
> For example:
> Original text in the PDF:
> "φυσικών προσώπων"
> Extracted text:
> "φυσικών piροσώpiων"
> Because of this problem, Solr is not indexing the documents correctly.
> Is there any configuration I can make?
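
Until the extraction itself is fixed, one possible stopgap is to repair the text after extraction and before it is handed to Solr. The sketch below is not a PDFBox setting or fix. It assumes the substitution always shows up as the literal Latin string "pi" directly next to a Greek letter, as in the example above. The class name GreekPiFix and the method restorePi are hypothetical names chosen for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical post-processing helper; not part of PDFBox or Solr.
class GreekPiFix {
    // Greek and Coptic block plus Greek Extended (covers accented letters such as ώ).
    private static final String GREEK = "[\\u0370-\\u03FF\\u1F00-\\u1FFF]";

    // A literal "pi" that directly follows or directly precedes a Greek letter.
    // Assumption: such a "pi" is always a misextracted π.
    private static final Pattern MISREAD_PI =
            Pattern.compile("(?<=" + GREEK + ")pi|pi(?=" + GREEK + ")");

    // Replaces each suspected misextracted "pi" with the Greek small letter pi (U+03C0).
    static String restorePi(String extracted) {
        return MISREAD_PI.matcher(extracted).replaceAll("\u03C0");
    }

    public static void main(String[] args) {
        // The extracted text from the report: "φυσικών piροσώpiων"
        String extracted =
                "\u03C6\u03C5\u03C3\u03B9\u03BA\u03CE\u03BD pi\u03C1\u03BF\u03C3\u03CEpi\u03C9\u03BD";
        System.out.println(restorePi(extracted));
    }
}
```

Latin words such as "pipeline" are left alone because no Greek letter is adjacent to the "pi". The sketch does not handle a capital Π or a "pi" that stands on its own between spaces, since the report does not show how those are extracted.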

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
