[
https://issues.apache.org/jira/browse/PDFBOX-770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Manos Karampasis updated PDFBOX-770:
------------------------------------
Description:
Greek text extraction error
I have a Greek PDF, and text extraction produces two problems:
a) The Greek letter π is extracted as the Latin string "pi".
For example:
original text in the PDF
"φυσικών προσώπων"
extracted text
"φυσικών piροσώpiων"
b) The Greek letter μ is extracted as µ.
The two characters look identical but have different encodings, so a search
for μ does not find the extracted character (it matches only the uppercase Μ).
Searching for µ copied from the extracted text works fine.
E.g. the extracted word is displayed as "κλίµακας", but it differs from the typed
word κλίμακα because of the letter μ.
Because of these problems, Solr is not indexing the documents correctly.
Is there any configuration I can make?
was:
Greek text extraction error
I have a Greek PDF, but after extraction the Greek letter π is extracted as the
Latin string "pi".
For example:
original text in the PDF
"φυσικών προσώπων"
extracted text
"φυσικών piροσώpiων"
Because of this problem, Solr is not indexing the documents correctly.
Is there any configuration I can make?
Fix Version/s: 1.3.0
> Greek text extraction
> ---------------------
>
> Key: PDFBOX-770
> URL: https://issues.apache.org/jira/browse/PDFBOX-770
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 1.2.0
> Environment: Ubuntu 10.04
> Reporter: Manos Karampasis
> Fix For: 1.3.0
>
> Attachments: 3842.html, 3842.pdf
>
>
> Greek text extraction error
> I have a Greek PDF, and text extraction produces two problems:
> a) The Greek letter π is extracted as the Latin string "pi".
> For example:
> original text in the PDF
> "φυσικών προσώπων"
> extracted text
> "φυσικών piροσώpiων"
> b) The Greek letter μ is extracted as µ.
> The two characters look identical but have different encodings, so a search
> for μ does not find the extracted character (it matches only the uppercase Μ).
> Searching for µ copied from the extracted text works fine.
> E.g. the extracted word is displayed as "κλίµακας", but it differs from the typed
> word κλίμακα because of the letter μ.
> Because of these problems, Solr is not indexing the documents correctly.
> Is there any configuration I can make?
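
A possible workaround for problem (b), sketched under assumptions: the extracted
character appears to be the MICRO SIGN (U+00B5), while the typed letter is
GREEK SMALL LETTER MU (U+03BC). Unicode NFKC normalization maps the former to
the latter, so normalizing the extracted text before indexing (and the query
text before searching) should make the two compare equal. The class and method
names below are hypothetical, and this is not a PDFBox setting or the fix
planned for this issue. It also does nothing for problem (a), the "pi"
substitution.

```java
import java.text.Normalizer;

public class GreekTextCleanup {

    /**
     * Applies Unicode NFKC normalization to extracted text.
     * Among other compatibility mappings, this turns the micro sign
     * (U+00B5) into the Greek small letter mu (U+03BC).
     */
    public static String normalizeExtracted(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // The example word from the report, written with escapes so the
        // two look-alike characters are unambiguous.
        // Extracted form: fourth character is the micro sign, U+00B5.
        String extracted = "\u03BA\u03BB\u03AF\u00B5\u03B1\u03BA\u03B1\u03C2";
        // Typed form: fourth character is Greek small mu, U+03BC.
        String typed = "\u03BA\u03BB\u03AF\u03BC\u03B1\u03BA\u03B1\u03C2";

        System.out.println(extracted.equals(typed));                     // false
        System.out.println(normalizeExtracted(extracted).equals(typed)); // true
    }
}
```

Note that NFKC rewrites other compatibility characters as well (ligatures,
full-width forms, and so on), so it is only appropriate where that broader
folding is acceptable for the index.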