[
https://issues.apache.org/jira/browse/PDFBOX-58?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Andreas Lehmkühler updated PDFBOX-58:
-------------------------------------
Attachment: polish.txt
polish2.txt
I'm attaching the results extracted with the current trunk (rev. 1003396). It
looks quite perfect, except for the arrows from the Wingdings font.
> Problems with text extraction from Polish documents.
> ----------------------------------------------------
>
> Key: PDFBOX-58
> URL: https://issues.apache.org/jira/browse/PDFBOX-58
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Attachments: polish.pdf, polish.txt, polish2.pdf, polish2.txt
>
>
> [imported from SourceForge]
> http://sourceforge.net/tracker/index.php?group_id=78314&atid=552832&aid=1196559
> Originally submitted by jstrychowski on 2005-05-06 06:07.
> Hello,
>
> Thanks for PDFBox. I use this cool tool to convert
> documents written in many languages. I have problems with
> some Polish documents. The Polish special characters are not
> converted correctly in some cases. I attached an example document
> (polish.pdf, generated by OpenOffice1.4) which illustrates
> the problem. The PDFTextStripper class reads the Polish characters,
> but in the wrong order in the "textList" list (see the flushText() method).
> These characters are placed at the end of each line of text. The
> x coordinates of the corresponding TextPosition elements are
> valid, so it is possible to reorder the elements in "textList". I
> wrote a correction to the PDFTextStripper class which solves the
> problem. The fixed version is attached. Similar
> problems could occur for other languages or different PDF
> documents.
>
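The reordering described above can be sketched as a stable sort of one line's glyphs by x coordinate. This is only an illustration, not the attached fix and not PDFBox's implementation: the ReorderSketch class, its Glyph type, and the sortByX and join helpers are hypothetical stand-ins, and real code would operate on TextPosition objects instead.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class ReorderSketch {
    /** Hypothetical stand-in for a positioned glyph; not PDFBox's TextPosition. */
    static final class Glyph {
        final String text;
        final float x;

        Glyph(String text, float x) {
            this.text = text;
            this.x = x;
        }
    }

    /**
     * Returns a copy of one text line's glyphs ordered left to right.
     * List.sort is stable, so glyphs sharing an x keep their stream order.
     */
    static List<Glyph> sortByX(List<Glyph> line) {
        List<Glyph> sorted = new ArrayList<>(line);
        sorted.sort(Comparator.comparingDouble(g -> g.x));
        return sorted;
    }

    /** Concatenates the glyph texts in list order. */
    static String join(List<Glyph> glyphs) {
        StringBuilder sb = new StringBuilder();
        for (Glyph g : glyphs) {
            sb.append(g.text);
        }
        return sb.toString();
    }
}
```

The sketch assumes the glyphs have already been grouped into a single line; sorting across lines would also need the y coordinate.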
> A second problem I encountered is also related to the Polish
> characters. In some documents the TextPosition objects
> corresponding to the Polish letters have their width set to 0.0. The x
> coordinates are valid in such situations, so the document is
> displayed properly, but there are some errors during text
> extraction. Some extra spaces occur within words. I
> eliminated this problem by increasing the word-space factor from
> 0.5f to 0.65f in the flushText method. This correction causes
> problems for some English documents (a few words may be
> joined), so I use this correction only for documents written in
> Polish. Is it possible to deal with this problem in a more
> sophisticated way? Code like - if (isPolish) {...} - is not very
> smart, I suppose :-). I attached an example document
> (polish2.pdf). My correction specialized for the Polish language
> is available in the attached PDFTextStripper class.
>
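One language-independent alternative to raising the word-space factor might be to substitute an estimated width for glyphs that report a width of 0.0 before measuring the gap to the next glyph. The sketch below is an assumption-laden illustration, not PDFBox code: WordGapSketch, its Glyph type, meanPositiveWidth, and joinWithSpaces are hypothetical names, and using the mean width of the other glyphs on the line is just one possible estimate.

```java
import java.util.List;

final class WordGapSketch {
    /** Hypothetical stand-in for a positioned glyph; not PDFBox's TextPosition. */
    static final class Glyph {
        final String text;
        final float x;
        final float width;

        Glyph(String text, float x, float width) {
            this.text = text;
            this.x = x;
            this.width = width;
        }
    }

    /** Mean width of the glyphs that report a positive width, or 0 if none do. */
    static float meanPositiveWidth(List<Glyph> line) {
        float sum = 0f;
        int count = 0;
        for (Glyph g : line) {
            if (g.width > 0f) {
                sum += g.width;
                count++;
            }
        }
        return count == 0 ? 0f : sum / count;
    }

    /**
     * Joins glyphs (already sorted by x), inserting a space wherever the gap
     * to the previous glyph's end exceeds factor * spaceWidth. When
     * estimateZeroWidths is true, zero-width glyphs borrow the line's mean width.
     */
    static String joinWithSpaces(List<Glyph> line, float spaceWidth,
                                 float factor, boolean estimateZeroWidths) {
        float fallback = estimateZeroWidths ? meanPositiveWidth(line) : 0f;
        StringBuilder sb = new StringBuilder();
        boolean first = true;
        float prevEnd = 0f;
        for (Glyph g : line) {
            if (!first && g.x - prevEnd > factor * spaceWidth) {
                sb.append(' ');
            }
            sb.append(g.text);
            float width = g.width > 0f ? g.width : fallback;
            prevEnd = g.x + width;
            first = false;
        }
        return sb.toString();
    }
}
```

With the estimate switched off, a zero-width glyph leaves prevEnd at its own x, so the next glyph appears to start a full character advance later and a spurious space is emitted, which matches the symptom reported above. With it switched on, the factor can stay at 0.5f for all languages.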
> Thanks for help!
> Jakub Strychowski
> [attachment on SourceForge]
> http://sourceforge.net/tracker/download.php?group_id=78314&atid=552832&aid=1196559&file_id=133255
> bug_polish_characters.tar.gz (application/x-tgz), 178409 bytes
> example documents and fixed class