[jira] Updated: (PDFBOX-58) Problems with text extraction from Polish documents.

JIRA Sat, 02 Oct 2010 06:52:02 -0700

     [ 
https://issues.apache.org/jira/browse/PDFBOX-58?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Andreas Lehmkühler updated PDFBOX-58:
-------------------------------------

    Attachment: polish.txt
                polish2.txt

I'm attaching the results extracted with the current trunk (rev. 1003396). It 
looks quite perfect except for the arrows from the Wingdings font.

> Problems with text extraction from Polish documents.
> ----------------------------------------------------
>
>                 Key: PDFBOX-58
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-58
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>         Attachments: polish.pdf, polish.txt, polish2.pdf, polish2.txt
>
>
> [imported from SourceForge]
> http://sourceforge.net/tracker/index.php?group_id=78314&atid=552832&aid=1196559
> Originally submitted by jstrychowski on 2005-05-06 06:07.
> Hello, 
>  
> Thanks for PDFBox. I use this cool tool to convert 
> documents written in many languages. I have problems with 
> some Polish documents: the Polish special characters are not 
> converted correctly in some cases. I attached an example document 
> (polish.pdf, generated by OpenOffice 1.4) which illustrates 
> the problem. The PDFTextStripper class reads the Polish characters 
> but puts them in the wrong order in the "textList" list (see the 
> flushText() method): these characters are placed at the end of 
> each line of text. The x coordinates of the corresponding 
> TextPosition elements are valid, so it is possible to reorder the 
> elements in "textList". I wrote a correction to the 
> PDFTextStripper class which solves the problem. The fixed version 
> is attached. Similar problems may occur for other languages or 
> different PDF documents. 
>  
> A second problem I met is also related to the Polish 
> characters. In some documents the TextPosition objects 
> corresponding to the Polish letters have their width set to 0.0. 
> The x coordinates are valid in such situations, so the document is 
> displayed properly, but there are some errors during text 
> extraction: extra spaces occur within words. I 
> eliminated this problem by increasing the word-space factor from 
> 0.5f to 0.65f in the flushText method. This correction causes 
> problems for some English documents (a few words may be 
> joined), so I use it only for documents written in 
> Polish. Is it possible to deal with this problem in a more 
> sophisticated way? Code like - if (isPolish) {...} - is not so 
> smart, I suppose :-). I attached an example document 
> (polish2.pdf). My correction specialized for the Polish language 
> is available in the attached PDFTextStripper class. 
>  
> Thanks for help! 
> Jakub Strychowski 
> [attachment on SourceForge]
> http://sourceforge.net/tracker/download.php?group_id=78314&atid=552832&aid=1196559&file_id=133255
> bug_polish_characters.tar.gz (application/x-tgz), 178409 bytes
> example documents and fixed class
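
For readers following the report above, here is a minimal sketch of the
two ideas it describes: reordering the glyphs of a line by their x
coordinate, and deciding where a word break falls when some glyphs report
a width of 0.0. It is not the attached fix and not the PDFBox
implementation. The Glyph class, the extractLine method and the
average-width fallback are all assumptions made for illustration; the
report itself only raised the word-space factor from 0.5f to 0.65f.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class TextLineSketch {

    /** Hypothetical stand-in for a positioned glyph; NOT the PDFBox TextPosition API. */
    public static final class Glyph {
        final String text;
        final float x;      // left edge of the glyph on the line
        final float width;  // reported width; may be 0.0 as in the report

        public Glyph(String text, float x, float width) {
            this.text = text;
            this.x = x;
            this.width = width;
        }
    }

    /**
     * Rebuilds one line of text from glyphs that may arrive out of order.
     * wordSpaceFactor plays the role of the 0.5f factor mentioned in the report.
     */
    public static String extractLine(List<Glyph> glyphs, float wordSpaceFactor) {
        // Step 1: restore reading order using the x coordinates, which the
        // report says are valid even when the list order is wrong.
        List<Glyph> sorted = new ArrayList<>(glyphs);
        sorted.sort(Comparator.comparingDouble((Glyph g) -> g.x));

        // Step 2: average width over glyphs that report a usable width.
        float sum = 0f;
        int count = 0;
        for (Glyph g : sorted) {
            if (g.width > 0f) {
                sum += g.width;
                count++;
            }
        }
        float average = count > 0 ? sum / count : 0f;

        // Step 3: insert a space only where the gap to the previous glyph is
        // large. A zero-width glyph borrows the average width (an assumption
        // of this sketch), so it does not produce a spurious gap.
        StringBuilder line = new StringBuilder();
        Glyph previous = null;
        for (Glyph g : sorted) {
            if (previous != null && average > 0f) {
                float previousWidth = previous.width > 0f ? previous.width : average;
                float gap = g.x - (previous.x + previousWidth);
                if (gap > wordSpaceFactor * average) {
                    line.append(' ');
                }
            }
            line.append(g.text);
            previous = g;
        }
        return line.toString();
    }
}
```

Substituting an estimated width for zero-width glyphs avoids a
language-specific switch such as if (isPolish), at the cost of being
wrong for fonts whose glyph widths vary a lot. Whether that trade-off
holds for real documents would need checking against polish2.pdf.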

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
