[jira] [Commented] (PDFBOX-2053) Issue with PDFBox position reading

Tilman Hausherr (JIRA) Fri, 02 May 2014 23:52:25 -0700

    [ 
https://issues.apache.org/jira/browse/PDFBOX-2053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13988602#comment-13988602
 ]


Tilman Hausherr commented on PDFBOX-2053:
-----------------------------------------

This is very similar to PDFBOX-62, although the fix I proposed there doesn't 
work here, for a reason that I don't know yet.

> Issue with PDFBox position reading
> ----------------------------------
>
>                 Key: PDFBOX-2053
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2053
>             Project: PDFBox
>          Issue Type: Bug
>    Affects Versions: 1.8.3
>            Reporter: Orbel Mkrtchyan
>         Attachments: test.pdf
>
>
> Using PDFBox 1.8.4,
> bug #1:
>               PDDocument doc = new PDDocument();
>               doc.load("test-pcc7247.pdf");
>               doc.save("out.pdf");
>               doc.close();
> The resulting file is corrupted, contains 0 pages and cannot be viewed by 
> Acrobat Reader.
> bug #2: consider the following code snippet. The code runs like this:
>       Extractor extractor = new Extractor();
>       extractor.writeText(pdDoc, output);
> Using the code defined like this:
> public class Extractor extends PDFTextStripper {
> ...
>     protected void writePage() throws IOException
>     {
>         for( int i = 0; i < charactersByArticle.size(); i++)
>         {
>             List<TextPosition> textList = charactersByArticle.get( i );
>             Iterator textIter = textList.iterator();
>             while( textIter.hasNext() )
>             {
>                 TextPosition position = (TextPosition)textIter.next();
> In the given piece of code, the position variable correctly iterates through 
> the letters of the first line of the provided PDF document, but its 
> coordinates (x, y, widths, etc.) are always the same. To be clear, each 
> position always relates to exactly one letter, and its widths array always 
> has length 1, so we get the same coordinates for every letter in a line. The 
> expected behaviour is either new coordinates per letter, or a widths[] array 
> that contains the widths of all characters in the whole line of text.
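
A possible reading of bug #1 in the report quoted above, offered only as a sketch: in PDFBox 1.8.x, PDDocument.load(String) is, as far as the API documentation describes it, a static factory that returns a new document. If that holds, calling it on an existing instance and ignoring the return value would leave the freshly constructed, empty document untouched, which would match the 0-page output. The Doc class below is a hypothetical stand-in written for this illustration, not the PDFBox API, so the example runs without any library.

```java
import java.util.ArrayList;
import java.util.List;

public class StaticLoadPitfall {

    // Hypothetical stand-in for a document class; NOT the PDFBox API.
    static class Doc {
        private final List<String> pages = new ArrayList<>();

        // Static factory: builds and returns a NEW document and does not
        // modify any existing instance.
        static Doc load(String name) {
            Doc loaded = new Doc();
            loaded.pages.add("page 1 of " + name);
            return loaded;
        }

        int pageCount() {
            return pages.size();
        }
    }

    // Mirrors the pattern in the report: load() is invoked through an
    // instance, and the document it returns is discarded.
    static int pagesWhenResultDiscarded() {
        Doc doc = new Doc();
        doc.load("test.pdf"); // compiles, but resolves to the static method
        return doc.pageCount();
    }

    // Keeps the document returned by the static factory.
    static int pagesWhenResultKept() {
        Doc doc = Doc.load("test.pdf");
        return doc.pageCount();
    }

    public static void main(String[] args) {
        System.out.println(pagesWhenResultDiscarded()); // prints 0
        System.out.println(pagesWhenResultKept());      // prints 1
    }
}
```

Under the same assumption, the equivalent change in the reported snippet would be to assign the result, along the lines of PDDocument doc = PDDocument.load("test-pcc7247.pdf"), before calling save and close. This does not address bug #2.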



--
This message was sent by Atlassian JIRA
(v6.2#6252)
