[jira] [Created] (PDFBOX-2053) Issue with PDFBox position reading

Orbel Mkrtchyan (JIRA) Fri, 02 May 2014 02:39:26 -0700

Orbel Mkrtchyan created PDFBOX-2053:
---------------------------------------


             Summary: Issue with PDFBox position reading
                 Key: PDFBOX-2053
                 URL: https://issues.apache.org/jira/browse/PDFBOX-2053
             Project: PDFBox
          Issue Type: Bug
    Affects Versions: 1.8.3
            Reporter: Orbel Mkrtchyan


Using PDFBox 1.8.4.

Bug #1: loading a document and saving it again:
                PDDocument doc = new PDDocument();
                doc.load("test-pcc7247.pdf");
                doc.save("out.pdf");
                doc.close();

The resulting file is corrupted, contains 0 pages and cannot be viewed by 
Acrobat Reader.
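
One thing worth ruling out for bug #1: in the PDFBox 1.8.x API, PDDocument.load(String) is a static method that returns a new document, so calling it on an existing instance leaves that instance empty. The sketch below reproduces the effect with a hypothetical stand-in class (FakeDocument and StaticLoadPitfall are illustrative names, not part of PDFBox). It is only a possible explanation, not a confirmed diagnosis.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a document class; this is NOT the PDFBox API.
class FakeDocument {
    private final List<String> pages = new ArrayList<>();

    // Static factory: builds and returns a NEW document with one page.
    static FakeDocument load(String name) {
        FakeDocument loaded = new FakeDocument();
        loaded.pages.add("page 1 of " + name);
        return loaded;
    }

    int pageCount() {
        return pages.size();
    }
}

class StaticLoadPitfall {
    // Same shape as the snippet in the report: the value returned by
    // load() is discarded, so 'doc' stays the empty document.
    static int pagesWhenCalledOnInstance() {
        FakeDocument doc = new FakeDocument();
        doc.load("test.pdf"); // legal Java, but the loaded document is thrown away
        return doc.pageCount();
    }

    // Keeping the returned document gives the loaded content.
    static int pagesWhenResultIsKept() {
        FakeDocument doc = FakeDocument.load("test.pdf");
        return doc.pageCount();
    }

    public static void main(String[] args) {
        System.out.println(pagesWhenCalledOnInstance()); // prints 0
        System.out.println(pagesWhenResultIsKept());     // prints 1
    }
}
```

If this is what happens here, the form to try would be PDDocument doc = PDDocument.load("test-pcc7247.pdf"); followed by the same save and close calls.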


Bug #2: every letter in a line reports the same position. The extractor is
invoked like this:
      Extractor extractor = new Extractor();
      extractor.writeText(pdDoc, output);

The Extractor class is defined like this:

public class Extractor extends PDFTextStripper {
...
    protected void writePage() throws IOException
    {
        for( int i = 0; i < charactersByArticle.size(); i++)
        {
            List<TextPosition> textList = charactersByArticle.get( i );
            Iterator<TextPosition> textIter = textList.iterator();
            while( textIter.hasNext() )
            {
                TextPosition position = textIter.next();

In this code, the position variable correctly iterates through the letters of 
the first line of the provided PDF document, but its coordinates (x, y, widths, 
etc.) are always the same. To be clear, each position corresponds to exactly 
one letter, and the length of its widths array is always 1, so every letter in 
a line gets the same coordinates. The expected behaviour is either new 
coordinates for each letter, or a widths[] array that contains the widths of 
all the characters in the line.
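
To make the second expected behaviour concrete: if a line reported one start coordinate plus a widths[] array covering all of its characters, per-letter x coordinates could be recovered with a running sum. This is a minimal sketch under that assumption. GlyphOffsets and letterXs are hypothetical names, the inputs are made-up numbers, and character or word spacing is ignored.

```java
import java.util.Arrays;

// Hypothetical helper; not part of PDFBox.
class GlyphOffsets {
    // startX: x coordinate of the first letter in the line.
    // widths: one width per letter, in reading order.
    // Returns the x coordinate at which each letter starts.
    static float[] letterXs(float startX, float[] widths) {
        float[] xs = new float[widths.length];
        float x = startX;
        for (int i = 0; i < widths.length; i++) {
            xs[i] = x;      // this letter starts where the previous one ended
            x += widths[i]; // advance by this letter's width
        }
        return xs;
    }

    public static void main(String[] args) {
        float[] xs = letterXs(10f, new float[] {5f, 2.5f, 4f});
        System.out.println(Arrays.toString(xs)); // prints [10.0, 15.0, 17.5]
    }
}
```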



--
This message was sent by Atlassian JIRA
(v6.2#6252)
