Orbel Mkrtchyan created PDFBOX-2053:
---------------------------------------
Summary: Issue with PDFBox position reading
Key: PDFBOX-2053
URL: https://issues.apache.org/jira/browse/PDFBOX-2053
Project: PDFBox
Issue Type: Bug
Affects Versions: 1.8.3
Reporter: Orbel Mkrtchyan
Using PDFBox 1.8.4, I see two problems.
Bug #1:
    PDDocument doc = new PDDocument();
    doc.load("test-pcc7247.pdf");
    doc.save("out.pdf");
    doc.close();
The resulting file is corrupted, contains 0 pages and cannot be viewed by
Acrobat Reader.
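One possible explanation for bug #1, stated as an assumption rather than a confirmed diagnosis: if PDDocument.load(String) is a static factory method in 1.8.x that returns a new document, then calling it through an instance compiles but discards the loaded document, and the empty instance is what gets saved. The sketch below reproduces that pattern without PDFBox. FakeDocument and its 3-page result are hypothetical stand-ins, not PDFBox classes.

```java
// Hypothetical stand-in for a document class whose load() is a static factory.
// This is NOT the PDFBox API; it only mirrors the call pattern from bug #1.
class FakeDocument {
    private final int pageCount;

    FakeDocument() {
        this(0); // a freshly constructed document has no pages
    }

    private FakeDocument(int pageCount) {
        this.pageCount = pageCount;
    }

    // Static factory: returns a NEW document and never touches an existing one.
    static FakeDocument load(String fileName) {
        return new FakeDocument(3); // pretend the file contains 3 pages
    }

    int getNumberOfPages() {
        return pageCount;
    }
}

class StaticLoadPitfall {
    // Same shape as the snippet in bug #1: the result of load() is dropped.
    static int pagesViaInstance() {
        FakeDocument doc = new FakeDocument();
        doc.load("test.pdf"); // legal Java, but the returned document is ignored
        return doc.getNumberOfPages();
    }

    // Keeps the document returned by the static factory.
    static int pagesViaFactory() {
        FakeDocument doc = FakeDocument.load("test.pdf");
        return doc.getNumberOfPages();
    }

    public static void main(String[] args) {
        System.out.println(pagesViaInstance()); // prints 0
        System.out.println(pagesViaFactory());  // prints 3
    }
}
```

If that assumption holds for PDFBox 1.8.x, assigning the result of the static call (PDDocument doc = PDDocument.load("test-pcc7247.pdf");) would be the variant worth testing before treating the 0-page output as a library defect.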
Bug #2: consider the following code snippet. The extractor is invoked like this:
    Extractor extractor = new Extractor();
    extractor.writeText(pdDoc, output);
The Extractor class is defined like this:
    public class Extractor extends PDFTextStripper {
        ...
        protected void writePage() throws IOException
        {
            for( int i = 0; i < charactersByArticle.size(); i++ )
            {
                List<TextPosition> textList = charactersByArticle.get( i );
                Iterator<TextPosition> textIter = textList.iterator();
                while( textIter.hasNext() )
                {
                    // Each position is expected to describe exactly one letter.
                    TextPosition position = textIter.next();
                    ...
                }
            }
        }
    }
In this code, the position variable correctly iterates through the letters of
the first line of the provided PDF document, but its coordinates (x, y, widths,
etc.) are always the same. To be clear, each position corresponds to exactly one
letter, and the length of its widths array is always 1, so every letter in a
line reports the same coordinates. The expected behaviour is either new
coordinates for each letter, or a widths[] array that contains the widths of
all the characters in the whole line of text.
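To make the expectation concrete, here is a minimal sketch of how per-letter x coordinates could be derived if the line's starting x and a full widths[] array were available. It does not use PDFBox. LetterPositions, letterStartXs and the sample values are hypothetical, and it assumes letters are laid out left to right with no extra character spacing.

```java
import java.util.Arrays;

class LetterPositions {
    // Returns the starting x coordinate of each letter, given the x coordinate
    // where the line starts and the width of every letter in that line.
    static float[] letterStartXs(float lineStartX, float[] widths) {
        float[] xs = new float[widths.length];
        float x = lineStartX;
        for (int i = 0; i < widths.length; i++) {
            xs[i] = x;      // this letter starts where the previous one ended
            x += widths[i]; // advance by the width of this letter
        }
        return xs;
    }

    public static void main(String[] args) {
        float[] widths = {5.0f, 2.5f, 4.0f}; // made-up widths for three letters
        System.out.println(Arrays.toString(letterStartXs(72.0f, widths)));
        // prints [72.0, 77.0, 79.5]
    }
}
```

With distinct coordinates per letter this computation would be unnecessary. With identical coordinates and a one-element widths array, as observed, it cannot be done at all.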