[ https://issues.apache.org/jira/browse/PDFBOX-2053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13988602#comment-13988602 ]
Tilman Hausherr commented on PDFBOX-2053:
-----------------------------------------
This is very similar to PDFBOX-62, although the fix I proposed there doesn't
work here, for a reason that I don't know yet.
> Issue with PDFBox position reading
> ----------------------------------
>
> Key: PDFBOX-2053
> URL: https://issues.apache.org/jira/browse/PDFBOX-2053
> Project: PDFBox
> Issue Type: Bug
> Affects Versions: 1.8.3
> Reporter: Orbel Mkrtchyan
> Attachments: test.pdf
>
>
> Using PDFBox 1.8.4,
> bug #1:
> PDDocument doc = new PDDocument();
> doc.load("test-pcc7247.pdf");
> doc.save("out.pdf");
> doc.close();
> The resulting file is corrupted, contains 0 pages and cannot be viewed by
> Acrobat Reader.
> bug #2: consider the following code snippet. The code runs like this:
> Extractor extractor = new Extractor();
> extractor.writeText(pdDoc, output);
> Using the code defined like this:
> public class Extractor extends PDFTextStripper {
>     ...
>     protected void writePage() throws IOException
>     {
>         for( int i = 0; i < charactersByArticle.size(); i++)
>         {
>             List<TextPosition> textList = charactersByArticle.get( i );
>             Iterator textIter = textList.iterator();
>             while( textIter.hasNext() )
>             {
>                 TextPosition position = (TextPosition)textIter.next();
> In the given piece of code, the position variable correctly iterates through the
> letters of the first line of the provided PDF document, but its coordinates
> (x, y, widths, etc.) are always the same. To be clear, each position always
> relates to one letter, and its widths array always has length 1. So we get
> the same coordinates for every letter in a line. The expected behaviour is either
> new coordinates per letter, or a widths[] array containing the widths of all the
> characters in a whole line of text.
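
On bug #1, a possible explanation, offered as a sketch rather than a confirmed diagnosis: in PDFBox 1.8.x, PDDocument.load(String) is a static factory method that returns the loaded document. Java allows a static method to be called through an instance reference, so doc.load(...) compiles, but the returned document is discarded and doc remains the empty document created by new PDDocument(). That would be consistent with a saved file containing 0 pages. The self-contained example below reproduces the pattern with a hypothetical stand-in class (FakeDocument and its methods are illustrative only and are not part of PDFBox):

```java
import java.util.ArrayList;
import java.util.List;

public class StaticLoadDemo {

    /** Hypothetical stand-in for a document class whose load() is a static factory. */
    static class FakeDocument {
        private final List<String> pages = new ArrayList<>();

        /** Static factory: builds and returns a NEW document with one page. */
        static FakeDocument load(String name) {
            FakeDocument loaded = new FakeDocument();
            loaded.pages.add("page 1 of " + name);
            return loaded;
        }

        int getNumberOfPages() {
            return pages.size();
        }
    }

    /** Mirrors the reported pattern: the loaded document is thrown away. */
    static int pagesWhenResultDiscarded() {
        FakeDocument doc = new FakeDocument();
        // Compiles, but this is a static call; its return value is ignored
        // and 'doc' is left untouched (still empty).
        doc.load("test.pdf");
        return doc.getNumberOfPages();
    }

    /** Keeps the document returned by the static factory. */
    static int pagesWhenResultKept() {
        FakeDocument doc = FakeDocument.load("test.pdf");
        return doc.getNumberOfPages();
    }

    public static void main(String[] args) {
        System.out.println("discarded: " + pagesWhenResultDiscarded());
        System.out.println("kept:      " + pagesWhenResultKept());
    }
}
```

If this is indeed the cause, the equivalent change against the real 1.8.x API would be to assign the result of the static call, i.e. PDDocument doc = PDDocument.load("test-pcc7247.pdf"); and then call save and close on that instance.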