[ https://issues.apache.org/jira/browse/PDFBOX-2048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13984050#comment-13984050 ]
Tilman Hausherr commented on PDFBOX-2048:
-----------------------------------------
Change committed to the trunk in rev 1590873, and to the 1.8 branch in rev 1590874.
Jonas, you can find a new jar file at
https://repository.apache.org/content/groups/snapshots/org/apache/pdfbox/pdfbox-war/1.8.6-SNAPSHOT/
within a few hours. However, it will be a few months before this is
officially released.
I will set the issue to resolved after the release of 1.8.5 (which will not
include this change, because the cut was already done).
> TextExtraction only working after uncompressing with pdftk
> ----------------------------------------------------------
>
> Key: PDFBOX-2048
> URL: https://issues.apache.org/jira/browse/PDFBOX-2048
> Project: PDFBox
> Issue Type: Bug
> Components: Parsing, Rendering, Text extraction
> Affects Versions: 2.0.0
> Reporter: Tilman Hausherr
> Assignee: Tilman Hausherr
>
> From Jonas Karlsson on the user list:
> ===
> We have a user with PDFs generated by a commercial transcription service.
> When we try to extract text from these pdfs, pdfbox returns a few empty
> lines. We get this result both from our own code, and when using the
> ExtractText command line tool.
> If I specify the non-sequential parser, with the -nonSeq flag, the
> following error is produced:
> Apr 28, 2014 10:35:11 AM org.apache.pdfbox.pdfparser.NonSequentialPDFParser
> validateStreamLength
> SEVERE: The end of the stream doesn't point to the correct offset, using
> workaround to read the stream
> If I uncompress the file with pdftk, pdfbox is able to successfully extract
> the text.
> ===
> I have been given permission to attach the file "committers only". So don't
> pass it around, and avoid quoting details from the file. The file also does
> not render. The lengths of the streams are 0.
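
For readers unfamiliar with the failure mode: when a stream dictionary
declares a /Length of 0, a parser that trusts that value reads no content,
which would explain the empty output. The following is a minimal sketch of
the general recovery idea (ignore the declared length and scan for the
"endstream" keyword). It is not the code committed for this issue, and the
class and method names (StreamLengthRecovery, recoverLength) are hypothetical.
It also assumes the payload itself does not contain the bytes "endstream",
which is not guaranteed for binary data.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only; not part of PDFBox.
final class StreamLengthRecovery {
    private static final byte[] STREAM =
            "stream".getBytes(StandardCharsets.ISO_8859_1);
    private static final byte[] ENDSTREAM =
            "endstream".getBytes(StandardCharsets.ISO_8859_1);

    // Naive byte-pattern search; returns the first match at or after 'from', or -1.
    static int indexOf(byte[] data, byte[] pattern, int from) {
        outer:
        for (int i = Math.max(from, 0); i <= data.length - pattern.length; i++) {
            for (int j = 0; j < pattern.length; j++) {
                if (data[i + j] != pattern[j]) {
                    continue outer;
                }
            }
            return i;
        }
        return -1;
    }

    // Returns the payload length between "stream" and "endstream",
    // ignoring whatever /Length claims, or -1 if either keyword is missing.
    static int recoverLength(byte[] data, int objectStart) {
        int keyword = indexOf(data, STREAM, objectStart);
        if (keyword < 0) {
            return -1;
        }
        int start = keyword + STREAM.length;
        // The "stream" keyword is followed by CRLF or LF before the data.
        if (start < data.length && data[start] == '\r') {
            start++;
        }
        if (start < data.length && data[start] == '\n') {
            start++;
        }
        int end = indexOf(data, ENDSTREAM, start);
        if (end < 0) {
            return -1;
        }
        // Assumption: one end-of-line marker precedes "endstream" and is
        // not part of the payload.
        if (end > start && data[end - 1] == '\n') {
            end--;
        }
        if (end > start && data[end - 1] == '\r') {
            end--;
        }
        return end - start;
    }
}
```

Calling recoverLength on the bytes of an object whose dictionary says
"/Length 0" yields the number of bytes actually present between the two
keywords, which a parser could then use in place of the declared value.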
--
This message was sent by Atlassian JIRA
(v6.2#6252)