[jira] [Commented] (PDFBOX-2048) TextExtraction only working after uncompressing with pdftk

Tilman Hausherr (JIRA) Mon, 28 Apr 2014 22:59:28 -0700

    [ 
https://issues.apache.org/jira/browse/PDFBOX-2048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13984050#comment-13984050
 ]


Tilman Hausherr commented on PDFBOX-2048:
-----------------------------------------

Change committed to the trunk in rev 1590873, and to the 1.8 branch in rev 1590874.

Jonas, you can find a new jar file at 
https://repository.apache.org/content/groups/snapshots/org/apache/pdfbox/pdfbox-war/1.8.6-SNAPSHOT/
within a few hours. However, it will be a few months before this is 
officially released.

I will set the issue to resolved after the release of 1.8.5 (which will not 
include this change, because the cut was already done).

> TextExtraction only working after uncompressing with pdftk
> ----------------------------------------------------------
>
>                 Key: PDFBOX-2048
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2048
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing, Rendering, Text extraction
>    Affects Versions: 2.0.0
>            Reporter: Tilman Hausherr
>            Assignee: Tilman Hausherr
>
> From Jonas Karlsson on the user list:
> ===
> We have a user with PDFs generated by a commercial transcription service.
> When we try to extract text from these pdfs, pdfbox returns a few empty
> lines. We get this result both from our own code, and when using the
> ExtractText command line tool.
> If I specify the non-sequential parser, with the -nonSeq flag, the
> following error is produced:
> Apr 28, 2014 10:35:11 AM org.apache.pdfbox.pdfparser.NonSequentialPDFParser
> validateStreamLength
> SEVERE: The end of the stream doesn't point to the correct offset, using
> workaround to read the stream
> If I uncompress the file with pdftk, pdfbox is able to successfully extract
> the text.
> ===
> I have been given permission to attach the file "committers only". So don't 
> pass it around, and avoid quoting details from the file. The file also does 
> not render. The lengths of the streams are 0.
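
To illustrate the kind of problem described above (stream lengths declared 
as 0, so a parser that trusts the declared length reads nothing), here is a 
minimal sketch of one possible workaround: check whether "endstream" really 
follows the declared length, and if not, scan for the keyword instead. This 
is not the code committed in rev 1590873 and not PDFBox API; the class and 
method names (StreamLengthFallback, resolveStreamLength) are hypothetical, 
and it assumes the whole file is already in a byte array.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper, for illustration only; not part of PDFBox. */
public class StreamLengthFallback {
    private static final byte[] ENDSTREAM =
            "endstream".getBytes(StandardCharsets.ISO_8859_1);

    /** Returns true if keyword occurs in data starting exactly at pos. */
    public static boolean matchesAt(byte[] data, int pos, byte[] keyword) {
        if (pos < 0 || pos + keyword.length > data.length) {
            return false;
        }
        for (int i = 0; i < keyword.length; i++) {
            if (data[pos + i] != keyword[i]) {
                return false;
            }
        }
        return true;
    }

    /** Skips one optional end-of-line marker (CR, LF or CR LF) at pos. */
    public static int skipEol(byte[] data, int pos) {
        if (pos < data.length && data[pos] == '\r') {
            pos++;
        }
        if (pos < data.length && data[pos] == '\n') {
            pos++;
        }
        return pos;
    }

    /**
     * Returns the number of stream bytes starting at streamStart (the first
     * byte after the "stream" keyword and its end-of-line marker), or -1 if
     * no "endstream" keyword can be found.
     */
    public static int resolveStreamLength(byte[] data, int streamStart,
                                          int declaredLength) {
        // Trust the declared length only if "endstream" follows it.
        if (declaredLength >= 0) {
            long end = (long) streamStart + declaredLength;
            if (end <= data.length
                    && matchesAt(data, skipEol(data, (int) end), ENDSTREAM)) {
                return declaredLength;
            }
        }
        // Fallback: scan forward for the keyword. Caveat: binary stream
        // data could contain these bytes by chance, so this is a heuristic.
        for (int i = streamStart; i + ENDSTREAM.length <= data.length; i++) {
            if (matchesAt(data, i, ENDSTREAM)) {
                int end = i;
                // Drop the end-of-line marker that precedes the keyword.
                if (end > streamStart && data[end - 1] == '\n') {
                    end--;
                }
                if (end > streamStart && data[end - 1] == '\r') {
                    end--;
                }
                return end - streamStart;
            }
        }
        return -1;
    }
}
```

With a declared length of 0 and a non-empty stream, the first check fails 
and the scan supplies the real length; a correctly declared length is 
returned unchanged.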



--
This message was sent by Atlassian JIRA
(v6.2#6252)
