[ 
https://issues.apache.org/jira/browse/PDFBOX-4367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16676114#comment-16676114
 ] 

Tilman Hausherr edited comment on PDFBOX-4367 at 11/6/18 4:19 AM:
------------------------------------------------------------------

The "-force" option is a documentation leftover; there is a new issue about 
it, PDFBOX-4369.

There is no option to process each page by itself. Apache Tika (which uses 
PDFBox) has one; I know because we discussed it about a year ago. I can't 
find it on [https://tika.apache.org/1.19.1/gettingstarted.html], so it may 
not be available on the command line, only programmatically.

You could also modify ExtractText yourself. First get the number of pages 
(`document.getNumberOfPages()`), then for each page call `setStartPage()` and 
`setEndPage()` with that page number and run `writeText()`, so that 
`writeText()` runs once per page. Note that the page numbers are 1-based here 
(they are 0-based in some other places).
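
The per-page loop described above could look roughly like the sketch below. 
It is not the ExtractText implementation. `PerPageExtraction`, 
`PageExtractor` and `extractSkippingBadPages` are hypothetical names made up 
for illustration, and the PDFBox wiring appears only in a comment, on the 
assumption that the 2.0.x API is used (`PDDocument.load(File)`, 
`PDFTextStripper.writeText(PDDocument, Writer)`). The runnable part uses a 
fake extractor so it needs nothing beyond the JDK.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

class PerPageExtraction {

    /** Hypothetical stand-in for "extract the text of one page" (1-based page number). */
    interface PageExtractor {
        String extract(int page) throws IOException;
    }

    /**
     * Calls the extractor once per page, from 1 to pageCount inclusive.
     * A page that throws IOException is skipped and its number is recorded
     * in failedPages; text from the remaining pages is concatenated.
     */
    static String extractSkippingBadPages(int pageCount, PageExtractor extractor,
                                          List<Integer> failedPages) {
        StringBuilder text = new StringBuilder();
        for (int page = 1; page <= pageCount; page++) {
            try {
                text.append(extractor.extract(page));
            } catch (IOException e) {
                failedPages.add(page);
            }
        }
        return text.toString();
    }

    // Assumed wiring with PDFBox 2.0.x (untested here, shown as a comment only):
    //
    //   PDDocument document = PDDocument.load(new File("input.pdf"));
    //   PDFTextStripper stripper = new PDFTextStripper();
    //   PageExtractor extractor = page -> {
    //       stripper.setStartPage(page);   // 1-based
    //       stripper.setEndPage(page);
    //       StringWriter writer = new StringWriter();
    //       stripper.writeText(document, writer);
    //       return writer.toString();
    //   };
    //   String text = extractSkippingBadPages(
    //           document.getNumberOfPages(), extractor, failedPages);

    public static void main(String[] args) {
        // Fake extractor: page 2 behaves like a page with a corrupt content stream.
        PageExtractor fake = page -> {
            if (page == 2) {
                throw new IOException("simulated corrupt page");
            }
            return "text of page " + page + System.lineSeparator();
        };
        List<Integer> failedPages = new ArrayList<>();
        String text = extractSkippingBadPages(3, fake, failedPages);
        System.out.print(text);
        System.out.println("skipped pages: " + failedPages);
    }
}
```

Catching `IOException` per page matches the exception type in the reported 
stack trace; whether the stripper can be reused safely after a failed page 
is something to verify against the PDFBox version in use.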

I could also implement it in ExtractText; it would make sense for people who 
need this and can't change the code. The main problem is that I'd need a good 
name for the option (not "force").


> Error expected floating point number actual='18-5'
> --------------------------------------------------
>
>                 Key: PDFBOX-4367
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4367
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.12
>         Environment: Mac OS X Sierra
>            Reporter: Peter Johnson
>            Priority: Minor
>
> Able to repeat with command line.  Unfortunately, the only files that repeat 
> this are from a customer, and contain sensitive information.  The file opens 
> without error in Acrobat Reader and Mac Preview.  The desired result is that 
> any corrupt portions of the PDF are skipped, so that we can use what text is 
> extractable.
> Unfortunately, I still get an error when using the -force option.
> We get the following stack trace:
> {code:java}
> C02V390UHTD6:Downloads pjohnson$ java -jar pdfbox-app-2.0.12.jar ExtractText 16cccd9af5032a303774f7b87fb95076.pdf
> Nov 02, 2018 10:04:54 AM org.apache.pdfbox.pdfparser.BaseParser parseCOSArray
> WARNING: Corrupt object reference at offset 19727
> Exception in thread "main" java.io.IOException: Error expected floating point number actual='18-5'
> at org.apache.pdfbox.cos.COSFloat.<init>(COSFloat.java:78)
> at org.apache.pdfbox.cos.COSNumber.get(COSNumber.java:110)
> at org.apache.pdfbox.pdfparser.BaseParser.parseDirObject(BaseParser.java:947)
> at org.apache.pdfbox.pdfparser.BaseParser.parseCOSArray(BaseParser.java:631)
> at org.apache.pdfbox.pdfparser.PDFStreamParser.parseNextToken(PDFStreamParser.java:174)
> at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:510)
> at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:477)
> at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:150)
> at org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:139)
> at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:391)
> at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
> at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
> at org.apache.pdfbox.tools.ExtractText.startExtraction(ExtractText.java:237)
> at org.apache.pdfbox.tools.ExtractText.main(ExtractText.java:82)
> at org.apache.pdfbox.tools.PDFBox.main(PDFBox.java:60)
> Caused by: java.lang.NumberFormatException
> at java.math.BigDecimal.<init>(BigDecimal.java:494)
> at java.math.BigDecimal.<init>(BigDecimal.java:383)
> at java.math.BigDecimal.<init>(BigDecimal.java:806)
> at org.apache.pdfbox.cos.COSFloat.<init>(COSFloat.java:59)
> ... 14 more
> {code}


