[
https://issues.apache.org/jira/browse/PDFBOX-1716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tom updated PDFBOX-1716:
------------------------
Description:
A sample document
(https://issues.apache.org/jira/secure/attachment/12430914/FormI-9-English.pdf)
can be found at https://issues.apache.org/jira/browse/PDFBOX-578. The NPE fix
made for that issue appears to be only a workaround.
When I try to extract the text content from FormI-9-English.pdf,
PDDocument.getNumberOfPages() returns 0, which makes it impossible to extract
the text content:
InputStream in = <PDF InputStream>;
PDFParser parser = new PDFParser(in);
PDFTextStripper pdfStripper = null;
String parsedText = null;
parser.parse();
COSDocument cosDoc = parser.getDocument();
pdfStripper = new PDFTextStripper();
PDDocument pdDoc = new PDDocument(cosDoc);
for (int i = 1; i <= pdDoc.getNumberOfPages(); i++)
{ // !! pdDoc.getNumberOfPages() returns 0, which is incorrect !!
}
This problem occurs in PDFBox 1.8.2, the latest version.
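As a cross-check that does not depend on PDFBox, the page objects can be counted directly in the raw file. The following is only a rough sketch in plain Java: RawPageCount and countPageObjects are hypothetical names, not PDFBox API. The scan assumes the page dictionaries are stored uncompressed, so it will miss pages packed into object streams, and it can overcount if the file keeps old revisions of a page from incremental updates.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical diagnostic helper; not part of PDFBox.
class RawPageCount {

    // Matches "/Type /Page" but not "/Type /Pages" (the page tree node).
    private static final Pattern PAGE_OBJECT =
            Pattern.compile("/Type\\s*/Page(?![A-Za-z])");

    // Counts page dictionaries that appear uncompressed in the raw PDF bytes.
    static int countPageObjects(byte[] pdfBytes) {
        // ISO-8859-1 maps every byte to exactly one char, so binary data is safe.
        String raw = new String(pdfBytes, StandardCharsets.ISO_8859_1);
        Matcher m = PAGE_OBJECT.matcher(raw);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java RawPageCount <file.pdf>");
            return;
        }
        byte[] bytes = Files.readAllBytes(Paths.get(args[0]));
        System.out.println("page objects found: " + countPageObjects(bytes));
    }
}
```

If this reports a non-zero count for a file where getNumberOfPages() returns 0, the pages are present in the file, which would suggest the problem lies in how the page tree is read rather than in the document itself.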
> PDDocument.getNumberOfPages() return 0 for certain PDF document
> ---------------------------------------------------------------
>
> Key: PDFBOX-1716
> URL: https://issues.apache.org/jira/browse/PDFBOX-1716
> Project: PDFBox
> Issue Type: Bug
> Components: Parsing
> Reporter: Tom
> Fix For: 1.8.2