[jira] Commented: (PDFBOX-548) IOException in extracting text using PDFBox

Jignesh Sh (JIRA) Mon, 16 Nov 2009 04:10:08 -0800

    [ https://issues.apache.org/jira/browse/PDFBOX-548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12778311#action_12778311 ]


Jignesh Sh commented on PDFBOX-548:
-----------------------------------

Thanks,
the latest 0.8 version of the PDFBox jar files solves this issue.

> IOException in extracting text using PDFBox
> -------------------------------------------
>
>                 Key: PDFBOX-548
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-548
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 0.7.0
>            Reporter: Jignesh Sh
>   Original Estimate: 96h
>  Remaining Estimate: 96h
>
> Hi All,
> I am getting an IOException when extracting text with PDFBox. The PDF file I
> am trying to read is NOT password protected.
> The program throws the IOException at this line:
> pdfText = stripper.getText(pdDoc);
> I am actually using PDFBox version PDFBox-0.6.7a.jar.
> Here is my code:
> public String getPDFContent(ZipEntry pdfEntry)
> {
>     boolean status = false;
>     String pdfText = null;
>     ZipIssueFactory issueFactory = null;
>     logger.debug("Processing : " + pdfEntry.getName());
>     COSDocument cosDoc = null;
>     PDDocument pdDoc = null;
>     try
>     {
>         cosDoc = parseDocument(zipFile.getInputStream(pdfEntry));
>         // skipping the PDF document, if it is encrypted
>         if (cosDoc.isEncrypted()) {
>             logger.warn("Can not decrypt PDF document w/o password, skipping:" + pdfEntry.getName());
>             return pdfText;
>         }
>         // extract PDF document's textual content
>         pdDoc = new PDDocument(cosDoc);
>         PDFTextStripper stripper = new PDFTextStripper();
>         pdfText = stripper.getText(pdDoc); // THIS LINE THROWS IOException
>     }
>     catch (IOException e) {
>         pdfText = null;
>         logger.error("IOException in parsing PDF document: " + e);
>     }
>     finally {
>         closeCOSDocument(cosDoc);
>         closePDDocument(pdDoc);
>     }
>     return pdfText;
> }
>
> private static COSDocument parseDocument(InputStream is) throws IOException {
>     PDFParser parser = new PDFParser(is);
>     parser.parse();
>     return parser.getDocument();
> }
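
The catch block in the quoted code logs only "..." + e, which records the exception's class and message but not where inside the library it was thrown. A minimal sketch of capturing the full trace as text, using only the standard library, might look like the following. The class name StackTraceText and the method describe are hypothetical names chosen for illustration; they are not part of PDFBox or of the reporter's code.

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.io.StringWriter;

// Hypothetical helper (not part of PDFBox): renders a Throwable, including
// its "Caused by:" chain, as a String suitable for a log message.
final class StackTraceText {

    static String describe(Throwable t) {
        StringWriter buffer = new StringWriter();
        // printStackTrace writes the exception, its frames, and any causes.
        t.printStackTrace(new PrintWriter(buffer, true));
        return buffer.toString();
    }

    public static void main(String[] args) {
        // Stand-in for the IOException thrown during text extraction.
        IOException e = new IOException("outer", new IllegalStateException("root cause"));
        System.out.println(describe(e));
    }
}
```

In the quoted method this would amount to logging describe(e) instead of e. If the logger in use has an overload that accepts a Throwable (log4j's error(Object, Throwable) is one such), passing e as the second argument should give the same information without a helper.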

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
