[jira] Commented: (PDFBOX-547) problem in extracting text using PDFBox

Jignesh Sh (JIRA) Mon, 16 Nov 2009 04:08:05 -0800

    [ 
https://issues.apache.org/jira/browse/PDFBOX-547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12778306#action_12778306
 ]


Jignesh Sh commented on PDFBOX-547:
-----------------------------------

This issue was resolved once I switched to the latest two PDFBox jar files:
pdfbox-0.8.0-incubating.jar
fontbox-0.8.0-incubating.jar

Thanks,
Jignesh

> problem in extracting text using PDFBox
> ---------------------------------------
>
>                 Key: PDFBOX-547
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-547
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 0.7.0
>            Reporter: Jignesh Sh
>   Original Estimate: 96h
>  Remaining Estimate: 96h
>
> Hi All,
> I am having a problem extracting text using PDFBox.
> The program hangs at the line pdfText = stripper.getText(pdDoc); and returns 
> nothing.
> I am using PDFBox version PDFBox-0.6.7a.jar.
> Here is my code:
> public String getPDFContent(ZipEntry pdfEntry)
> {
>     boolean status = false;
>     String pdfText = null;
>     ZipIssueFactory issueFactory = null;
>     logger.debug("Processing : " + pdfEntry.getName());
>     COSDocument cosDoc = null;
>     PDDocument pdDoc = null;
>     try
>     {
>         // Load InputStream into memory
>         cosDoc = parseDocument(zipFile.getInputStream(pdfEntry));
>
>         // skipping the PDF document, if it is encrypted
>         if (cosDoc.isEncrypted()) {
>             logger.warn("Can not decrypt PDF document w/o password, skipping:" + pdfEntry.getName());
>             return pdfText;
>         }
>         // extract PDF document's textual content
>         pdDoc = new PDDocument(cosDoc);
>         PDFTextStripper stripper = new PDFTextStripper();
>         pdfText = stripper.getText(pdDoc);
>     }
>     catch (IOException e) {
>         pdfText = null;
>         logger.error("IOException in parsing PDF document: " + e);
>     }
>     finally {
>         closeCOSDocument(cosDoc);
>         closePDDocument(pdDoc);
>     }
>     return pdfText;
> }
>
> private static COSDocument parseDocument(InputStream is) throws IOException {
>     PDFParser parser = new PDFParser(is);
>     parser.parse();
>     return parser.getDocument();
> }
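
For anyone who cannot upgrade right away, one possible way to keep a hanging getText() call from blocking the whole batch is to run the extraction under a timeout. The following is only a rough sketch using the standard java.util.concurrent classes. The class name TimedExtraction and the method runWithTimeout are made up for illustration and are not part of PDFBox. The Callable stands in for whatever wraps stripper.getText(pdDoc).

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper (not a PDFBox class): runs a text-extraction task
// and gives up after a fixed time instead of waiting forever.
class TimedExtraction {

    // Returns the task's result, or null if it timed out or failed.
    static String runWithTimeout(Callable<String> task, long timeoutMillis) {
        // Daemon thread, so a worker that never finishes cannot keep the JVM alive.
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "pdf-extract");
            t.setDaemon(true);
            return t;
        });
        Future<String> future = executor.submit(task);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Ask the worker to stop; this only helps if the task reacts to interrupts.
            future.cancel(true);
            return null;
        } catch (ExecutionException e) {
            // The task itself threw (for example an IOException while parsing).
            return null;
        } catch (InterruptedException e) {
            // Preserve the caller's interrupt status.
            Thread.currentThread().interrupt();
            return null;
        } finally {
            executor.shutdownNow();
        }
    }
}
```

Assuming pdDoc and stripper are set up as in the quoted method, the call site might look like pdfText = TimedExtraction.runWithTimeout(() -> stripper.getText(pdDoc), 30000); where the 30 second limit is an arbitrary choice. Note that a parser stuck in a tight loop will probably ignore the interrupt, so this lets the caller move on but does not necessarily free the stuck thread.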

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
