Re: Extracting text from a large (112) list of pdf files

Adam Tue, 29 Sep 2009 08:56:17 -0700

I don't see anything which jumps out at me here other than you're closing 
the document in 3 places where it's not needed (although that won't hurt 
anything).  Your outer try seems unnecessary, and I can't see what 
isFailedFile would be set to when this code is called (or even where it is 
defined for that matter).  The "finally" should always close the file 
(which means it'll always already be closed by the time you get to the 
catch statement, return statement, or at the end of the function).  Your 
debug statements look like the way to go in order to track this one down.
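
To illustrate the point about "finally", here is a minimal sketch that uses only the standard library. FakeDocument is a hypothetical stand-in for PDDocument (it is not part of pdfbox) and findMarker is an invented helper; the only purpose is to show that a single close() in "finally" covers the early return, the normal return, and the exception path.

```java
import java.io.Closeable;
import java.io.IOException;

public class CloseOnceSketch {

    /** Hypothetical stand-in for PDDocument; counts how often close() is called. */
    static class FakeDocument implements Closeable {
        int closeCount = 0;
        private final String text;

        FakeDocument(String text) {
            this.text = text;
        }

        /** Simulates text extraction; a null text plays the role of a bad file. */
        String getText() throws IOException {
            if (text == null) {
                throw new IOException("unreadable document");
            }
            return text;
        }

        @Override
        public void close() {
            closeCount++;
        }
    }

    /** Returns the text if it contains "doi:", otherwise null. Closes doc exactly once. */
    static String findMarker(FakeDocument doc) {
        try {
            String text = doc.getText();
            if (text.contains("doi:")) {
                return text; // finally still runs before the method returns
            }
            return null;
        } catch (IOException e) {
            return null; // finally runs after the catch block as well
        } finally {
            doc.close(); // the only close() call that is needed
        }
    }

    public static void main(String[] args) {
        FakeDocument found = new FakeDocument("doi:10.1000/xyz");
        FakeDocument broken = new FakeDocument(null);
        findMarker(found);
        findMarker(broken);
        System.out.println("close calls: " + found.closeCount + ", " + broken.closeCount);
    }
}
```

With this shape the extra close() calls before the return, in the catch block, and at the end of the method can all be dropped.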


Also, your regex would be simpler if you used the "case insensitive" flag. 
IIRC, it's passed to Pattern.compile and is something like 
Pattern.CASE_INSENSITIVE, but you'll have to hit the docs to verify.
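
A rough sketch of what that could look like, assuming the same DOI expression as in your code with the [Dd][Oo][Ii] prefix replaced by a plain "doi:". The class name DoiPatternSketch and the helper extractDoi are made up for illustration. Pattern.compile(String, int) and Pattern.CASE_INSENSITIVE are part of java.util.regex.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DoiPatternSketch {

    // Same expression as in the original code, but the prefix is written once
    // in lower case and the flag makes the match case insensitive.
    static final Pattern DOI = Pattern.compile(
            "doi:[0-9\\s]*\\.[\\n\\r0-9]*/[A-Za-z0-9\\.\\-;\\(\\)/]*",
            Pattern.CASE_INSENSITIVE);

    /** Returns the DOI without the "doi:" prefix and without whitespace, or null. */
    static String extractDoi(String text) {
        Matcher m = DOI.matcher(text);
        if (!m.find()) {
            return null;
        }
        // The prefix is always 4 characters long, whatever its case.
        return m.group().substring(4).replaceAll("\\s+", "");
    }

    public static void main(String[] args) {
        System.out.println(extractDoi("See DOI: 10.1000/xyz123 for details"));
        System.out.println(extractDoi("Doi:10.1016/j.cell.2009.01.001"));
    }
}
```

Taking the substring after the fixed-length prefix also avoids the second split regex and the array index lookup.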

--Adam




Sophia Cheng <sophia.ch...@gmail.com>
09/29/2009 06:59
Please respond to: pdfbox-users@incubator.apache.org
To: pdfbox-users@incubator.apache.org
Subject: Extracting text from a large (112) list of pdf files






Hello,

I'm using pdfbox to go through a list of pdfs and attempt to extract a
phrase out of the files.  Thus far, everything has been working great
until I scaled it up to use a large list of files. I am running into a
problem when there is a list of 112 files.  After successfully going
through a handful of them (maybe 20), subsequent files are not able to
be opened.  I have tried using the same method on just the offending
files one at a time, and they are able to be opened.

Here is the problematic method, which is called for each pdf file:

        PDDocument doc = null;
        String     regExDoi      = "[Dd][Oo][Ii]:[0-9\\s]*\\.[\\n\\r0-9]*/[A-Za-z0-9\\.\\-;\\(\\)/]*";
        String    regExDoiSplit = "[Dd][Oo][Ii]:";
        Pattern findDoiString = Pattern.compile(regExDoi);

        try {
            try {
                doc = PDDocument.load(file);
                System.out.println("======= "+file.getName()+" loaded ========");
                decrypt(doc);

                if (!isFailedFile) {
                    PDFTextStripper strip = new PDFTextStripper();
                    int pageCount = doc.getNumberOfPages();
                    System.out.println("Pages: "+pageCount);

                    for (int page = 1; page < pageCount; page++) {
                        // restrict pdftextstripper to current page
                        strip.setStartPage(page);
                        strip.setEndPage(page);

                        // get text on page
                        String text = strip.getText(doc);

                        // try to find the doi string
                        Matcher m = findDoiString.matcher(text);
                        if (m.find()) {
                            String foundGroup = m.group();
                            String foundIt[] = foundGroup.split(regExDoiSplit);
                            // split at regexDoiSplit, should be String[] = {"", "the doi numbers"}
                            if (foundIt.length > 0) {
                                System.out.println("\tDOI: '"+foundIt[1]+"'");

                                if (doc != null) {
                                    System.out.println("Closing document, found doi.");
                                    doc.close();
                                }
                                // return the doi numbers, stripping any white space
                                return foundIt[1].replaceAll("[\\s]*", "");
                            }
                        }
                    }
                } else System.out.println(isFailedFile + failedReason.toString());
            } finally {
                if (doc != null) {
                    doc.close();
                }
            }
        } catch (IOException e) {
            isFailedFile = true;
            failedReason = FailedReason.BADFILE;
            if (doc != null) {
                doc.close();
            }
        }

        if (doc != null) {
            doc.close();
            System.out.println("close it again");
        }
        return null;
}

I think the problem is arising because I keep getting a "warning, you
did not close the pdf" and in such a long list, after getting that
warning so many times, it won't open the files anymore.  I thought I
closed the document at all points that needed to be closed, did I
forget something else?  Thank you.

-Sophia

-- 
~~~~~~~~~~~~~~~~~~~~~~~~~
Aim for the moon. If you miss, you may hit a star. -W. Clement Stone

