Extracting text from a large (112) list of pdf files

Sophia Cheng Tue, 29 Sep 2009 07:00:07 -0700

Hello,

I'm using pdfbox to go through a list of pdfs and attempt to extract a
phrase out of the files.  Thus far, everything has been working great
until I scaled it up to use a large list of files. I am running into a
problem when there is a list of 112 files.  After successfully going
through a handful of them (maybe 20), subsequent files are not able to
be opened.  I have tried using the same method on just the offending
files one at a time, and they are able to be opened.


Here is the problematic method, which is called for each pdf file:

        PDDocument doc = null;
        String     regExDoi      = "[Dd][Oo][Ii]:[0-9\\s]*\\.[\\n\\r0-9]*/[A-Za-z0-9\\.\\-;\\(\\)/]*";
        String    regExDoiSplit = "[Dd][Oo][Ii]:";
        Pattern findDoiString = Pattern.compile(regExDoi);

        try {
            try {
                doc = PDDocument.load(file);
                System.out.println("======= "+file.getName()+" loaded ========");
                decrypt(doc);

                if (!isFailedFile) {
                    PDFTextStripper strip = new PDFTextStripper();
                    int pageCount = doc.getNumberOfPages();
                    System.out.println("Pages: "+pageCount);

                    for (int page = 1; page <= pageCount; page++) {
                        // restrict pdftextstripper to current page
                        // (page numbers are 1-based, so pageCount is the last page)
                        strip.setStartPage(page);
                        strip.setEndPage(page);

                        // get text on page
                        String text = strip.getText(doc);

                        // try to find the doi string
                        Matcher m = findDoiString.matcher(text);
                        if (m.find()) {
                            String foundGroup = m.group();
                            String[] foundIt = foundGroup.split(regExDoiSplit);
                            // split at regExDoiSplit, should be String[] = {"", "the doi numbers"}
                            if (foundIt.length > 1) {
                                System.out.println("\tDOI: '"+foundIt[1]+"'");

                                if (doc != null) {
                                    System.out.println("Closing document, found doi.");
                                    doc.close();
                                }
                                // return the doi numbers, stripping any white space

                                return foundIt[1].replaceAll("[\\s]*", "");
                            }
                        }
                    }
                } else System.out.println(isFailedFile + failedReason.toString());
            } finally {
                if (doc != null) {
                    doc.close();
                }
            }
        } catch (IOException e) {
            isFailedFile = true;
            failedReason = FailedReason.BADFILE;
            if (doc != null) {
                doc.close();
            }
        }

        if (doc != null) {
            doc.close();
            System.out.println("close it again");
        }
        return null;
}
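
For anyone who wants to try the matching step without any pdf files, here is a minimal, standalone sketch of the regex-and-split logic from the method above. The class name DoiRegexSketch, the helper extractDoi, and the sample text are made up for illustration; only the two patterns come from the method itself.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DoiRegexSketch {
    // Same two patterns as in the method above.
    static final Pattern DOI =
        Pattern.compile("[Dd][Oo][Ii]:[0-9\\s]*\\.[\\n\\r0-9]*/[A-Za-z0-9\\.\\-;\\(\\)/]*");
    static final String DOI_SPLIT = "[Dd][Oo][Ii]:";

    /** Hypothetical helper: returns the DOI with whitespace removed, or null if none is found. */
    static String extractDoi(String text) {
        Matcher m = DOI.matcher(text);
        if (!m.find()) {
            return null;
        }
        // Splitting "doi: 10.1000/xyz" at the prefix yields {"", " 10.1000/xyz"}.
        String[] parts = m.group().split(DOI_SPLIT);
        if (parts.length < 2) {
            return null;
        }
        return parts[1].replaceAll("\\s+", "");
    }

    public static void main(String[] args) {
        // Made-up sample text standing in for one page of extracted pdf text.
        System.out.println(extractDoi("Published online. doi: 10.1000/xyz123 Received 2009"));
    }
}
```

The part after the prefix ends up at index 1 of the split result because the match sits at the very start of the string, so index 0 is an empty string.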

I think the problem arises because I keep getting a "warning, you
did not close the pdf" message.  With such a long list, after that
warning has appeared many times, the files can no longer be opened.
I thought I closed the document at every point where it needed to be
closed.  Did I forget something?  Thank you.
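
As a reference point for the closing logic, here is a minimal sketch of a close-once structure. It uses a hypothetical stand-in class (FakeDoc) instead of PDDocument, so it runs without pdfbox; the names FakeDoc and findFirst are invented for illustration. It only shows that a finally block also runs on an early return, so a single close call there covers every path. It is not a diagnosis of where the warning comes from.

```java
import java.io.Closeable;
import java.util.Arrays;
import java.util.List;

class CloseOnceSketch {
    /** Hypothetical stand-in for a document; counts how often close() is called. */
    static class FakeDoc implements Closeable {
        final List<String> pages;
        int closeCalls = 0;

        FakeDoc(List<String> pages) {
            this.pages = pages;
        }

        @Override
        public void close() {
            closeCalls++;
        }
    }

    /** Returns the first page containing needle, or null if there is none. */
    static String findFirst(FakeDoc doc, String needle) {
        try {
            for (String page : doc.pages) {
                if (page.contains(needle)) {
                    return page; // the finally block still runs before returning
                }
            }
            return null;
        } finally {
            doc.close(); // the only close call; reached on every path
        }
    }

    public static void main(String[] args) {
        FakeDoc hit = new FakeDoc(Arrays.asList("intro", "doi:10.1000/x"));
        FakeDoc miss = new FakeDoc(Arrays.asList("intro", "summary"));
        System.out.println(findFirst(hit, "doi") + " closeCalls=" + hit.closeCalls);
        System.out.println(findFirst(miss, "doi") + " closeCalls=" + miss.closeCalls);
    }
}
```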

-Sophia

-- 
~~~~~~~~~~~~~~~~~~~~~~~~~
Aim for the moon. If you miss, you may hit a star. -W. Clement Stone
