I don't see anything that jumps out at me here, other than that you're closing the document in three places where it's not needed (although that won't hurt anything). Your outer try seems unnecessary, and I can't see what isFailedFile would be set to when this code is called (or even where it is defined, for that matter). The "finally" should always close the file, which makes the explicit close() calls before the return, in the catch block, and at the end of the function redundant: the document is already closed by the time control leaves the inner try. Your debug statements look like the way to go to track this one down.
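To illustrate the point about "finally", here is a minimal sketch that mirrors the shape of the method (nested try blocks, a return inside the inner try, one close() in the finally). FakeDocument and process are hypothetical stand-ins invented for this example; FakeDocument is not a PDFBox class and only counts how often close() is called.

```java
import java.io.Closeable;
import java.io.IOException;

final class FinallyCloseDemo {

    /** Hypothetical stand-in for PDDocument: it only counts close() calls. */
    static final class FakeDocument implements Closeable {
        int closeCalls = 0;

        @Override
        public void close() {
            closeCalls++;
        }
    }

    /**
     * Same control flow as the method under discussion, but with a single
     * close() in the finally block and no other close() calls.
     */
    static String process(FakeDocument doc, boolean found, boolean fail) {
        try {
            try {
                if (fail) {
                    throw new IOException("simulated read failure");
                }
                if (found) {
                    return "doi"; // finally still runs before the method returns
                }
            } finally {
                doc.close(); // the only close() that is needed
            }
        } catch (IOException e) {
            // The inner finally has already run by the time we get here.
            return "failed (closed " + doc.closeCalls + "x before catch)";
        }
        return null;
    }

    public static void main(String[] args) {
        boolean[][] cases = { {true, false}, {false, false}, {false, true} };
        for (boolean[] c : cases) {
            FakeDocument doc = new FakeDocument();
            String result = process(doc, c[0], c[1]);
            System.out.println("found=" + c[0] + " fail=" + c[1]
                    + " result=" + result + " closeCalls=" + doc.closeCalls);
        }
    }
}
```

In all three paths (early return, normal fall-through, exception) the stand-in is closed exactly once, so any additional close() would only be a repeat.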
Also, your regex would be simpler if you used the "case insensitive" flag. IIRC, it's passed to Pattern.compile and is something like Pattern.CASE_INSENSITIVE, but you'll have to check the docs to verify.

--Adam

Sophia Cheng <sophia.ch...@gmail.com>
09/29/2009 06:59
Please respond to pdfbox-users@incubator.apache.org
To: pdfbox-users@incubator.apache.org
cc:
Subject: Extracting text from a large (112) list of pdf files

Hello,

I'm using pdfbox to go through a list of pdfs and attempt to extract a phrase from each file. Everything was working great until I scaled it up to a large list of files. I run into a problem when the list has 112 files: after successfully going through a handful of them (maybe 20), subsequent files cannot be opened. I have tried the same method on just the offending files, one at a time, and they open fine.

Here is the problematic method, which is called for each pdf file:

    PDDocument doc = null;
    String regExDoi = "[Dd][Oo][Ii]:[0-9\\s]*\\.[\\ n\\r0-9]*/[A-Za-z0-9\\.\\-;\\(\\)/]*";
    String regExDoiSplit = "[Dd][Oo][Ii]:";
    Pattern findDoiString = Pattern.compile(regExDoi);
    try {
        try {
            doc = PDDocument.load(file);
            System.out.println("======= "+file.getName()+" loaded ========");
            decrypt(doc);
            if (!isFailedFile) {
                PDFTextStripper strip = new PDFTextStripper();
                int pageCount = doc.getNumberOfPages();
                System.out.println("Pages: "+pageCount);
                for (int page = 1; page < pageCount; page++) {
                    // restrict pdftextstripper to current page
                    strip.setStartPage(page);
                    strip.setEndPage(page);
                    // get text on page
                    String text = strip.getText(doc);
                    // try to find the doi string
                    Matcher m = findDoiString.matcher(text);
                    if (m.find()) {
                        String foundGroup = m.group();
                        // split at regexDoiSplit, should be String[] = {"", "the doi numbers"}
                        String foundIt[] = foundGroup.split(regExDoiSplit);
                        if (foundIt.length > 0) {
                            System.out.println("\tDOI: '"+foundIt[1]+"'");
                            if (doc != null) {
                                System.out.println("Closing document, found doi.");
                                doc.close();
                            }
                            // return the doi numbers, stripping any white space
                            return foundIt[1].replaceAll("[\\s]*", "");
                        }
                    }
                }
            } else
                System.out.println(isFailedFile + failedReason.toString());
        } finally {
            if (doc != null) {
                doc.close();
            }
        }
    } catch (IOException e) {
        isFailedFile = true;
        failedReason = FailedReason.BADFILE;
        if (doc != null) {
            doc.close();
        }
    }
    if (doc != null) {
        doc.close();
        System.out.println("close it again");
    }
    return null;
    }

I think the problem arises because I keep getting a "warning, you did not close the pdf", and in such a long list, after getting that warning so many times, it won't open the files anymore. I thought I closed the document at every point where it needed to be closed. Did I forget something else?

Thank you.
-Sophia

--
Aim for the moon. If you miss, you may hit a star.
-W. Clement Stone