Hi, I'm having trouble extracting text from a PDF. The goal is to let users upload PDFs and add their text to our Lucene index. This code has worked fine for 95% of the PDFs our users upload. Unfortunately, the PDF causing this error is quite large, so I didn't attach it.
Anyway, this is the error I'm getting:

Caused by: java.io.IOException: Unknown colorspace type:null
    at org.pdfbox.pdmodel.graphics.color.PDColorSpaceFactory.createColorSpace(PDColorSpaceFactory.java:121)
    at org.pdfbox.pdmodel.PDResources.getColorSpaces(PDResources.java:264)
    at org.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:196)
    at org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:177)
    at org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:339)
    at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:263)
    at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:219)
    at org.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:152)

And this is how I use PDFTextStripper in my code:

    InputStream is = DocumentServiceUtil.getFileAsStream(...);
    PDDocument document = PDDocument.load(is);
    int nbrPages = document.getNumberOfPages();
    if (nbrPages > 0) {
        PDFTextStripper stripper = new PDFTextStripper();
        stripper.setLineSeparator(" ");
        stripper.setPageSeparator(" ");
        List<IndexedCatalogPage> pages = new ArrayList<IndexedCatalogPage>();
        for (int i = 1; i <= nbrPages; i++) {
            stripper.setStartPage(i);
            stripper.setEndPage(i);
            String text = stripper.getText(document);
            IndexedCatalogPage page = new IndexedCatalogPage();
            page.setPageNumber(i);
            page.setText(text);
            pages.add(page);
        }
        ...

Any help is greatly appreciated!

Best Regards,
Kim