Hi, 

I'm having trouble extracting text from a PDF. The goal is to let users upload
PDFs and add their text to our Lucene index.
This code has been working fine for 95% of the PDFs our users upload.
Unfortunately, the PDF that causes this error is quite large, so I didn't attach it.

Anyway, this is the error I'm getting:

Caused by: java.io.IOException: Unknown colorspace type:null
 at org.pdfbox.pdmodel.graphics.color.PDColorSpaceFactory.createColorSpace(PDColorSpaceFactory.java:121)
 at org.pdfbox.pdmodel.PDResources.getColorSpaces(PDResources.java:264)
 at org.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:196)
 at org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:177)
 at org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:339)
 at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:263)
 at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:219)
 at org.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:152)

And this is how I use PDFTextStripper in my code:
       
    InputStream is = DocumentServiceUtil.getFileAsStream(...);

    PDDocument document = PDDocument.load(is);
    int nbrPages = document.getNumberOfPages();
    if (nbrPages > 0) {
      PDFTextStripper stripper = new PDFTextStripper();
      stripper.setLineSeparator(" ");
      stripper.setPageSeparator(" ");
      List<IndexedCatalogPage> pages = new ArrayList<IndexedCatalogPage>();
      for (int i = 1; i <= nbrPages; i++) {
        stripper.setStartPage(i);
        stripper.setEndPage(i);
        String text = stripper.getText(document);
        IndexedCatalogPage page = new IndexedCatalogPage();
        page.setPageNumber(i);
        page.setText(text);
        pages.add(page);
      }
    ...
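
As a stopgap, the per-page loop above could isolate failures so that one bad page does not stop the rest of the document from being indexed. The following is only a rough sketch of that idea, not a fix for the colorspace error itself. PageTextSource and PerPageExtraction are hypothetical names made up for illustration; in the real code, textOf(i) would wrap the setStartPage / setEndPage / getText calls. It also assumes the stripper is still usable after an exception on one page, which I have not verified.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for "extract the text of one page"; in the real code
// this would wrap stripper.setStartPage(i), setEndPage(i) and getText(document).
interface PageTextSource {
    String textOf(int pageNumber) throws IOException;
}

final class PerPageExtraction {
    // Extracts pages 1..pageCount. A page that throws IOException is stored
    // as an empty string and its number is appended to failedPages, so the
    // remaining pages are still processed.
    static List<String> extractAll(PageTextSource source, int pageCount,
                                   List<Integer> failedPages) {
        List<String> texts = new ArrayList<>();
        for (int i = 1; i <= pageCount; i++) {
            try {
                texts.add(source.textOf(i));
            } catch (IOException e) {
                failedPages.add(i);   // remember which page could not be read
                texts.add("");        // keep list positions aligned with page numbers
            }
        }
        return texts;
    }
}
```

The failed page numbers could then be logged, so the affected PDFs can be looked at separately.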
    
Any help is greatly appreciated!
Best Regards,
Kim
    
