[
https://issues.apache.org/jira/browse/PDFBOX-981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Matt England updated PDFBOX-981:
--------------------------------
Attachment: example.pdf
Example PDF file that fails with stock PDFBox 1.5.0 but passes with the included
patch. Text was extracted with PDFTextStripper like so:
new PDFTextStripper().getText(PDDocument.load(new FileInputStream("example.pdf")));
> PDColorspaceFactory does not recognize colorspace DeviceGray (patch included
> herein)
> ------------------------------------------------------------------------------------
>
> Key: PDFBOX-981
> URL: https://issues.apache.org/jira/browse/PDFBOX-981
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 1.5.0
> Reporter: Matt England
> Labels: pdfbox
> Attachments: PDColorSpaceFactory.java.diff, example.pdf
>
>
> I was trying to use PDFTextStripper to extract text from a large corpus of
> PDF files. For some of them, the method
> org.apache.pdfbox.pdmodel.graphics.color.PDColorSpaceFactory.createColorSpace(
> COSBase colorSpace, Map colorSpaces )
> fails to recognize the case where the colorSpace argument is a COSArray
> whose first element is COSName.DEVICEGRAY. Adding a branch for that case lets
> the files that failed with stock pdfbox-1.5.0 parse successfully. Below is a
> diff of my patched PDColorSpaceFactory that handles the case where the
> colorspace name is DeviceGray. Incidentally, it occurs to me that another
> (possibly better) approach is to fall through to createColorSpace(String)
> when no other case matches.
> % diff PDColorSpaceFactory.java.orig PDColorSpaceFactory.java
> 94a95,97
> > else if ( type.getName().equals( PDDeviceGray.NAME) ) {
> > retval = new PDDeviceGray();
> > }
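
The alternative mentioned in the description (falling through to the name-based lookup when no array-specific case matches) could look roughly like the sketch below. This is a simplified, self-contained model, not PDFBox code: the class ColorSpaceDispatchSketch, the methods createByName and createFromArray, and the use of plain strings in place of COSArray, COSName, and PDColorSpace are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the dispatch in PDColorSpaceFactory.
// Strings replace the real COS and PDColorSpace types to keep this runnable
// without PDFBox on the classpath.
class ColorSpaceDispatchSketch {

    // Models createColorSpace(String): resolves simple colorspace names.
    static String createByName(String name) {
        switch (name) {
            case "DeviceGray":
            case "DeviceRGB":
            case "DeviceCMYK":
                return name;
            default:
                throw new IllegalArgumentException("Unknown colorspace: " + name);
        }
    }

    // Models the COSArray branch: the first element names the colorspace,
    // and later elements carry parameters for the array-only kinds.
    static String createFromArray(List<String> array) {
        String type = array.get(0);
        if (type.equals("ICCBased")) {
            return "ICCBased(" + array.get(1) + ")";
        } else if (type.equals("Indexed")) {
            return "Indexed(base=" + array.get(1) + ")";
        }
        // Fallback: no array-specific case matched, so defer to the name
        // lookup. An array such as [DeviceGray] then resolves without
        // needing its own else-if branch.
        return createByName(type);
    }

    public static void main(String[] args) {
        System.out.println(createFromArray(Arrays.asList("DeviceGray")));
    }
}
```

With this shape, any simple name that the name-based lookup already understands is also accepted in array form, while genuinely unknown names still fail loudly instead of being silently ignored.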