[ https://issues.apache.org/jira/browse/PDFBOX-1067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13226867#comment-13226867 ]
Dave Smith commented on PDFBOX-1067:
------------------------------------

If you read the PDF spec (3.3.6 JBIG2Decode Filter), it explicitly says to strip the JBIG2 file header, split segment 0 out into the JBIG2Globals part, and drop the end-of-page and end-of-file segments. Because the header is missing, ImageIO will never find the right type of reader on its own. So we have to load the reader manually and insert the globals segment before the rest of the stream data.

Here is a proof of concept that you can try with http://code.google.com/p/jbig2-imageio

In org.apache.pdfbox.filter.JBIG2Filter, new decode ...

    @Override
    public void decode( InputStream compressedData, OutputStream result, COSDictionary options, int filterIndex )
        throws IOException
    {
        // Look the reader up by format name; the stream has no header for ImageIO to sniff.
        Iterator<ImageReader> readers = ImageIO.getImageReadersByFormatName("JBIG2");
        if (!readers.hasNext())
        {
            log.error( "Can't find an ImageIO plugin to decode the JBIG2 encoded datastream.");
            return;
        }
        ImageReader reader = readers.next();

        // The global segments live in a separate stream referenced from the decode parameters.
        COSDictionary decodeP = (COSDictionary) options.getDictionaryObject(COSName.DECODE_PARMS);
        COSStream st = (COSStream) decodeP.getDictionaryObject(COSName.getPDFName("JBIG2Globals"));

        // Feed the reader the globals followed by the image data.
        reader.setInput(ImageIO.createImageInputStream(JBIG2StreamMerge(st.getFilteredStream(), compressedData)));
        BufferedImage bi = reader.read(0);
        if ( bi != null )
        {
            DataBuffer dBuf = bi.getData().getDataBuffer();
            if ( dBuf.getDataType() == DataBuffer.TYPE_BYTE )
            {
                result.write( ( ( DataBufferByte ) dBuf ).getData() );
            }
            else
            {
                log.error( "Image data buffer not of type byte but type " + dBuf.getDataType() );
            }
        }
        else
        {
            log.error( "Something went wrong when decoding the JBIG2 encoded datastream.");
        }
    }

    // ugly. Should use some sort of stream merge ...
    protected static InputStream JBIG2StreamMerge(InputStream globals, InputStream body)
        throws IOException
    {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];

        // Copy the globals first ...
        int read = globals.read(buf);
        while (read != -1)
        {
            out.write(buf, 0, read);
            read = globals.read(buf);
        }

        // ... then append the image data.
        read = body.read(buf);
        while (read != -1)
        {
            out.write(buf, 0, read);
            read = body.read(buf);
        }
        out.close();
        return new ByteArrayInputStream(out.toByteArray());
    }

> PDF Scan from Xerox WorkCentre 5030 renders as all black
> --------------------------------------------------------
>
>                 Key: PDFBOX-1067
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1067
>             Project: PDFBox
>          Issue Type: New Feature
>          Components: PDModel
>    Affects Versions: 1.6.0
>         Environment: Tested on MacOS X 10.6.7, Ubuntu 10.10, Windows 7
>            Reporter: Sarah Kelley
>         Attachments: ItDoesntWorkScan.pdf, sakelley_pdf_rendering_problem.zip
>
>
> The file "ItDoesntWorkScan.pdf" renders to an empty
> black page. This file is a copy of "ItDoesntWorkPrinted.pdf"
> that has been printed on paper, and then scanned with
> a Xerox WorkCentre 5030 scanner, which then emails a pdf file
> back to the user.
> Tested On:
> - Mac OS 10.6
> - Windows 7
> - Ubuntu 10.10
> Unfortunately, the WorkCentre 5030 doesn't appear to have
> many user-settable options for scanning to PDF, so we weren't
> really able to try scanning with settings other than the defaults.
> Will attach pdf and code to demonstrate.
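
On the "should use some sort of stream merge" remark next to JBIG2StreamMerge: one possible way to avoid buffering both streams in memory is java.io.SequenceInputStream, which reads its first stream to the end and then continues with the second. The sketch below is only an illustration, not part of the proposed patch. The class name StreamMergeSketch and the helpers merge and readAll are hypothetical, and the byte arrays stand in for the real globals and image data.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.util.Arrays;

// Hypothetical sketch: lazily concatenate two streams instead of copying them.
class StreamMergeSketch {

    // Returns a stream that yields every byte of globals, then every byte of body.
    static InputStream merge(InputStream globals, InputStream body) {
        return new SequenceInputStream(globals, body);
    }

    // Test helper: drain a stream into a byte array.
    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        int read;
        while ((read = in.read(buf)) != -1) {
            out.write(buf, 0, read);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in data; the real inputs would be the JBIG2Globals stream and the image stream.
        InputStream globals = new ByteArrayInputStream(new byte[] {1, 2, 3});
        InputStream body = new ByteArrayInputStream(new byte[] {4, 5});
        System.out.println(Arrays.toString(readAll(merge(globals, body)))); // prints [1, 2, 3, 4, 5]
    }
}
```

In decode, the merged stream would be handed to ImageIO.createImageInputStream in place of the ByteArrayInputStream. Whether the jbig2-imageio reader is happy with a purely sequential source has not been checked here, so treat that as an assumption.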