[jira] [Commented] (PDFBOX-1067) PDF Scan from Xerox WorkCentre 5030 renders as all black

Dave Smith (Commented) (JIRA) Sat, 10 Mar 2012 06:33:25 -0800

    [ 
https://issues.apache.org/jira/browse/PDFBOX-1067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13226867#comment-13226867
 ]


Dave Smith commented on PDFBOX-1067:
------------------------------------

If you read the PDF spec (3.3.6 JBIG2Decode Filter), it explicitly says to strip 
the JBIG2 file header, move the global segments (page association 0) into a 
separate JBIG2Globals stream, and drop the end-of-page and end-of-file segments. 
Because the header is missing, ImageIO will never find the right type of reader 
by inspecting the stream. So we have to load the reader manually and insert the 
globals segment before the rest of the stream data. Here is a proof of concept 
that you can try with http://code.google.com/p/jbig2-imageio

In org.apache.pdfbox.filter.JBIG2Filter, the new decode method:

    @Override
    public void decode( InputStream compressedData, OutputStream result,
            COSDictionary options, int filterIndex )
        throws IOException
    {
        Iterator<ImageReader> readers = ImageIO.getImageReadersByFormatName("JBIG2");
        if (!readers.hasNext())
        {
            log.error( "Can't find an ImageIO plugin to decode the JBIG2 encoded datastream.");
            return;
        }
        
        ImageReader reader = readers.next();
        
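        // JBIG2Globals is optional, so DecodeParms or the globals stream may be absent.
        COSDictionary decodeP = (COSDictionary) options.getDictionaryObject(COSName.DECODE_PARMS);
        COSStream st = null;
        if (decodeP != null)
        {
            st = (COSStream) decodeP.getDictionaryObject(COSName.getPDFName("JBIG2Globals"));
        }
        InputStream merged = (st != null)
                ? JBIG2StreamMerge(st.getFilteredStream(), compressedData)
                : compressedData;
        reader.setInput(ImageIO.createImageInputStream(merged));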
 
        BufferedImage bi = reader.read(0);
        if ( bi != null )
        {
            DataBuffer dBuf = bi.getData().getDataBuffer();
            if ( dBuf.getDataType() == DataBuffer.TYPE_BYTE )
            {
                result.write( ( ( DataBufferByte ) dBuf ).getData() );
            }
            else
            {
                log.error( "Image data buffer not of type byte but type " + dBuf.getDataType() );
            }
        }
        else
        {
            log.error( "Something went wrong when decoding the JBIG2 encoded datastream.");
        }
    }

    // ugly. Should use some sort of stream merge ...
    protected static InputStream JBIG2StreamMerge(InputStream globals, InputStream body)
        throws IOException
    {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        int read = globals.read(buf);
        while (read != -1)
        {
            out.write(buf, 0, read);
            read = globals.read(buf);
        }
        read = body.read(buf);
        while (read != -1)
        {
            out.write(buf, 0, read);
            read = body.read(buf);
        }
        out.close();
        return new ByteArrayInputStream(out.toByteArray());
    }
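
One possible way to do the "stream merge" without copying both streams into
memory is java.io.SequenceInputStream from the JDK, which reads the first
stream to its end and then continues with the second. This is only a minimal
sketch and has not been tried against PDFBox; the class name StreamMergeSketch
and the method name merge are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.SequenceInputStream;

public class StreamMergeSketch
{
    // Hypothetical helper: returns a stream that yields all of globals,
    // then all of body. A null globals stream is treated as
    // "no JBIG2Globals present" and body is returned unchanged.
    public static InputStream merge(InputStream globals, InputStream body)
    {
        if (globals == null)
        {
            return body;
        }
        return new SequenceInputStream(globals, body);
    }

    public static void main(String[] args) throws IOException
    {
        // Stand-in byte arrays; real input would be the globals and image streams.
        InputStream in = merge(new ByteArrayInputStream(new byte[] {1, 2}),
                               new ByteArrayInputStream(new byte[] {3, 4}));
        int b;
        while ((b = in.read()) != -1)
        {
            System.out.print(b + " ");
        }
        System.out.println();
    }
}
```

The result could be passed to ImageIO.createImageInputStream in the same way
as the return value of JBIG2StreamMerge, assuming the reader does not need
the whole stream buffered up front.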

                
> PDF Scan from Xerox WorkCentre 5030 renders as all black
> --------------------------------------------------------
>
>                 Key: PDFBOX-1067
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1067
>             Project: PDFBox
>          Issue Type: New Feature
>          Components: PDModel
>    Affects Versions: 1.6.0
>         Environment: Tested on MacOS X 10.6.7, Ubuntu 10.10, Windows 7
>            Reporter: Sarah Kelley
>         Attachments: ItDoesntWorkScan.pdf, sakelley_pdf_rendering_problem.zip
>
>
>     The file "ItDoesntWorkScan.pdf" renders to an empty
>     black page. This file is a copy of "ItDoesntWorkPrinted.pdf"
>     that has been printed on paper, and then scanned with
>     a Xerox WorkCentre 5030 scanner, which then emails a pdf file
>     back to the user.
>     Tested On:
>         - Mac OS 10.6
>         - Windows 7
>         - Ubuntu 10.10
>     Unfortunately, the WorkCentre 5030 doesn't appear to have
>     many user-settable options for scanning to PDF, so we weren't
>     really able to try scanning with settings other than the defaults.
> Will attach pdf and code to demonstrate.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
