PDFBox image extraction fails with an ArrayOutOfBoundsException in 
PDPixelMap.getRGBImage()
-------------------------------------------------------------------------------------------

                 Key: PDFBOX-574
                 URL: https://issues.apache.org/jira/browse/PDFBOX-574
             Project: PDFBox
          Issue Type: Bug
          Components: Utilities
    Affects Versions: 0.8.0-incubator
         Environment: Java
            Reporter: Ian Kaplan



The project that I'm working on has been using PDFBox for both text extraction 
and image extraction from PDF documents.  We wrote a class, PDFImageStripper, 
which extends PDFStreamEngine:

{code:Java}
public class PDFImageStripper extends PDFStreamEngine 
{code}

{code:Java}
    public List<ExtractedImage> getImages(PDDocument document, String documentFilename,
            File targetDirectory) throws IOException {
        resetEngine();

        this.document = document;
        this.documentFilename = documentFilename;
        this.targetDirectory = targetDirectory;

        currentImageNumber = 1;

        images.clear();
        writeImages();
        return images;
    }
{code}

{code:Java}
    private void writeImages() throws IOException {
        List<PDPage> pages = (List<PDPage>) document.getDocumentCatalog().getAllPages();
        for (PDPage page : pages) {
            if (page != null) {
                processStream(page, page.findResources(), page.getContents().getStream());
            }
        }
    }
{code}

The call chain at the point of failure is shown below (innermost frame first):

{noformat}
None.decode(byte[], byte[]) line: 57    
PDPixelMap.getRGBImage() line: 182      
PDPixelMap.write2OutputStream(OutputStream) line: 209   
PDPixelMap(PDXObjectImage).write2file(File) line: 142   
PDFImageStripper.saveImage(PDXObjectImage, String, File) line: 208      
PDFImageStripper.processOperator(PDFOperator, List) line: 155   
PDFImageStripper(PDFStreamEngine).processSubStream(PDPage, PDResources, COSStream) line: 229
PDFImageStripper(PDFStreamEngine).processStream(PDPage, PDResources, COSStream) line: 188
PDFImageStripper.writeImages() line: 113        
{noformat}

The decode method throws an ArrayIndexOutOfBoundsException.  The decode method 
is nothing more than a wrapper for a call to System.arraycopy():

{code:Java}
    public void decode(byte[] src, byte[] dest)
    {
        System.arraycopy(src,0,dest,0,src.length);
    }
{code}

The problem is that the source array is larger than the destination array, so 
copying src.length bytes overruns dest.  The Eclipse debugger shows the sizes:

{noformat}
src     byte[455112]  (id=171)  
dest    byte[435456]  (id=175)  
{noformat}
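
The failure can be reproduced outside PDFBox with the two array sizes reported above. This is only a minimal sketch: the class name ArrayCopyOverflowDemo and the helper overflows() are made up for illustration, and decode() is a copy of the wrapper quoted earlier, not the PDFBox class itself.

```java
final class ArrayCopyOverflowDemo {

    // Same body as the decode() wrapper quoted above: always copies src.length bytes.
    static void decode(byte[] src, byte[] dest) {
        System.arraycopy(src, 0, dest, 0, src.length);
    }

    // Hypothetical helper: reports whether decode() throws for the given array sizes.
    static boolean overflows(int srcLength, int destLength) {
        try {
            decode(new byte[srcLength], new byte[destLength]);
            return false;
        } catch (IndexOutOfBoundsException e) {
            // System.arraycopy rejects a copy that would run past the end of dest.
            return true;
        }
    }

    public static void main(String[] args) {
        // Sizes taken from the debugger output in this report.
        System.out.println(overflows(455112, 435456)); // prints "true"
        // Equal sizes copy cleanly.
        System.out.println(overflows(435456, 435456)); // prints "false"
    }
}
```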

The code that seems to be causing the problem is shown below.  The bug shows up 
on an LZW_DECODE stream, in the branch that calls filter.decode().  Note that 
the other (else) branch clamps the copy length to the smaller of the two 
arrays, so it cannot have a size problem.

{code:Java}
            if( predictor < 10 ||
                filters == null ||
                !(filters.contains( COSName.LZW_DECODE.getName()) ||
                  filters.contains( COSName.FLATE_DECODE.getName()) ) )
            {
                PredictorAlgorithm filter = PredictorAlgorithm.getFilter(predictor);
                filter.setWidth(width);
                filter.setHeight(height);
                filter.setBpp((bpc * 3) / 8);
                filter.decode(array, bufferData);
            }
            else
            {
                System.arraycopy( array, 0, bufferData, 0,
                        (array.length < bufferData.length ? array.length : bufferData.length) );
            }
{code}

One fix may be to simply change the code as follows (again, recall that the 
"decode" method is nothing but a wrapper for System.arraycopy()):

{code:Java}
            if( predictor < 10 ||
                filters == null ||
                !(filters.contains( COSName.LZW_DECODE.getName()) ||
                  filters.contains( COSName.FLATE_DECODE.getName()) ) )
            {
                PredictorAlgorithm filter = PredictorAlgorithm.getFilter(predictor);
                filter.setWidth(width);
                filter.setHeight(height);
                filter.setBpp((bpc * 3) / 8);
            }
            System.arraycopy( array, 0, bufferData, 0,
                    (array.length < bufferData.length ? array.length : bufferData.length) );
{code}
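
The clamped copy in the proposed fix can be isolated in a small helper that also reports how many bytes were copied, so a caller can tell when data was dropped. This is a sketch only: BoundedCopy and copyBounded() are hypothetical names, not part of PDFBox. With the sizes from this report, 455112 - 435456 = 19656 source bytes would be discarded, so clamping avoids the exception but does not explain why the decoded stream is larger than the image buffer.

```java
final class BoundedCopy {

    // Hypothetical helper: copies at most dest.length bytes from src into dest
    // and returns the number of bytes actually copied.
    static int copyBounded(byte[] src, byte[] dest) {
        int count = Math.min(src.length, dest.length);
        System.arraycopy(src, 0, dest, 0, count);
        return count;
    }

    public static void main(String[] args) {
        byte[] src = new byte[455112];   // size of "src" in the debugger output
        byte[] dest = new byte[435456];  // size of "dest" in the debugger output

        int copied = copyBounded(src, dest);
        int dropped = src.length - copied;

        System.out.println("copied=" + copied + " dropped=" + dropped);
        // prints "copied=435456 dropped=19656"
    }
}
```

Returning the count rather than void is a design choice for the sketch: it lets the caller log or reject a truncated image instead of silently losing the tail of the data.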

If Jira allows me to attach a file that causes this problem I will do so.

