[jira] [Created] (PDFBOX-4151) FlateFilter, LZWFilter causes double memory usage

Marek Pribula (JIRA) Tue, 13 Mar 2018 07:41:17 -0700

Marek Pribula created PDFBOX-4151:
-------------------------------------

             Summary: FlateFilter, LZWFilter causes double memory usage
                 Key: PDFBOX-4151
                 URL: https://issues.apache.org/jira/browse/PDFBOX-4151
             Project: PDFBox
          Issue Type: Bug
            Reporter: Marek Pribula



The problem occurred in production while processing a 400 kB file. The file 
was produced by a scanner at a resolution of 5960 x 8430 pixels with 8 bits 
per pixel (unfortunately, we have no control over the files we have to 
process). Our analysis showed that the problem lies in FlateFilter.decode, 
where the uncompressed data is written into a ByteArrayOutputStream. Since the 
final size of the data is not known to the stream in advance, its internal 
buffer grows through repeated calls to Arrays.copyOf. By the end of 
processing, this leads to memory usage of about twice the size of the 
uncompressed data.
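
The buffer growth described above can be observed outside PDFBox. The 
following is only a minimal sketch, not PDFBox code: the names BaosGrowthDemo, 
TrackingStream and countReallocations are made up for illustration, and the 
1 MiB size is an arbitrary stand-in for a decompressed stream.

```java
import java.io.ByteArrayOutputStream;

class BaosGrowthDemo {

    /** ByteArrayOutputStream subclass that exposes the length of the protected internal buffer. */
    static class TrackingStream extends ByteArrayOutputStream {
        int capacity() {
            return buf.length;
        }
    }

    /**
     * Writes totalBytes zero bytes in chunks of chunkSize and returns how often
     * the internal buffer was replaced by a larger one.
     */
    static int countReallocations(TrackingStream out, int totalBytes, int chunkSize) {
        byte[] chunk = new byte[chunkSize];
        int reallocations = 0;
        int lastCapacity = out.capacity();
        for (int written = 0; written < totalBytes; ) {
            int n = Math.min(chunkSize, totalBytes - written);
            out.write(chunk, 0, n);
            written += n;
            if (out.capacity() != lastCapacity) {
                // the stream allocated a bigger array and copied the old content into it
                reallocations++;
                lastCapacity = out.capacity();
            }
        }
        return reallocations;
    }

    public static void main(String[] args) {
        int total = 1 << 20; // arbitrary stand-in for the decompressed data size
        TrackingStream out = new TrackingStream();
        int reallocations = countReallocations(out, total, 2048);
        byte[] copy = out.toByteArray(); // one more full copy of the data
        System.out.println("bytes written:   " + out.size());
        System.out.println("buffer capacity: " + out.capacity());
        System.out.println("reallocations:   " + reallocations);
        System.out.println("copy length:     " + copy.length);
    }
}
```

During each reallocation the old and the new buffer are both reachable for a 
moment, and toByteArray() adds another full copy, which is in line with the 
peak we saw in production.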

What helped in our case was a slight modification of the decode method in 
FlateFilter and LZWFilter. Here is the original method body:
{code:java}
@Override
public DecodeResult decode(InputStream encoded, OutputStream decoded,
                           COSDictionary parameters, int index) throws IOException
{
    int predictor = -1;

    final COSDictionary decodeParams = getDecodeParams(parameters, index);
    if (decodeParams != null)
    {
        predictor = decodeParams.getInt(COSName.PREDICTOR);
    }

    try
    {
        if (predictor > 1)
        {
            int colors = Math.min(decodeParams.getInt(COSName.COLORS, 1), 32);
            int bitsPerPixel = decodeParams.getInt(COSName.BITS_PER_COMPONENT, 8);
            int columns = decodeParams.getInt(COSName.COLUMNS, 1);
            ByteArrayOutputStream baos = new ByteArrayOutputStream();
            decompress(encoded, baos);
            ByteArrayInputStream bais = new ByteArrayInputStream(baos.toByteArray());
            Predictor.decodePredictor(predictor, colors, bitsPerPixel, columns,
                    bais, decoded);
            decoded.flush();
            baos.reset();
            bais.reset();
        }
        else
        {
            decompress(encoded, decoded);
        }
    }
    catch (DataFormatException e)
    {
        // if the stream is corrupt a DataFormatException may occur
        LOG.error("FlateFilter: stop reading corrupt stream due to a DataFormatException");

        // re-throw the exception
        throw new IOException(e);
    }
    return new DecodeResult(parameters);
}
{code}
and here is our implementation:
{code:java}
@Override
public DecodeResult decode(InputStream encoded, OutputStream decoded,
                           COSDictionary parameters, int index) throws IOException
{
    int predictor = -1;

    final COSDictionary decodeParams = getDecodeParams(parameters, index);
    if (decodeParams != null)
    {
        // null guard kept from the original method body
        predictor = decodeParams.getInt(COSName.PREDICTOR);
    }

    try
    {
        if (predictor > 1)
        {
            File tempFile = null;
            FileOutputStream fos = null;
            FileInputStream fis = null;
            try
            {
                int colors = Math.min(decodeParams.getInt(COSName.COLORS, 1), 32);
                int bitsPerPixel = decodeParams.getInt(COSName.BITS_PER_COMPONENT, 8);
                int columns = decodeParams.getInt(COSName.COLUMNS, 1);
                // spool the decompressed data to a temp file instead of the heap
                tempFile = File.createTempFile("tmpPdf", null);
                fos = new FileOutputStream(tempFile);
                decompress(encoded, fos);
                fos.close();
                fis = new FileInputStream(tempFile);
                Predictor.decodePredictor(predictor, colors, bitsPerPixel, columns,
                        fis, decoded);
                decoded.flush();
            }
            finally
            {
                IOUtils.closeQuietly(fos);
                IOUtils.closeQuietly(fis);
                // try to delete but don't care if it fails;
                // tempFile is still null if createTempFile itself failed
                if (tempFile != null && !tempFile.delete())
                {
                    LOG.error("Could not delete temp data file");
                }
            }
        }
        else
        {
            decompress(encoded, decoded);
        }
    }
    catch (DataFormatException e)
    {
        // if the stream is corrupt a DataFormatException may occur
        LOG.error("FlateFilter: stop reading corrupt stream due to a DataFormatException");

        // re-throw the exception
        throw new IOException(e);
    }
    return new DecodeResult(parameters);
}
{code}
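
The same spooling idea can be tried without PDFBox on the classpath. The 
following is a rough, self-contained sketch using only java.util.zip and 
java.nio.file. It is not the PDFBox implementation: TempFileSpoolSketch, 
SecondPass, inflateViaTempFile and roundTrip are hypothetical names, and the 
second pass is a plain copy that merely stands in for predictor decoding.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.Inflater;
import java.util.zip.InflaterInputStream;

class TempFileSpoolSketch {

    /** Hypothetical second pass over the decompressed data (a stand-in for predictor decoding). */
    interface SecondPass {
        void apply(InputStream in, OutputStream out) throws IOException;
    }

    /** Plain stream copy with a small fixed buffer. */
    static void copy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
    }

    /** Inflates 'encoded' into a temp file, then streams that file through 'pass' into 'decoded'. */
    static void inflateViaTempFile(InputStream encoded, OutputStream decoded, SecondPass pass)
            throws IOException {
        Path temp = Files.createTempFile("tmpPdf", ".tmp");
        try {
            Inflater inflater = new Inflater();
            try {
                // the decompressed bytes go to disk, so the heap only holds small buffers
                Files.copy(new InflaterInputStream(encoded, inflater), temp,
                        StandardCopyOption.REPLACE_EXISTING);
            } finally {
                inflater.end(); // release the inflater without closing the caller's stream
            }
            try (InputStream spooled = Files.newInputStream(temp)) {
                pass.apply(spooled, decoded);
            }
            decoded.flush();
        } finally {
            Files.deleteIfExists(temp);
        }
    }

    /** Demo helper: deflates 'raw' in memory and inflates it again through the temp file. */
    static byte[] roundTrip(byte[] raw) {
        try {
            ByteArrayOutputStream compressed = new ByteArrayOutputStream();
            try (DeflaterOutputStream deflater = new DeflaterOutputStream(compressed)) {
                deflater.write(raw);
            }
            ByteArrayOutputStream decoded = new ByteArrayOutputStream();
            inflateViaTempFile(new ByteArrayInputStream(compressed.toByteArray()),
                    decoded, TempFileSpoolSketch::copy);
            return decoded.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The roundTrip helper uses in-memory streams only to keep the demo short. The 
trade-off of the approach itself is extra disk I/O in exchange for a smaller 
heap footprint.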
The picture OriginalFilters.png shows the memory usage while processing this 
file with the unmodified filters, and the picture ModifiedFilters.png shows 
the memory usage while processing the same file with the modified filters.

For testing purposes, we created two small applications with the same main 
class and main method but different libraries: TestOriginalFilters uses the 
filter implementation without any change, and TestModifiedFilters uses the 
filters with our modification. Since the original document contains personal 
data, we provide a substitute file (TEST.pdf) with almost the same resolution 
for internal testing. The application waits 10 seconds before it starts 
processing the file, to leave enough time for starting jvisualvm. The 
application can also handle multi-page documents. 
The MainTest class of the application:
{code:java}
package test;

import java.awt.Dimension;
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import javax.swing.ImageIcon;
import javax.swing.JLabel;

import org.apache.pdfbox.io.MemoryUsageSetting;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.rendering.PDFRenderer;

public class MainTest {

    // This is a simple test of the memory consumption of the PDFBox library
    public static void main(String[] args) {

        if (args.length != 1) {
            throw new IllegalArgumentException("File is needed to continue");
        }

        String fileName = args[0];

        try {
            System.out.println("start sleep for 10 seconds to start jvisualvm");
            Thread.sleep(10000);
            System.out.println("sleep is over");
        } catch (InterruptedException e) {
            e.printStackTrace();
        }

        File dataFile = new File(fileName);
        // try-with-resources closes the document even if rendering fails
        try (PDDocument document = PDDocument.load(dataFile,
                MemoryUsageSetting.setupMixed(64 * 1024))) {
            int pages = document.getNumberOfPages();
            PDFRenderer renderer = new PDFRenderer(document);
            // rendered pages are kept so that they stay reachable during the test
            List<BufferedImage> images = new ArrayList<>();
            for (int j = 0; j < pages; j++) {
                System.out.println("Processing page with index: " + j);
                long startTime = System.nanoTime();
                BufferedImage image = renderer.renderImage(j,
                        computeZoomFactor(document, j, 500));
                System.out.println("Page with index: " + j + " done in "
                        + (System.nanoTime() - startTime) / 1000000 + " ms");
                JLabel result = new JLabel(new ImageIcon(image));
                result.setPreferredSize(new Dimension(image.getWidth(), image.getHeight()));
                images.add(image);
            }

            System.out.println("Processing finished");
        } catch (IOException ioe) {
            ioe.printStackTrace();
        }

    }

    // scale factor that renders the page at the requested width
    private static float computeZoomFactor(PDDocument document, int pageIndex, float width) {
        float docWidth = document.getPage(pageIndex).getCropBox().getWidth();
        return width > 0 ? (width / docWidth) : 1.0f;
    }

}
{code}
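
As a complement to watching the process in jvisualvm, heap usage could also be 
logged from inside such a test program. This is only a sketch using the 
standard java.lang.management API. HeapProbe and its methods are hypothetical 
names and are not part of the attached test applications.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.util.Locale;

class HeapProbe {

    /** Returns the number of heap bytes currently in use, as reported by the JVM. */
    static long usedHeapBytes() {
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        return memory.getHeapMemoryUsage().getUsed();
    }

    /** Formats a byte count as MiB with one decimal place, independent of the default locale. */
    static String format(String label, long bytes) {
        return String.format(Locale.ROOT, "%s: %.1f MiB", label, bytes / (1024.0 * 1024.0));
    }

    public static void main(String[] args) {
        System.out.println(format("before", usedHeapBytes()));
        byte[] block = new byte[16 * 1024 * 1024]; // arbitrary allocation to make a change visible
        System.out.println(format("after", usedHeapBytes()));
        System.out.println("block length: " + block.length);
    }
}
```

Calls to usedHeapBytes() before and after renderImage would give a rough 
per-page figure. The values depend on when the garbage collector runs, so they 
are only an approximation of what a profiler shows.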


