[
https://issues.apache.org/jira/browse/PDFBOX-4151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16400205#comment-16400205
]
Itai Shaked edited comment on PDFBOX-4151 at 3/15/18 10:52 AM:
---------------------------------------------------------------
I'm attaching a patch that implements {{Predictor}} as a stream, so no extra
byte-array streams are created. I have tested it on a few files, but saw no
notable difference in either speed or memory footprint, as I couldn't find PDF
files with really huge Flate- or LZW-encoded images that use a predictor (the
biggest I could find was ~1800x600 pixels, or just over 3 MB, which I assume
is hardly noticeable).
It would be nice to test it on some really big images, but I don't know where I
could find such examples.
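For illustration, here is a minimal standalone sketch of the stream-based idea
(it is not the attached patch; the class name is made up, and only the PNG
"Up" filter, type 2, is handled): decoded bytes are produced row by row on
demand, so the whole image never has to sit in a byte array.
{code:java}
import java.io.EOFException;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

// Sketch only: undoes the PNG "Up" predictor while streaming.
// rowLength is the caller-computed row size in bytes,
// i.e. ceil(columns * colors * bitsPerComponent / 8).
class UpPredictorInputStream extends FilterInputStream
{
    private final byte[] prior; // previous decoded row (zeros before row 1)
    private final byte[] row;   // current decoded row
    private int pos;            // next byte of 'row' to hand out
    private int len;            // valid bytes in 'row'

    UpPredictorInputStream(InputStream in, int rowLength)
    {
        super(in);
        prior = new byte[rowLength];
        row = new byte[rowLength];
    }

    @Override
    public int read() throws IOException
    {
        if (pos == len && !fillRow())
        {
            return -1; // no more image data
        }
        return row[pos++] & 0xff;
    }

    @Override
    public int read(byte[] b, int off, int count) throws IOException
    {
        // route bulk reads through the row buffer as well
        int done = 0;
        while (done < count)
        {
            int c = read();
            if (c == -1)
            {
                return done == 0 ? -1 : done;
            }
            b[off + done++] = (byte) c;
        }
        return done;
    }

    private boolean fillRow() throws IOException
    {
        int filter = in.read(); // each row starts with a filter-type byte
        if (filter == -1)
        {
            return false;
        }
        if (filter != 2)
        {
            throw new IOException("only the Up filter is handled in this sketch");
        }
        readFully(row);
        for (int i = 0; i < row.length; i++)
        {
            row[i] = (byte) (row[i] + prior[i]); // Up: raw byte + byte above
        }
        System.arraycopy(row, 0, prior, 0, row.length);
        pos = 0;
        len = row.length;
        return true;
    }

    private void readFully(byte[] b) throws IOException
    {
        int off = 0;
        while (off < b.length)
        {
            int n = in.read(b, off, b.length - off);
            if (n == -1)
            {
                throw new EOFException("truncated predictor row");
            }
            off += n;
        }
    }
}
{code}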
P.S.: While working on it, I noticed that {{FlateFilter}} has the constant
{{int BUFFER_SIZE = 16348}} - I'm assuming that's a typo and it should be
16384 = 2^14^?
> FlateFilter, LZWFilter causes double memory usage
> -------------------------------------------------
>
> Key: PDFBOX-4151
> URL: https://issues.apache.org/jira/browse/PDFBOX-4151
> Project: PDFBox
> Issue Type: Bug
> Reporter: Marek Pribula
> Priority: Major
> Attachments: ModifiedFilters.png, OriginalFilters.png, TEST.pdf, predictor_stream.patch
>
>
> The problem occurred in our production while processing a file of about
> 400 kB. The file was generated by a scanner at a resolution of 5960 x 8430
> pixels with 8 bits per pixel (unfortunately we have no control over the
> files we have to process); decoded, that is 5960 x 8430 bytes, roughly 48 MB
> of image data. Our analysis showed that the problem is in FlateFilter.decode,
> where the uncompressed data is written into a ByteArrayOutputStream. Since
> the final size is unknown to the stream, its internal buffer grows through
> repeated Arrays.copyOf calls, and toByteArray() then allocates a full copy,
> so by the end of decoding, memory usage is about twice the decoded size.
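> To see the effect in isolation (a standalone sketch, not project code),
> writing ~48 MB through a ByteArrayOutputStream and then calling
> toByteArray() holds two full copies of the data at once:
> {code:java}
> import java.io.ByteArrayOutputStream;
>
> public class BaosGrowthDemo
> {
>     public static void main(String[] args)
>     {
>         // ~48 MB of decoded image data (5960 x 8430 pixels, 1 byte each)
>         final int decodedSize = 5960 * 8430;
>         ByteArrayOutputStream baos = new ByteArrayOutputStream();
>         byte[] chunk = new byte[16384];
>         int remaining = decodedSize;
>         while (remaining > 0)
>         {
>             int n = Math.min(chunk.length, remaining);
>             baos.write(chunk, 0, n);
>             remaining -= n;
>         }
>         // baos now holds an internal buffer >= decodedSize, and
>         // toByteArray() allocates a second full-size array, so the
>         // peak usage is roughly twice the decoded size.
>         byte[] copy = baos.toByteArray();
>         System.out.println("decoded bytes: " + copy.length);
>     }
> }
> {code}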
> What we tried, and what helped in our case, was a slight modification of the
> FlateFilter and LZWFilter decode method implementations. Here is the code
> snippet of the original method body:
> {code:java}
> @Override
> public DecodeResult decode(InputStream encoded, OutputStream decoded,
>                            COSDictionary parameters, int index) throws IOException
> {
>     int predictor = -1;
>     final COSDictionary decodeParams = getDecodeParams(parameters, index);
>     if (decodeParams != null)
>     {
>         predictor = decodeParams.getInt(COSName.PREDICTOR);
>     }
>     try
>     {
>         if (predictor > 1)
>         {
>             int colors = Math.min(decodeParams.getInt(COSName.COLORS, 1), 32);
>             int bitsPerPixel = decodeParams.getInt(COSName.BITS_PER_COMPONENT, 8);
>             int columns = decodeParams.getInt(COSName.COLUMNS, 1);
>             // the whole decompressed image is buffered in memory here ...
>             ByteArrayOutputStream baos = new ByteArrayOutputStream();
>             decompress(encoded, baos);
>             // ... and toByteArray() allocates a second full copy of it
>             ByteArrayInputStream bais = new ByteArrayInputStream(baos.toByteArray());
>             Predictor.decodePredictor(predictor, colors, bitsPerPixel, columns, bais, decoded);
>             decoded.flush();
>             // reset() only clears the count / read position;
>             // the buffers are not released
>             baos.reset();
>             bais.reset();
>         }
>         else
>         {
>             decompress(encoded, decoded);
>         }
>     }
>     catch (DataFormatException e)
>     {
>         // if the stream is corrupt a DataFormatException may occur
>         LOG.error("FlateFilter: stop reading corrupt stream due to a DataFormatException");
>         // re-throw the exception
>         throw new IOException(e);
>     }
>     return new DecodeResult(parameters);
> }
> {code}
> and here is our implementation:
> {code:java}
> @Override
> public DecodeResult decode(InputStream encoded, OutputStream decoded,
>                            COSDictionary parameters, int index) throws IOException
> {
>     int predictor = -1;
>     final COSDictionary decodeParams = getDecodeParams(parameters, index);
>     if (decodeParams != null)
>     {
>         predictor = decodeParams.getInt(COSName.PREDICTOR);
>     }
>     try
>     {
>         if (predictor > 1)
>         {
>             File tempFile = null;
>             FileOutputStream fos = null;
>             FileInputStream fis = null;
>             try
>             {
>                 int colors = Math.min(decodeParams.getInt(COSName.COLORS, 1), 32);
>                 int bitsPerPixel = decodeParams.getInt(COSName.BITS_PER_COMPONENT, 8);
>                 int columns = decodeParams.getInt(COSName.COLUMNS, 1);
>                 // spool the decompressed data to a temp file instead of
>                 // buffering the whole image in memory
>                 tempFile = File.createTempFile("tmpPdf", null);
>                 fos = new FileOutputStream(tempFile);
>                 decompress(encoded, fos);
>                 fos.close();
>                 fis = new FileInputStream(tempFile);
>                 Predictor.decodePredictor(predictor, colors, bitsPerPixel, columns, fis, decoded);
>                 decoded.flush();
>             }
>             finally
>             {
>                 IOUtils.closeQuietly(fos);
>                 IOUtils.closeQuietly(fis);
>                 if (tempFile != null)
>                 {
>                     try
>                     {
>                         // try to delete but don't care if it fails
>                         tempFile.delete();
>                     }
>                     catch (Exception e)
>                     {
>                         LOG.error("Could not delete temp data file", e);
>                     }
>                 }
>             }
>         }
>         else
>         {
>             decompress(encoded, decoded);
>         }
>     }
>     catch (DataFormatException e)
>     {
>         // if the stream is corrupt a DataFormatException may occur
>         LOG.error("FlateFilter: stop reading corrupt stream due to a DataFormatException");
>         // re-throw the exception
>         throw new IOException(e);
>     }
>     return new DecodeResult(parameters);
> }
> {code}
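> As a possible refinement (a sketch under our assumptions; {{TempFileSpool}}
> is a hypothetical helper, not part of PDFBox), the temp-file handling could
> be factored out so the file is deleted when the returned stream is closed:
> {code:java}
> import java.io.File;
> import java.io.FileInputStream;
> import java.io.FileOutputStream;
> import java.io.FilterInputStream;
> import java.io.IOException;
> import java.io.InputStream;
> import java.io.OutputStream;
>
> // Hypothetical helper: spools a stream to a temp file and returns an
> // InputStream over it that deletes the file on close().
> final class TempFileSpool
> {
>     private TempFileSpool()
>     {
>     }
>
>     static InputStream spool(InputStream in) throws IOException
>     {
>         final File tempFile = File.createTempFile("tmpPdf", null);
>         OutputStream out = new FileOutputStream(tempFile);
>         try
>         {
>             byte[] buffer = new byte[8192];
>             int n;
>             while ((n = in.read(buffer)) != -1)
>             {
>                 out.write(buffer, 0, n);
>             }
>         }
>         finally
>         {
>             out.close();
>         }
>         return new FilterInputStream(new FileInputStream(tempFile))
>         {
>             @Override
>             public void close() throws IOException
>             {
>                 super.close();
>                 // best-effort cleanup; deletion failure is not fatal
>                 tempFile.delete();
>             }
>         };
>     }
> }
> {code}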
> The picture OriginalFilters.png shows the memory usage while processing this
> file with the unmodified filters, and the picture ModifiedFilters.png shows
> the memory usage while processing the same file with the modified filters.
> For testing purposes, we have created two small applications with the same
> Main class and main method but different libraries (one, called
> TestOriginalFilters, uses the Filters implementation without any change; the
> second, called TestModifiedFilters, uses the Filters with our modification).
> Since the original document contains personal data, we provide a file
> (TEST.pdf) with almost the same resolution for internal testing. The
> application waits 10 seconds before it starts processing the file, to leave
> enough time to start jvisualvm. It can also handle multi-page documents. The
> application's MainTest class:
> {code:java}
> package test;
>
> import java.awt.Dimension;
> import java.awt.image.BufferedImage;
> import java.io.File;
> import java.io.IOException;
> import java.util.ArrayList;
> import java.util.List;
>
> import javax.swing.ImageIcon;
> import javax.swing.JLabel;
>
> import org.apache.pdfbox.io.MemoryUsageSetting;
> import org.apache.pdfbox.pdmodel.PDDocument;
> import org.apache.pdfbox.rendering.PDFRenderer;
>
> public class MainTest {
>
>     // This is a simple test of the amount of memory consumed by the PDFBox library
>     public static void main(String[] args) {
>         if (args.length != 1) {
>             throw new IllegalArgumentException("File is needed to continue");
>         }
>         String fileName = args[0];
>         try {
>             System.out.println("start sleep for 10 seconds to start jvisualvm");
>             Thread.sleep(10000);
>             System.out.println("sleep is over");
>         } catch (InterruptedException e) {
>             e.printStackTrace();
>         }
>         try {
>             File dataFile = new File(fileName);
>             PDDocument document = PDDocument.load(dataFile,
>                     MemoryUsageSetting.setupMixed(64 * 1024));
>             int pages = document.getNumberOfPages();
>             PDFRenderer renderer = new PDFRenderer(document);
>             List<BufferedImage> images = new ArrayList<>();
>             for (int j = 0; j < pages; j++) {
>                 System.out.println("Processing page with index: " + j);
>                 long startTime = System.nanoTime();
>                 BufferedImage image = renderer.renderImage(j,
>                         computeZoomFactor(document, j, 500));
>                 System.out.println("Page with index: " + j + " done in "
>                         + (System.nanoTime() - startTime) / 1000000);
>                 JLabel result = new JLabel(new ImageIcon(image));
>                 result.setPreferredSize(new Dimension(image.getWidth(), image.getHeight()));
>                 images.add(image);
>             }
>             document.close();
>             System.out.println("Processing finished");
>         } catch (IOException ioe) {
>             ioe.printStackTrace();
>         }
>     }
>
>     private static float computeZoomFactor(PDDocument document, int pageIndex, float width) {
>         float docWidth = document.getPage(pageIndex).getCropBox().getWidth();
>         return width > 0 ? (width / docWidth) : 1.0f;
>     }
> }
> {code}
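> Assuming the test jar is on the classpath together with the PDFBox jars (the
> exact jar names here are illustrative), the test can be started with:
> {code}
> java -cp pdfbox-app.jar:. test.MainTest TEST.pdf
> {code}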