[
https://issues.apache.org/jira/browse/PDFBOX-4151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16400205#comment-16400205
]
Itai Shaked edited comment on PDFBOX-4151 at 3/15/18 10:52 AM:
---------------------------------------------------------------
I'm attaching a patch that implements {{Predictor}} as a stream, so no extra
byte-array streams are created. I have tested it on a few files, but saw no
notable difference in either speed or memory footprint, as I couldn't find PDF
files with really huge Flate- or LZW-encoded images that use a predictor (the
biggest I could find was ~1800x600 pixels, or just over 3 MB, which I assume
is hardly noticeable).
It would be nice to test it on some really big images, but I don't know where I
could find such examples.
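For illustration, here is a minimal standalone sketch of the stream-based idea
(it is not the attached patch; the class name is made up, and only the PNG
"Up" filter, type 2, is handled): decoded bytes are produced row by row on
demand, so the whole image never has to sit in a byte array.
{code:java}
import java.io.EOFException;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

// Sketch only: undoes the PNG "Up" predictor while streaming.
// rowLength is the caller-computed row size in bytes,
// i.e. ceil(columns * colors * bitsPerComponent / 8).
class UpPredictorInputStream extends FilterInputStream
{
    private final byte[] prior; // previous decoded row (zeros before row 1)
    private final byte[] row;   // current decoded row
    private int pos;            // next byte of 'row' to hand out
    private int len;            // valid bytes in 'row'

    UpPredictorInputStream(InputStream in, int rowLength)
    {
        super(in);
        prior = new byte[rowLength];
        row = new byte[rowLength];
    }

    @Override
    public int read() throws IOException
    {
        if (pos == len && !fillRow())
        {
            return -1; // no more image data
        }
        return row[pos++] & 0xff;
    }

    @Override
    public int read(byte[] b, int off, int count) throws IOException
    {
        // route bulk reads through the row buffer as well
        int done = 0;
        while (done < count)
        {
            int c = read();
            if (c == -1)
            {
                return done == 0 ? -1 : done;
            }
            b[off + done++] = (byte) c;
        }
        return done;
    }

    private boolean fillRow() throws IOException
    {
        int filter = in.read(); // each row starts with a filter-type byte
        if (filter == -1)
        {
            return false;
        }
        if (filter != 2)
        {
            throw new IOException("only the Up filter is handled in this sketch");
        }
        readFully(row);
        for (int i = 0; i < row.length; i++)
        {
            row[i] = (byte) (row[i] + prior[i]); // Up: raw byte + byte above
        }
        System.arraycopy(row, 0, prior, 0, row.length);
        pos = 0;
        len = row.length;
        return true;
    }

    private void readFully(byte[] b) throws IOException
    {
        int off = 0;
        while (off < b.length)
        {
            int n = in.read(b, off, b.length - off);
            if (n == -1)
            {
                throw new EOFException("truncated predictor row");
            }
            off += n;
        }
    }
}
{code}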
P.S.: While working on it, I noticed that {{FlateFilter}} has the constant
{{int BUFFER_SIZE = 16348}} - I'm assuming that's a typo and it should be
16384 = 2^14^?
> FlateFilter, LZWFilter causes double memory usage
> -------------------------------------------------
>
> Key: PDFBOX-4151
> URL: https://issues.apache.org/jira/browse/PDFBOX-4151
> Project: PDFBox
> Issue Type: Bug
> Reporter: Marek Pribula
> Priority: Major
> Attachments: ModifiedFilters.png, OriginalFilters.png, TEST.pdf, predictor_stream.patch
>
>
> The problem occurred in our production while processing a file of about
> 400 kB. The file was generated by a scanner at a resolution of 5960 x 8430
> pixels with 8 bits per pixel (unfortunately we have no control over the
> files we have to process); decoded, that is 5960 x 8430 bytes, roughly 48 MB
> of image data. Our analysis showed that the problem is in FlateFilter.decode,
> where the uncompressed data is written into a ByteArrayOutputStream. Since
> the final size is unknown to the stream, its internal buffer grows through
> repeated Arrays.copyOf calls, and toByteArray() then allocates a full copy,
> so by the end of decoding, memory usage is about twice the decoded size.
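> To see the effect in isolation (a standalone sketch, not project code),
> writing ~48 MB through a ByteArrayOutputStream and then calling
> toByteArray() holds two full copies of the data at once:
> {code:java}
> import java.io.ByteArrayOutputStream;
>
> public class BaosGrowthDemo
> {
>     public static void main(String[] args)
>     {
>         // ~48 MB of decoded image data (5960 x 8430 pixels, 1 byte each)
>         final int decodedSize = 5960 * 8430;
>         ByteArrayOutputStream baos = new ByteArrayOutputStream();
>         byte[] chunk = new byte[16384];
>         int remaining = decodedSize;
>         while (remaining > 0)
>         {
>             int n = Math.min(chunk.length, remaining);
>             baos.write(chunk, 0, n);
>             remaining -= n;
>         }
>         // baos now holds an internal buffer >= decodedSize, and
>         // toByteArray() allocates a second full-size array, so the
>         // peak usage is roughly twice the decoded size.
>         byte[] copy = baos.toByteArray();
>         System.out.println("decoded bytes: " + copy.length);
>     }
> }
> {code}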
> What we tried, and what helped in our case, was a slight modification of the
> FlateFilter and LZWFilter decode method implementations. Here is the code
> snippet of the original method body:
> {code:java}
> @Override
> public DecodeResult decode(InputStream encoded, OutputStream decoded,
>                            COSDictionary parameters, int index) throws IOException
> {
>     int predictor = -1;
>     final COSDictionary decodeParams = getDecodeParams(parameters, index);
>     if (decodeParams != null)
>     {
>         predictor = decodeParams.getInt(COSName.PREDICTOR);
>     }
>     try
>     {
>         if (predictor > 1)
>         {
>             int colors = Math.min(decodeParams.getInt(COSName.COLORS, 1), 32);
>             int bitsPerPixel = decodeParams.getInt(COSName.BITS_PER_COMPONENT, 8);
>             int columns = decodeParams.getInt(COSName.COLUMNS, 1);
>             // the whole decompressed image is buffered in memory here ...
>             ByteArrayOutputStream baos = new ByteArrayOutputStream();
>             decompress(encoded, baos);
>             // ... and toByteArray() allocates a second full copy of it
>             ByteArrayInputStream bais = new ByteArrayInputStream(baos.toByteArray());
>             Predictor.decodePredictor(predictor, colors, bitsPerPixel, columns, bais, decoded);
>             decoded.flush();
>             // reset() only clears the count / read position;
>             // the buffers are not released
>             baos.reset();
>             bais.reset();
>         }
>         else
>         {
>             decompress(encoded, decoded);
>         }
>     }
>     catch (DataFormatException e)
>     {
>         // if the stream is corrupt a DataFormatException may occur
>         LOG.error("FlateFilter: stop reading corrupt stream due to a DataFormatException");
>         // re-throw the exception
>         throw new IOException(e);
>     }
>     return new DecodeResult(parameters);
> }
> {code}
> and here is our implementation:
> {code:java}
> @Override
> public DecodeResult decode(InputStream encoded, OutputStream decoded,
>                            COSDictionary parameters, int index) throws IOException
> {
>     int predictor = -1;
>     final COSDictionary decodeParams = getDecodeParams(parameters, index);
>     if (decodeParams != null)
>     {
>         predictor = decodeParams.getInt(COSName.PREDICTOR);
>     }
>     try
>     {
>         if (predictor > 1)
>         {
>             File tempFile = null;
>             FileOutputStream fos = null;
>             FileInputStream fis = null;
>             try
>             {
>                 int colors = Math.min(decodeParams.getInt(COSName.COLORS, 1), 32);
>                 int bitsPerPixel = decodeParams.getInt(COSName.BITS_PER_COMPONENT, 8);
>                 int columns = decodeParams.getInt(COSName.COLUMNS, 1);
>                 // spool the decompressed data to a temp file instead of
>                 // buffering the whole image in memory
>                 tempFile = File.createTempFile("tmpPdf", null);
>                 fos = new FileOutputStream(tempFile);
>                 decompress(encoded, fos);
>                 fos.close();
>                 fis = new FileInputStream(tempFile);
>                 Predictor.decodePredictor(predictor, colors, bitsPerPixel, columns, fis, decoded);
>                 decoded.flush();
>             }
>             finally
>             {
>                 IOUtils.closeQuietly(fos);
>                 IOUtils.closeQuietly(fis);
>                 if (tempFile != null)
>                 {
>                     try
>                     {
>                         // try to delete but don't care if it fails
>                         tempFile.delete();
>                     }
>                     catch (Exception e)
>                     {
>                         LOG.error("Could not delete temp data file", e);
>                     }
>                 }
>             }
>         }
>         else
>         {
>             decompress(encoded, decoded);
>         }
>     }
>     catch (DataFormatException e)
>     {
>         // if the stream is corrupt a DataFormatException may occur
>         LOG.error("FlateFilter: stop reading corrupt stream due to a DataFormatException");
>         // re-throw the exception
>         throw new IOException(e);
>     }
>     return new DecodeResult(parameters);
> }
> {code}
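> As a possible refinement (a sketch under our assumptions; {{TempFileSpool}}
> is a hypothetical helper, not part of PDFBox), the temp-file handling could
> be factored out so the file is deleted when the returned stream is closed:
> {code:java}
> import java.io.File;
> import java.io.FileInputStream;
> import java.io.FileOutputStream;
> import java.io.FilterInputStream;
> import java.io.IOException;
> import java.io.InputStream;
> import java.io.OutputStream;
>
> // Hypothetical helper: spools a stream to a temp file and returns an
> // InputStream over it that deletes the file on close().
> final class TempFileSpool
> {
>     private TempFileSpool()
>     {
>     }
>
>     static InputStream spool(InputStream in) throws IOException
>     {
>         final File tempFile = File.createTempFile("tmpPdf", null);
>         OutputStream out = new FileOutputStream(tempFile);
>         try
>         {
>             byte[] buffer = new byte[8192];
>             int n;
>             while ((n = in.read(buffer)) != -1)
>             {
>                 out.write(buffer, 0, n);
>             }
>         }
>         finally
>         {
>             out.close();
>         }
>         return new FilterInputStream(new FileInputStream(tempFile))
>         {
>             @Override
>             public void close() throws IOException
>             {
>                 super.close();
>                 // best-effort cleanup; deletion failure is not fatal
>                 tempFile.delete();
>             }
>         };
>     }
> }
> {code}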
> The picture OriginalFilters.png shows the memory usage while processing this
> file with the unmodified filters, and the picture ModifiedFilters.png shows
> the memory usage while processing the same file with the modified filters.
> For testing purposes, we have created two small applications with the same
> Main class and main method but different libraries (one, called
> TestOriginalFilters, uses the Filters implementation without any change; the
> second, called TestModifiedFilters, uses the Filters with our modification).
> Since the original document contains personal data, we provide a file
> (TEST.pdf) with almost the same resolution for internal testing. The
> application waits 10 seconds before it starts processing the file, to leave
> enough time to start jvisualvm. It can also handle multi-page documents. The
> application's MainTest class:
> {code:java}
> package test;
>
> import java.awt.Dimension;
> import java.awt.image.BufferedImage;
> import java.io.File;
> import java.io.IOException;
> import java.util.ArrayList;
> import java.util.List;
>
> import javax.swing.ImageIcon;
> import javax.swing.JLabel;
>
> import org.apache.pdfbox.io.MemoryUsageSetting;
> import org.apache.pdfbox.pdmodel.PDDocument;
> import org.apache.pdfbox.rendering.PDFRenderer;
>
> public class MainTest {
>
>     // This is a simple test of the amount of memory consumed by the PDFBox library
>     public static void main(String[] args) {
>         if (args.length != 1) {
>             throw new IllegalArgumentException("File is needed to continue");
>         }
>         String fileName = args[0];
>         try {
>             System.out.println("start sleep for 10 seconds to start jvisualvm");
>             Thread.sleep(10000);
>             System.out.println("sleep is over");
>         } catch (InterruptedException e) {
>             e.printStackTrace();
>         }
>         try {
>             File dataFile = new File(fileName);
>             PDDocument document = PDDocument.load(dataFile,
>                     MemoryUsageSetting.setupMixed(64 * 1024));
>             int pages = document.getNumberOfPages();
>             PDFRenderer renderer = new PDFRenderer(document);
>             List<BufferedImage> images = new ArrayList<>();
>             for (int j = 0; j < pages; j++) {
>                 System.out.println("Processing page with index: " + j);
>                 long startTime = System.nanoTime();
>                 BufferedImage image = renderer.renderImage(j,
>                         computeZoomFactor(document, j, 500));
>                 System.out.println("Page with index: " + j + " done in "
>                         + (System.nanoTime() - startTime) / 1000000);
>                 JLabel result = new JLabel(new ImageIcon(image));
>                 result.setPreferredSize(new Dimension(image.getWidth(), image.getHeight()));
>                 images.add(image);
>             }
>             document.close();
>             System.out.println("Processing finished");
>         } catch (IOException ioe) {
>             ioe.printStackTrace();
>         }
>     }
>
>     private static float computeZoomFactor(PDDocument document, int pageIndex, float width) {
>         float docWidth = document.getPage(pageIndex).getCropBox().getWidth();
>         return width > 0 ? (width / docWidth) : 1.0f;
>     }
> }
> {code}
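> Assuming the test jar is on the classpath together with the PDFBox jars (the
> exact jar names here are illustrative), the test can be started with:
> {code}
> java -cp pdfbox-app.jar:. test.MainTest TEST.pdf
> {code}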