[ https://issues.apache.org/jira/browse/PDFBOX-4151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16398382#comment-16398382 ]

Itai Shaked edited comment on PDFBOX-4151 at 3/14/18 12:07 PM:
---------------------------------------------------------------

Note that subsampling in the filters is implemented only for JPEG (DCT), 
JBIG2 and JPX, as it relies on ImageIO subsampling. Since Flate and LZW can 
be used for non-image data, and since at the filter level they have no real 
image-decoding mechanism, they ignore the subsampling options. I believe I 
pointed this out in my original mail about the subsampling feature, but I may 
have neglected to mention it in the actual issue (PDFBOX-4137).

{{SampledImageReader}} will still allocate a smaller {{BufferedImage}} if 
subsampling is enabled and used (and will effectively perform the subsampling 
itself), but the memory and time savings won't be as dramatic as in the case of 
JPEG, JBIG2 and JPX streams.
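For context, the ImageIO mechanism those decoders rely on is {{ImageReadParam.setSourceSubsampling}}: the reader itself allocates only the smaller raster, which is where the savings come from. A minimal, self-contained sketch (using a PNG reader rather than PDFBox's own codecs; class and method names are mine):

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import javax.imageio.ImageIO;
import javax.imageio.ImageReadParam;
import javax.imageio.ImageReader;
import javax.imageio.stream.ImageInputStream;

public class SubsamplingSketch {

    // Decode only every 'step'-th pixel in x and y; the reader allocates
    // the smaller raster up front instead of decoding full size and scaling.
    static String subsampledSize(int width, int height, int step) throws IOException {
        // Build an image of the requested size and encode it to PNG in memory.
        BufferedImage src = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        ImageIO.write(src, "png", baos);

        // Read it back with subsampling requested.
        ImageInputStream iis =
                ImageIO.createImageInputStream(new ByteArrayInputStream(baos.toByteArray()));
        ImageReader reader = ImageIO.getImageReaders(iis).next();
        reader.setInput(iis);
        ImageReadParam param = reader.getDefaultReadParam();
        param.setSourceSubsampling(step, step, 0, 0);
        BufferedImage small = reader.read(0, param);
        reader.dispose();
        iis.close();
        return small.getWidth() + "x" + small.getHeight();
    }

    public static void main(String[] args) throws IOException {
        // ceil(400/4) x ceil(300/4)
        System.out.println(subsampledSize(400, 300, 4)); // prints 100x75
    }
}
```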

Edit: Actually, it looks like it may be possible (though not very simple) to 
implement subsampling inside the {{Predictor}} class. Looking at the file 
attached here (TEST.pdf), it has a JPEG (DCT-encoded) image, so subsampling 
does help with both memory and processing time when rendering it (if used). I'm 
confused as to how the change in {{FlateFilter}} affects it, though.



> FlateFilter, LZWFilter causes double memory usage
> -------------------------------------------------
>
>                 Key: PDFBOX-4151
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4151
>             Project: PDFBox
>          Issue Type: Bug
>            Reporter: Marek Pribula
>            Priority: Major
>         Attachments: ModifiedFilters.png, OriginalFilters.png, TEST.pdf
>
>
> The problem occurred in our production environment while processing a file of 
> about 400 kB. The file was generated by a scanner at a resolution of 5960 x 8430 
> pixels with 8 bits per pixel (unfortunately we have no control over the files 
> we have to process). Our analysis showed that the problem is in 
> FlateFilter.decode, where the uncompressed data is written into a 
> ByteArrayOutputStream. Since the final size of the data is unknown to the 
> OutputStream, its buffer grows through internal calls to Arrays.copyOf. By the 
> end of processing, this leads to memory usage of about twice the decoded size.
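The doubling described above is inherent to how {{ByteArrayOutputStream}} is used here: {{toByteArray()}} always returns a fresh copy of the internal buffer, so the decoded data briefly exists twice. A minimal sketch of that behaviour (class and method names are mine, not PDFBox's):

```java
import java.io.ByteArrayOutputStream;

public class BaosGrowthSketch {

    // Write 'chunks' x 1 KB into a ByteArrayOutputStream, then copy it out.
    // toByteArray() allocates a second full-size array, so at that moment the
    // data is held twice: once in the stream's internal buffer (itself grown
    // by repeated Arrays.copyOf calls) and once in the returned copy.
    static int fillAndCopy(int chunks) {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        byte[] chunk = new byte[1024];
        for (int i = 0; i < chunks; i++) {
            baos.write(chunk, 0, chunk.length); // may trigger internal buffer growth
        }
        byte[] copy = baos.toByteArray();       // second full-size array
        return copy.length;
    }

    public static void main(String[] args) {
        System.out.println(fillAndCopy(1024)); // prints 1048576 (1 MB, held twice)
    }
}
```

Spooling to a temp file, as in the modification below in this issue, trades that second in-memory copy for disk I/O.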
> What we tried, and what helped in our case, was a slight modification of the 
> FlateFilter and LZWFilter decode method implementations. Here is the code 
> snippet of the original method body:
> {code:java}
> @Override
> public DecodeResult decode(InputStream encoded, OutputStream decoded,
>                            COSDictionary parameters, int index) throws IOException
> {
>     int predictor = -1;
>     final COSDictionary decodeParams = getDecodeParams(parameters, index);
>     if (decodeParams != null)
>     {
>         predictor = decodeParams.getInt(COSName.PREDICTOR);
>     }
>     try
>     {
>         if (predictor > 1)
>         {
>             int colors = Math.min(decodeParams.getInt(COSName.COLORS, 1), 32);
>             int bitsPerPixel = decodeParams.getInt(COSName.BITS_PER_COMPONENT, 8);
>             int columns = decodeParams.getInt(COSName.COLUMNS, 1);
>             ByteArrayOutputStream baos = new ByteArrayOutputStream();
>             decompress(encoded, baos);
>             ByteArrayInputStream bais = new ByteArrayInputStream(baos.toByteArray());
>             Predictor.decodePredictor(predictor, colors, bitsPerPixel, columns, bais, decoded);
>             decoded.flush();
>             baos.reset();
>             bais.reset();
>         }
>         else
>         {
>             decompress(encoded, decoded);
>         }
>     }
>     catch (DataFormatException e)
>     {
>         // if the stream is corrupt a DataFormatException may occur
>         LOG.error("FlateFilter: stop reading corrupt stream due to a DataFormatException");
>         // re-throw the exception
>         throw new IOException(e);
>     }
>     return new DecodeResult(parameters);
> }
> {code}
> and here is our implementation:
> {code:java}
> @Override
> public DecodeResult decode(InputStream encoded, OutputStream decoded,
>                            COSDictionary parameters, int index) throws IOException
> {
>     int predictor = -1;
>     final COSDictionary decodeParams = getDecodeParams(parameters, index);
>     if (decodeParams != null)
>     {
>         predictor = decodeParams.getInt(COSName.PREDICTOR);
>     }
>     try
>     {
>         if (predictor > 1)
>         {
>             File tempFile = null;
>             FileOutputStream fos = null;
>             FileInputStream fis = null;
>             try
>             {
>                 int colors = Math.min(decodeParams.getInt(COSName.COLORS, 1), 32);
>                 int bitsPerPixel = decodeParams.getInt(COSName.BITS_PER_COMPONENT, 8);
>                 int columns = decodeParams.getInt(COSName.COLUMNS, 1);
>                 // spool the decompressed data to a temp file instead of a
>                 // ByteArrayOutputStream, so it is never held twice in memory
>                 tempFile = File.createTempFile("tmpPdf", null);
>                 fos = new FileOutputStream(tempFile);
>                 decompress(encoded, fos);
>                 fos.close();
>                 fis = new FileInputStream(tempFile);
>                 Predictor.decodePredictor(predictor, colors, bitsPerPixel, columns, fis, decoded);
>                 decoded.flush();
>             }
>             finally
>             {
>                 IOUtils.closeQuietly(fos);
>                 IOUtils.closeQuietly(fis);
>                 // try to delete but don't care if it fails
>                 if (tempFile != null && !tempFile.delete())
>                 {
>                     LOG.error("Could not delete temp data file " + tempFile);
>                 }
>             }
>         }
>         else
>         {
>             decompress(encoded, decoded);
>         }
>     }
>     catch (DataFormatException e)
>     {
>         // if the stream is corrupt a DataFormatException may occur
>         LOG.error("FlateFilter: stop reading corrupt stream due to a DataFormatException");
>         // re-throw the exception
>         throw new IOException(e);
>     }
>     return new DecodeResult(parameters);
> }
> {code}
> The picture OriginalFilters.png shows memory usage while processing this 
> file with the unmodified filters, and the picture ModifiedFilters.png shows 
> memory usage while processing the same file with the modified filters.
> For testing purposes, we created two small applications with the same Main 
> class and main method but different libraries (one, called 
> TestOriginalFilters, uses the Filters implementation without any change; the 
> second, called TestModifiedFilters, uses the Filters with our modification). 
> Since the original document contains personal data, we provide a file 
> (TEST.pdf) with almost the same resolution for internal testing. The 
> application waits for 10 seconds before it starts processing the file, to 
> leave enough time to start jvisualvm. The application also handles 
> multi-page documents. The application's MainTest class:
> {code:java}
> package test;
> 
> import java.awt.Dimension;
> import java.awt.image.BufferedImage;
> import java.io.File;
> import java.io.IOException;
> import java.util.ArrayList;
> import java.util.List;
> import javax.swing.ImageIcon;
> import javax.swing.JLabel;
> import org.apache.pdfbox.io.MemoryUsageSetting;
> import org.apache.pdfbox.pdmodel.PDDocument;
> import org.apache.pdfbox.rendering.PDFRenderer;
> 
> public class MainTest {
>     // This is a simple test of the amount of memory consumed by the PDFBox library
>     public static void main(String[] args) {
>         if (args.length != 1) {
>             throw new IllegalArgumentException("File is needed to continue");
>         }
>         String fileName = args[0];
>         try {
>             System.out.println("start sleep for 10 seconds to start jvisualvm");
>             Thread.sleep(10000);
>             System.out.println("sleep is over");
>         } catch (InterruptedException e) {
>             e.printStackTrace();
>         }
>         try {
>             File dataFile = new File(fileName);
>             PDDocument document = PDDocument.load(dataFile, MemoryUsageSetting.setupMixed(64 * 1024));
>             int pages = document.getNumberOfPages();
>             PDFRenderer renderer = new PDFRenderer(document);
>             List<BufferedImage> images = new ArrayList<>();
>             for (int j = 0; j < pages; j++) {
>                 System.out.println("Processing page with index: " + j);
>                 long startTime = System.nanoTime();
>                 BufferedImage image = renderer.renderImage(j, computeZoomFactor(document, j, 500));
>                 System.out.println("Page with index: " + j + " done in "
>                         + (System.nanoTime() - startTime) / 1000000);
>                 JLabel result = new JLabel(new ImageIcon(image));
>                 result.setPreferredSize(new Dimension(image.getWidth(), image.getHeight()));
>                 images.add(image);
>             }
>             document.close();
>             System.out.println("Processing finished");
>         } catch (IOException ioe) {
>             ioe.printStackTrace();
>         }
>     }
> 
>     private static float computeZoomFactor(PDDocument document, int pageIndex, float width) {
>         float docWidth = document.getPage(pageIndex).getCropBox().getWidth();
>         return width > 0 ? (width / docWidth) : 1.0f;
>     }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
