[ https://issues.apache.org/jira/browse/PDFBOX-4151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tilman Hausherr updated PDFBOX-4151:
------------------------------------
    Attachment: pop-bugzilla93476.pdf

FlateFilter, LZWFilter causes double memory usage
-------------------------------------------------

                Key: PDFBOX-4151
                URL: https://issues.apache.org/jira/browse/PDFBOX-4151
            Project: PDFBox
         Issue Type: Bug
           Reporter: Marek Pribula
           Priority: Major
        Attachments: ModifiedFilters.png, OriginalFilters.png, TEST.pdf, gs-bugzilla690022.pdf, pop-bugzilla93476.pdf, predictor_stream.patch

The problem occurred in our production environment while processing a file of about 400 kB. The file was generated by a scanner at a resolution of 5960 x 8430 pixels with 8 bits per pixel (unfortunately we have no control over the files that have to be processed). Our analysis showed that the problem is in FlateFilter.decode, where the uncompressed data are written into a ByteArrayOutputStream. Since the final size is unknown to the OutputStream, its internal buffer keeps growing through repeated Arrays.copyOf calls, and baos.toByteArray() in the snippet below then allocates one more full-size copy. By the end of processing, this leads to memory usage of roughly twice the size of the decompressed data.

What we tried, and what helped in our case, was a slight modification of the decode method in FlateFilter and LZWFilter. Here is the original method body:

{code:java}
@Override
public DecodeResult decode(InputStream encoded, OutputStream decoded,
                           COSDictionary parameters, int index) throws IOException
{
    int predictor = -1;

    final COSDictionary decodeParams = getDecodeParams(parameters, index);
    if (decodeParams != null)
    {
        predictor = decodeParams.getInt(COSName.PREDICTOR);
    }

    try
    {
        if (predictor > 1)
        {
            int colors = Math.min(decodeParams.getInt(COSName.COLORS, 1), 32);
            int bitsPerPixel = decodeParams.getInt(COSName.BITS_PER_COMPONENT, 8);
            int columns = decodeParams.getInt(COSName.COLUMNS, 1);
            ByteArrayOutputStream baos = new ByteArrayOutputStream();
            decompress(encoded, baos);
            ByteArrayInputStream bais = new ByteArrayInputStream(baos.toByteArray());
            Predictor.decodePredictor(predictor, colors, bitsPerPixel, columns, bais, decoded);
            decoded.flush();
            baos.reset();
            bais.reset();
        }
        else
        {
            decompress(encoded, decoded);
        }
    }
    catch (DataFormatException e)
    {
        // if the stream is corrupt a DataFormatException may occur
        LOG.error("FlateFilter: stop reading corrupt stream due to a DataFormatException");

        // re-throw the exception
        throw new IOException(e);
    }
    return new DecodeResult(parameters);
}
{code}

and here is our implementation:

{code:java}
@Override
public DecodeResult decode(InputStream encoded, OutputStream decoded,
                           COSDictionary parameters, int index) throws IOException
{
    final COSDictionary decodeParams = getDecodeParams(parameters, index);
    int predictor = decodeParams.getInt(COSName.PREDICTOR);

    try
    {
        if (predictor > 1)
        {
            File tempFile = null;
            FileOutputStream fos = null;
            FileInputStream fis = null;
            try
            {
                int colors = Math.min(decodeParams.getInt(COSName.COLORS, 1), 32);
                int bitsPerPixel = decodeParams.getInt(COSName.BITS_PER_COMPONENT, 8);
                int columns = decodeParams.getInt(COSName.COLUMNS, 1);
                // decompress into a temp file instead of an in-memory buffer
                tempFile = File.createTempFile("tmpPdf", null);
                fos = new FileOutputStream(tempFile);
                decompress(encoded, fos);
                fos.close();
                fis = new FileInputStream(tempFile);
                Predictor.decodePredictor(predictor, colors, bitsPerPixel, columns, fis, decoded);
                decoded.flush();
            }
            finally
            {
                IOUtils.closeQuietly(fos);
                IOUtils.closeQuietly(fis);
                try
                {
                    // try to delete but don't care if it fails
                    tempFile.delete();
                }
                catch (Exception e)
                {
                    LOG.error("Could not delete temp data file", e);
                }
            }
        }
        else
        {
            decompress(encoded, decoded);
        }
    }
    catch (DataFormatException e)
    {
        // if the stream is corrupt a DataFormatException may occur
        LOG.error("FlateFilter: stop reading corrupt stream due to a DataFormatException");

        // re-throw the exception
        throw new IOException(e);
    }
    return new DecodeResult(parameters);
}
{code}
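As a side note (this is only a sketch, not the attached patch): the predictor branch of the same temp-file idea could also be written with try-with-resources, so the streams are always closed and the temporary file is removed even if decodePredictor throws. decompress() and Predictor.decodePredictor() are the methods from the snippets above; everything else here is an assumption.

{code:java}
if (predictor > 1)
{
    int colors = Math.min(decodeParams.getInt(COSName.COLORS, 1), 32);
    int bitsPerPixel = decodeParams.getInt(COSName.BITS_PER_COMPONENT, 8);
    int columns = decodeParams.getInt(COSName.COLUMNS, 1);

    // sketch only: decompress to a temp file, read it back for predictor decoding
    java.nio.file.Path tempFile = java.nio.file.Files.createTempFile("tmpPdf", null);
    try
    {
        try (OutputStream fos = java.nio.file.Files.newOutputStream(tempFile))
        {
            decompress(encoded, fos);
        }
        try (InputStream fis = java.nio.file.Files.newInputStream(tempFile))
        {
            Predictor.decodePredictor(predictor, colors, bitsPerPixel, columns, fis, decoded);
        }
        decoded.flush();
    }
    finally
    {
        // best-effort cleanup; a failure to delete is not fatal here
        try
        {
            java.nio.file.Files.deleteIfExists(tempFile);
        }
        catch (IOException e)
        {
            LOG.error("Could not delete temp data file", e);
        }
    }
}
{code}

Deleting in a finally block keeps temp files from accumulating when decoding of a corrupt stream fails halfway through.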
The picture OriginalFilters.png shows the memory usage while processing this file with the unmodified filters, and the picture ModifiedFilters.png shows the memory usage while processing the same file with the modified filters.

For testing purposes we have created two small applications with the same Main class and main method but different libraries: TestOriginalFilters uses the Filters implementation without any change, and TestModifiedFilters uses the Filters with our modification. Since the original document contains personal data, we provide the attached file (TEST.pdf), which has almost the same resolution, for internal testing. The application waits for 10 seconds before it starts processing the file, to leave enough time to start jvisualvm, and it also handles multi-page documents. The application's MainTest class:

{code:java}
package test;

import java.awt.Dimension;
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import javax.swing.ImageIcon;
import javax.swing.JLabel;

import org.apache.pdfbox.io.MemoryUsageSetting;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.rendering.PDFRenderer;

public class MainTest {

    // This is a simple test of the amount of memory consumed by the PDFBox library
    public static void main(String[] args) {
        if (args.length != 1) {
            throw new IllegalArgumentException("File is needed to continue");
        }
        String fileName = args[0];
        try {
            System.out.println("start sleep for 10 seconds to start jvisualvm");
            Thread.sleep(10000);
            System.out.println("sleep is over");
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
        try {
            File dataFile = new File(fileName);
            PDDocument document = PDDocument.load(dataFile, MemoryUsageSetting.setupMixed(64 * 1024));
            int pages = document.getNumberOfPages();
            PDFRenderer renderer = new PDFRenderer(document);
            List<BufferedImage> images = new ArrayList<>();
            for (int j = 0; j < pages; j++) {
                System.out.println("Processing page with index: " + j);
                long startime = System.nanoTime();
                BufferedImage image = renderer.renderImage(j, computeZoomFactor(document, j, 500));
                System.out.println("Page with index: " + j + " done in " + (System.nanoTime() - startime) / 1000000);
                JLabel result = new JLabel(new ImageIcon(image));
                result.setPreferredSize(new Dimension(image.getWidth(), image.getHeight()));
                images.add(image);
            }
            System.out.println("Processing finished");
        } catch (IOException ioe) {
            ioe.printStackTrace();
        }
    }

    private static float computeZoomFactor(PDDocument document, int pageIndex, float width) {
        float docWidth = document.getPage(pageIndex).getCropBox().getWidth();
        return width > 0 ? (width / docWidth) : 1.0f;
    }
}
{code}
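If attaching jvisualvm is not convenient, a rough per-page heap readout can also be printed from the test itself. The MemoryLogger helper below is hypothetical (it is not part of the attached TestOriginalFilters/TestModifiedFilters applications); it only wraps the standard MemoryMXBean:

{code:java}
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

// Hypothetical helper, not part of the attached test applications: prints the
// current heap usage so the effect of the filter change can be seen without jvisualvm.
public final class MemoryLogger {

    private static final MemoryMXBean MEMORY = ManagementFactory.getMemoryMXBean();

    private MemoryLogger() {
    }

    public static void log(String label) {
        MemoryUsage heap = MEMORY.getHeapMemoryUsage();
        System.out.println(label + ": used heap = " + heap.getUsed() / (1024 * 1024) + " MB"
                + ", committed = " + heap.getCommitted() / (1024 * 1024) + " MB");
    }
}
{code}

Calling MemoryLogger.log("after page " + j) right after renderer.renderImage(...) in the loop above would show the difference between the two builds without a profiler.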
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org