[ https://issues.apache.org/jira/browse/PDFBOX-4542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16843743#comment-16843743 ]
Jonathan edited comment on PDFBOX-4542 at 5/20/19 7:52 AM: ----------------------------------------------------------- I do think "parse on demand" is preferable, but I reckon it's a much steeper step than just lazily loading streams. What do you think about implementing this as "opt-in"? In order to work this requires a file path to the pdf being parsed anyway, so one could implement that if this reference is set explicity, then the referenced streams will be used, otherwise it would just use the default behaviour. This change doesn't solve all problems concerning memory consumption, depending on the pdf the maps from the dictionaries can also play a large role in memory consumption. But in particular for simple image-container-pdfs, such as scans, this has a large impact . Regarding encryption, I thought about just passing the handler to the stream, so then we'd also need to pass object number and generation. I didn't find the code regarding whether the object has been decypted before. I still do think it'd be possible, but there is no harm in just ignoring this for now. was (Author: rahn2): I do think "parse on demand" is preferable, but I reckon it's a much steeper step than just lazily loading streams. What do you think about implementing this as "opt-in"? This requires a file path to the pdf being parsed anyway, so one could implement that if this reference is set explicity, then it will use the referenced streams, otherwise it would just use the default behaviour. This change doesn't solve all problems concerning memory consumption, depending on the pdf the Maps from the dictionaries can also be a large part of memory consumption. But in particular for simple image-container-pdfs, such as scans, this has a large impact . Regarding encryption, I thought about just passing the handler to the stream, so then we'd also need to parse object number and generation. I didn't find the code regarding whether the object has been decypted before. I still do think it'd be possible, but there is no harm in just ignoring this for now. > Suggestion: Don't load large streams completely into memory, reference them > instead > ----------------------------------------------------------------------------------- > > Key: PDFBOX-4542 > URL: https://issues.apache.org/jira/browse/PDFBOX-4542 > Project: PDFBox > Issue Type: Improvement > Components: Parsing, PDModel > Affects Versions: 2.0.14 > Reporter: Jonathan > Priority: Minor > Labels: Memory, memory, performance > > As we processed large PDF files, many of which containing large image > streams, we wanted to avoid loading the entire streams into memory. Instead, > we implemented a mechanism that merely referenced their location on disk. > We eventually did this by subclassing COSStream, and then overriding > COSParser.parseCOSStream(COSDictionary) to conditionally create our stream. > Here is the code, this is currently still a work-in-progress. I've just > refactored the entire mechanism. > {code:java} > public class ReferencedCOSStream > extends COSStream > { > //~ Instance members > ------------------------------------------------------------------------------------------------------------------------------ > boolean isReference = false; > File reference = null; > long offset = -1; > long length = -1; > //~ Constructors > ---------------------------------------------------------------------------------------------------------------------------------- > private ReferencedCOSStream(final ScratchFile scratchFile) > { > super(scratchFile); > } > //~ Methods > --------------------------------------------------------------------------------------------------------------------------------------- > public static ReferencedCOSStream createFromCOSStream(final COSStream > stream) > { > final ReferencedCOSStream out = new > ReferencedCOSStream(stream.getScratchFile()); > for (final Map.Entry<COSName, COSBase> entry : stream.entrySet()) > { > out.setItem(entry.getKey(), entry.getValue()); > } > return out; > } > @Override > public COSInputStream createInputStream(final DecodeOptions options) > throws IOException > { > if (this.isReference) > { > final InputStream in = new SlicedFileInputStream(this.reference, > this.offset, this.length); > return COSInputStream.create(getFilterList(), this, in, > this.getScratchFile(), options); > } > else > { > return super.createInputStream(options); > } > } > @Override > public InputStream createRawInputStream() > throws IOException > { > if (this.isReference) > { > return new SlicedFileInputStream(this.reference, this.offset, > this.length); > } > else > { > return super.createRawInputStream(); > } > } > @Override > public OutputStream createOutputStream(final COSBase filters) > throws IOException > { > this.isReference = false; > return super.createOutputStream(filters); > } > @Override > public OutputStream createRawOutputStream() > throws IOException > { > this.isReference = false; > return super.createRawOutputStream(); > } > public void setReference(final File file, > final long offset, > final long length) > { > this.isReference = true; > this.reference = file; > this.offset = offset; > this.length = length; > this.setLong(COSName.LENGTH, length); > } > //~ Inner Classes > --------------------------------------------------------------------------------------------------------------------------------- > private class SlicedFileInputStream > extends FileInputStream > { > //~ Instance members > --------------------------------------------------------------------------------------------------------------------------- > private long index; > private final long length; > //~ Constructors > ------------------------------------------------------------------------------------------------------------------------------- > public SlicedFileInputStream(final File file, > final long offset, > final long length) > throws FileNotFoundException, IOException > { > super(file); > this.length = length; > this.skip(offset); > this.index = 0; > } > //~ Methods > ------------------------------------------------------------------------------------------------------------------------------------ > @Override > public int available() > throws IOException > { > final long remaining = length - index; > if (remaining < 0) > { > return 0; > } > return (int)remaining; > } > @Override > public int read(final byte[] b) > throws IOException > { > final int remaining = this.available(); > final int len = (remaining < b.length) ? remaining : b.length; > index += len; > if (len > 0) > { > return super.read(b, 0, len); > } > else > { > return -1; > } > } > @Override > public int read(final byte[] b, > final int off, > int len) > throws IOException > { > final int remaining = this.available(); > len = (remaining < len) ? remaining : len; > index += len; > if (len > 0) > { > return super.read(b, 0, len); > } > else > { > return -1; > } > } > @Override > public long skip(final long n) > throws IOException > { > index += n; > return super.skip(n); > } > @Override > public FileChannel getChannel() > { > throw new UnsupportedOperationException("Obtaining a FileChannel is > not supported because a correct offset cannot be ensured."); > } > } > } > {code} > {code:java} > @Override > protected COSStream parseCOSStream(final COSDictionary dic) > throws IOException > { > /* > * This needs to be dic.getItem because when we are parsing, the > underlying object might still be null. > */ > final COSNumber streamLengthObj = > getLength(dic.getItem(COSName.LENGTH), dic.getCOSName(COSName.TYPE)); > COSStream stream = document.createCOSStream(dic); > // read 'stream'; this was already tested in parseObjectsDynamically() > readString(); > skipWhiteSpaces(); > if (streamLengthObj == null) > { > if (isLenient) > { > LOG.warn("The stream doesn't provide any stream length, using > fallback readUntilEnd, at offset " + source.getPosition()); > } > else > { > throw new IOException("Missing length for stream."); > } > } > if ((streamLengthObj != null) && (streamLengthObj.longValue() >= 1024)) > { > final long streamBegPos = source.getPosition(); > final ReferencedCOSStream refStream = > ReferencedCOSStream.createFromCOSStream(stream); > try > { > readValidStream(null, streamLengthObj); > } > finally > { > stream.setItem(COSName.LENGTH, streamLengthObj); > } > refStream.setReference(new File(reference), streamBegPos, > source.getPosition() - streamBegPos); > stream = refStream; > } > else > { > try(final OutputStream out = stream.createRawOutputStream()) > { > if ((streamLengthObj != null) && > validateStreamLength(streamLengthObj.longValue())) > { > readValidStream(out, streamLengthObj); > } > else > { > readUntilEndStream(new EndstreamOutputStream(out)); > } > } > finally > { > stream.setItem(COSName.LENGTH, streamLengthObj); > } > } > final String endStream = readString(); > if (endStream.equals("endobj") && isLenient) > { > LOG.warn("stream ends with 'endobj' instead of 'endstream' at offset > " + source.getPosition()); > // avoid follow-up warning about missing endobj > source.rewind(ENDOBJ.length); > } > else if ((endStream.length() > 9) && isLenient && > endStream.substring(0, 9).equals(ENDSTREAM_STRING)) > { > LOG.warn("stream ends with '" + endStream + "' instead of > 'endstream' at offset " + source.getPosition()); > // unread the "extra" bytes > source.rewind(endStream.substring(9).getBytes(ISO_8859_1).length); > } > else if (!endStream.equals(ENDSTREAM_STRING)) > { > throw new IOException("Error reading stream, expected='endstream' > actual='" + endStream + "' at offset " + source.getPosition()); > } > return stream; > } > {code} > The class ReferencedCOSStream exposes the underlying data in exactly the same > way as it does COSStream, but instead of keeping the storage in memory, it > always opens a FileInputStream to retrieve the content. SlicedFileInputStream > basically wraps around a FileInputStream and tries to imitate the behaviour > of an InputStream for this specific chunk of data. > I needed to expose some APIs for these classes, the method > ReferencedCOSStream.createFromCOSStream(COSStream) would better be located in > PDDocument and create the stream directly, I just didn't want to also modify > PDDocument. > Right now, encrypted streams are currently loaded into memory by the > SecurityHandler directly after creation. If you want to accept this proposal, > it might make sense to move the decryption handling also into COSStream and > ReferencedCOSStream and perform it upon request. -- This message was sent by Atlassian JIRA (v7.6.3#76005) --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org For additional commands, e-mail: dev-h...@pdfbox.apache.org