[ https://issues.apache.org/jira/browse/PDFBOX-4542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16844583#comment-16844583 ]
Jonathan commented on PDFBOX-4542:
----------------------------------

Yes, I see. Yesterday I managed to implement decryption on demand, but I
don't particularly like the implementation, and (at least in our case) it
doesn't yield a meaningful performance improvement: the decrypted streams
must be written before they can be read anyway, so this only delays the
memory allocation. It might still be interesting for people who want to
extract PDF information that is not contained in streams.

> Suggestion: Don't load large streams completely into memory, reference them instead
> -----------------------------------------------------------------------------------
>
>                 Key: PDFBOX-4542
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4542
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Parsing, PDModel
>    Affects Versions: 2.0.14
>            Reporter: Jonathan
>            Priority: Minor
>              Labels: Memory, memory, performance
>
> We process large PDF files, many of which contain large image streams, and
> we wanted to avoid loading the entire streams into memory. Instead, we
> implemented a mechanism that merely references their location on disk.
> We eventually did this by subclassing COSStream and then overriding
> COSParser.parseCOSStream(COSDictionary) to conditionally create our stream.
> Here is the code. It is still a work in progress; I've just refactored the
> entire mechanism.
> {code:java}
> import java.io.File;
> import java.io.FileInputStream;
> import java.io.IOException;
> import java.io.InputStream;
> import java.io.OutputStream;
> import java.nio.channels.FileChannel;
> import java.util.Map;
>
> import org.apache.pdfbox.cos.COSBase;
> import org.apache.pdfbox.cos.COSInputStream;
> import org.apache.pdfbox.cos.COSName;
> import org.apache.pdfbox.cos.COSStream;
> import org.apache.pdfbox.filter.DecodeOptions;
> import org.apache.pdfbox.io.ScratchFile;
>
> // Relies on the COSStream/COSInputStream APIs that I had to expose
> // (see the note below the code).
> public class ReferencedCOSStream
>     extends COSStream
> {
>     //~ Instance members ------------------------------------------------------
>
>     boolean isReference = false;
>     File reference = null;
>     long offset = -1;
>     long length = -1;
>
>     //~ Constructors ----------------------------------------------------------
>
>     private ReferencedCOSStream(final ScratchFile scratchFile)
>     {
>         super(scratchFile);
>     }
>
>     //~ Methods ---------------------------------------------------------------
>
>     public static ReferencedCOSStream createFromCOSStream(final COSStream stream)
>     {
>         final ReferencedCOSStream out = new ReferencedCOSStream(stream.getScratchFile());
>         // copy the dictionary entries of the original stream
>         for (final Map.Entry<COSName, COSBase> entry : stream.entrySet())
>         {
>             out.setItem(entry.getKey(), entry.getValue());
>         }
>         return out;
>     }
>
>     @Override
>     public COSInputStream createInputStream(final DecodeOptions options)
>         throws IOException
>     {
>         if (this.isReference)
>         {
>             final InputStream in = new SlicedFileInputStream(this.reference, this.offset, this.length);
>             return COSInputStream.create(getFilterList(), this, in, this.getScratchFile(), options);
>         }
>         else
>         {
>             return super.createInputStream(options);
>         }
>     }
>
>     @Override
>     public InputStream createRawInputStream()
>         throws IOException
>     {
>         if (this.isReference)
>         {
>             return new SlicedFileInputStream(this.reference, this.offset, this.length);
>         }
>         else
>         {
>             return super.createRawInputStream();
>         }
>     }
>
>     @Override
>     public OutputStream createOutputStream(final COSBase filters)
>         throws IOException
>     {
>         // once the stream is written to, it no longer refers to the file
>         this.isReference = false;
>         return super.createOutputStream(filters);
>     }
>
>     @Override
>     public OutputStream createRawOutputStream()
>         throws IOException
>     {
>         this.isReference = false;
>         return super.createRawOutputStream();
>     }
>
>     public void setReference(final File file,
>                              final long offset,
>                              final long length)
>     {
>         this.isReference = true;
>         this.reference = file;
>         this.offset = offset;
>         this.length = length;
>         this.setLong(COSName.LENGTH, length);
>     }
>
>     //~ Inner Classes ---------------------------------------------------------
>
>     private class SlicedFileInputStream
>         extends FileInputStream
>     {
>         //~ Instance members --------------------------------------------------
>
>         // number of bytes of the slice consumed so far
>         private long index;
>         private final long length;
>
>         //~ Constructors ------------------------------------------------------
>
>         public SlicedFileInputStream(final File file,
>                                      final long offset,
>                                      final long length)
>             throws IOException
>         {
>             super(file);
>             this.length = length;
>             // seek with super.skip so that the slice bookkeeping is not touched;
>             // skip may skip fewer bytes than requested, hence the loop
>             long toSkip = offset;
>             while (toSkip > 0)
>             {
>                 final long skipped = super.skip(toSkip);
>                 if (skipped <= 0)
>                 {
>                     throw new IOException("Could not seek to offset " + offset);
>                 }
>                 toSkip -= skipped;
>             }
>             this.index = 0;
>         }
>
>         //~ Methods -----------------------------------------------------------
>
>         @Override
>         public int available()
>             throws IOException
>         {
>             final long remaining = length - index;
>             if (remaining < 0)
>             {
>                 return 0;
>             }
>             // clamp instead of casting, slices may exceed 2 GB
>             return (int)Math.min(remaining, Integer.MAX_VALUE);
>         }
>
>         @Override
>         public int read()
>             throws IOException
>         {
>             // the single-byte read must respect the slice boundary as well
>             if (index >= length)
>             {
>                 return -1;
>             }
>             final int b = super.read();
>             if (b >= 0)
>             {
>                 index++;
>             }
>             return b;
>         }
>
>         @Override
>         public int read(final byte[] b)
>             throws IOException
>         {
>             return this.read(b, 0, b.length);
>         }
>
>         @Override
>         public int read(final byte[] b,
>                         final int off,
>                         final int len)
>             throws IOException
>         {
>             if (len == 0)
>             {
>                 return 0;
>             }
>             final int toRead = Math.min(len, this.available());
>             if (toRead <= 0)
>             {
>                 return -1;
>             }
>             // honour the caller's offset and count only the bytes actually read
>             final int n = super.read(b, off, toRead);
>             if (n > 0)
>             {
>                 index += n;
>             }
>             return n;
>         }
>
>         @Override
>         public long skip(final long n)
>             throws IOException
>         {
>             final long remaining = length - index;
>             if ((n <= 0) || (remaining <= 0))
>             {
>                 return 0;
>             }
>             // never skip past the end of the slice
>             final long skipped = super.skip(Math.min(n, remaining));
>             index += skipped;
>             return skipped;
>         }
>
>         @Override
>         public FileChannel getChannel()
>         {
>             throw new UnsupportedOperationException("Obtaining a FileChannel is not supported because a correct offset cannot be ensured.");
>         }
>     }
> }
> {code}
> {code:java}
> @Override
> protected COSStream parseCOSStream(final COSDictionary dic)
>     throws IOException
> {
>     /*
>      * This needs to be dic.getItem because when we are parsing, the underlying object might still be null.
>      */
>     final COSNumber streamLengthObj = getLength(dic.getItem(COSName.LENGTH), dic.getCOSName(COSName.TYPE));
>     COSStream stream = document.createCOSStream(dic);
>     // read 'stream'; this was already tested in parseObjectsDynamically()
>     readString();
>     skipWhiteSpaces();
>     if (streamLengthObj == null)
>     {
>         if (isLenient)
>         {
>             LOG.warn("The stream doesn't provide any stream length, using fallback readUntilEnd, at offset " + source.getPosition());
>         }
>         else
>         {
>             throw new IOException("Missing length for stream.");
>         }
>     }
>     // streams of at least 1024 bytes are referenced on disk instead of being copied
>     if ((streamLengthObj != null) && (streamLengthObj.longValue() >= 1024))
>     {
>         final long streamBegPos = source.getPosition();
>         final ReferencedCOSStream refStream = ReferencedCOSStream.createFromCOSStream(stream);
>         try
>         {
>             readValidStream(null, streamLengthObj);
>         }
>         finally
>         {
>             stream.setItem(COSName.LENGTH, streamLengthObj);
>         }
>         refStream.setReference(new File(reference), streamBegPos, source.getPosition() - streamBegPos);
>         stream = refStream;
>     }
>     else
>     {
>         try(final OutputStream out = stream.createRawOutputStream())
>         {
>             if ((streamLengthObj != null) && validateStreamLength(streamLengthObj.longValue()))
>             {
>                 readValidStream(out, streamLengthObj);
>             }
>             else
>             {
>                 readUntilEndStream(new EndstreamOutputStream(out));
>             }
>         }
>         finally
>         {
>             stream.setItem(COSName.LENGTH, streamLengthObj);
>         }
>     }
>     final String endStream = readString();
>     if (endStream.equals("endobj") && isLenient)
>     {
>         LOG.warn("stream ends with 'endobj' instead of 'endstream' at offset " + source.getPosition());
>         // avoid follow-up warning about missing endobj
>         source.rewind(ENDOBJ.length);
>     }
>     else if ((endStream.length() > 9) && isLenient && endStream.substring(0, 9).equals(ENDSTREAM_STRING))
>     {
>         LOG.warn("stream ends with '" + endStream + "' instead of 'endstream' at offset " + source.getPosition());
>         // unread the "extra" bytes
>         source.rewind(endStream.substring(9).getBytes(ISO_8859_1).length);
>     }
>     else if (!endStream.equals(ENDSTREAM_STRING))
>     {
>         throw new IOException("Error reading stream, expected='endstream' actual='" + endStream + "' at offset " + source.getPosition());
>     }
>     return stream;
> }
> {code}
> The class ReferencedCOSStream exposes the underlying data in exactly the
> same way as COSStream does, but instead of keeping the content in memory,
> it opens a FileInputStream every time the content is requested.
> SlicedFileInputStream wraps a FileInputStream and imitates the behaviour of
> an InputStream restricted to this specific chunk of the file.
> I needed to expose some APIs for these classes. The method
> ReferencedCOSStream.createFromCOSStream(COSStream) would be better located
> in PDDocument, where it could create the stream directly; I just didn't
> want to modify PDDocument as well.
> Right now, encrypted streams are loaded into memory by the SecurityHandler
> directly after creation. If you want to accept this proposal, it might make
> sense to move the decryption handling into COSStream and
> ReferencedCOSStream as well and perform it on request.
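For anyone who wants to try the slicing idea outside of PDFBox, here is a minimal sketch of the same windowing logic as a wrapper around an arbitrary InputStream. The class name BoundedSliceInputStream is hypothetical and not part of PDFBox or of the patch above; it only illustrates how the offset, the remaining byte count and the end of the slice can be tracked.

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical helper (not PDFBox API): exposes bytes [offset, offset + length) of a stream. */
class BoundedSliceInputStream extends FilterInputStream
{
    // bytes of the slice that have not been consumed yet
    private long remaining;

    BoundedSliceInputStream(final InputStream in, final long offset, final long length)
        throws IOException
    {
        super(in);
        // skip() may skip fewer bytes than requested, so loop until the offset is reached
        long toSkip = offset;
        while (toSkip > 0)
        {
            final long skipped = in.skip(toSkip);
            if (skipped <= 0)
            {
                throw new IOException("Could not skip to offset " + offset);
            }
            toSkip -= skipped;
        }
        this.remaining = length;
    }

    @Override
    public int read() throws IOException
    {
        if (remaining <= 0)
        {
            return -1;
        }
        final int b = in.read();
        if (b >= 0)
        {
            remaining--;
        }
        return b;
    }

    @Override
    public int read(final byte[] b, final int off, final int len) throws IOException
    {
        if (len == 0)
        {
            return 0;
        }
        if (remaining <= 0)
        {
            return -1;
        }
        // never request more than the slice still holds; count only what was actually read
        final int n = in.read(b, off, (int) Math.min(len, remaining));
        if (n > 0)
        {
            remaining -= n;
        }
        return n;
    }

    @Override
    public long skip(final long n) throws IOException
    {
        if (n <= 0 || remaining <= 0)
        {
            return 0;
        }
        final long skipped = in.skip(Math.min(n, remaining));
        remaining -= skipped;
        return skipped;
    }

    @Override
    public int available() throws IOException
    {
        return (int) Math.min(in.available(), Math.max(remaining, 0));
    }

    @Override
    public boolean markSupported()
    {
        // mark/reset would bypass the slice bookkeeping
        return false;
    }
}
```

Wrapping a FileInputStream in this class should give roughly the behaviour SlicedFileInputStream aims for. Since FilterInputStream routes read(byte[]) through read(byte[], int, int), only the three-argument variant needs the boundary check.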