[ 
https://issues.apache.org/jira/browse/PDFBOX-4542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16844583#comment-16844583
 ] 

Jonathan commented on PDFBOX-4542:
----------------------------------

Yes, I see. Yesterday I managed to implement decryption on demand, but I don't 
particularly like the implementation, and (at least in our case) it doesn't 
yield meaningful performance improvements: the decrypted streams must still be 
written before they can be read, so this merely delays the memory 
allocation. It might still be interesting for people who want to extract 
PDF information that is not contained in streams.
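
To illustrate why on-demand decryption only postpones the allocation, here is 
a minimal sketch. It is not the implementation I wrote for PDFBox: 
LazyDecryptedData and the XOR transform are hypothetical stand-ins, and the 
decryptor parameter merely takes the place of a real security handler.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.util.function.UnaryOperator;

// Hypothetical holder, not a PDFBox class: decrypts its payload on first access only.
final class LazyDecryptedData
{
   private final byte[]                encrypted;
   private final UnaryOperator<byte[]> decryptor;    // stand-in for a real security handler
   private byte[]                      decrypted;    // allocated on first access, then cached
   private int                         decryptCalls;

   LazyDecryptedData(final byte[] encrypted, final UnaryOperator<byte[]> decryptor)
   {
      this.encrypted = encrypted;
      this.decryptor = decryptor;
   }

   synchronized InputStream createInputStream()
   {
      if (decrypted == null)
      {
         // The full decrypted buffer is still allocated here, only later.
         decrypted = decryptor.apply(encrypted);
         decryptCalls++;
      }
      return new ByteArrayInputStream(decrypted);
   }

   synchronized int getDecryptCalls()
   {
      return decryptCalls;
   }

   // Toy XOR transform for demonstration only; this is not a real cipher.
   static byte[] xor(final byte[] in, final byte key)
   {
      final byte[] out = new byte[in.length];
      for (int i = 0; i < in.length; i++)
      {
         out[i] = (byte)(in[i] ^ key);
      }
      return out;
   }
}
```

Streams that are never read are never decrypted, which is where the possible 
gain lies; every stream that is read still costs one full decrypted buffer.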

> Suggestion: Don't load large streams completely into memory, reference them 
> instead
> -----------------------------------------------------------------------------------
>
>                 Key: PDFBOX-4542
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4542
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Parsing, PDModel
>    Affects Versions: 2.0.14
>            Reporter: Jonathan
>            Priority: Minor
>              Labels: Memory, memory, performance
>
> As we processed large PDF files, many of which contain large image 
> streams, we wanted to avoid loading the entire streams into memory. Instead, 
> we implemented a mechanism that merely references their location on disk.
> We eventually did this by subclassing COSStream and overriding 
> COSParser.parseCOSStream(COSDictionary) to conditionally create our stream. 
> Here is the code. It is still a work in progress; I have just 
> refactored the entire mechanism.
> {code:java}
> public class ReferencedCOSStream
>    extends COSStream
> {
>    //~ Instance members ------------------------------------------------------
>    boolean isReference = false;
>    File    reference   = null;
>    long    offset      = -1;
>    long    length      = -1;
>    //~ Constructors ----------------------------------------------------------
>    private ReferencedCOSStream(final ScratchFile scratchFile)
>    {
>       super(scratchFile);
>    }
>    //~ Methods ---------------------------------------------------------------
>    public static ReferencedCOSStream createFromCOSStream(final COSStream stream)
>    {
>       final ReferencedCOSStream out = new ReferencedCOSStream(stream.getScratchFile());
>       for (final Map.Entry<COSName, COSBase> entry : stream.entrySet())
>       {
>          out.setItem(entry.getKey(), entry.getValue());
>       }
>       return out;
>    }
>    @Override
>    public COSInputStream createInputStream(final DecodeOptions options)
>       throws IOException
>    {
>       if (this.isReference)
>       {
>          final InputStream in = new SlicedFileInputStream(this.reference, this.offset, this.length);
>          return COSInputStream.create(getFilterList(), this, in, this.getScratchFile(), options);
>       }
>       else
>       {
>          return super.createInputStream(options);
>       }
>    }
>    @Override
>    public InputStream createRawInputStream()
>       throws IOException
>    {
>       if (this.isReference)
>       {
>          return new SlicedFileInputStream(this.reference, this.offset, this.length);
>       }
>       else
>       {
>          return super.createRawInputStream();
>       }
>    }
>    @Override
>    public OutputStream createOutputStream(final COSBase filters)
>       throws IOException
>    {
>       this.isReference = false;
>       return super.createOutputStream(filters);
>    }
>    @Override
>    public OutputStream createRawOutputStream()
>       throws IOException
>    {
>       this.isReference = false;
>       return super.createRawOutputStream();
>    }
>    public void setReference(final File file,
>                             final long offset,
>                             final long length)
>    {
>       this.isReference = true;
>       this.reference   = file;
>       this.offset      = offset;
>       this.length      = length;
>       this.setLong(COSName.LENGTH, length);
>    }
>    //~ Inner Classes ---------------------------------------------------------
>    private class SlicedFileInputStream
>       extends FileInputStream
>    {
>       //~ Instance members ---------------------------------------------------
>       private long       index;
>       private final long length;
>       //~ Constructors -------------------------------------------------------
>       public SlicedFileInputStream(final File file,
>                                    final long offset,
>                                    final long length)
>          throws FileNotFoundException, IOException
>       {
>          super(file);
>          this.length = length;
>          this.skip(offset);
>          this.index = 0;
>       }
>       //~ Methods ------------------------------------------------------------
>       @Override
>       public int available()
>          throws IOException
>       {
>          final long remaining = length - index;
>          if (remaining < 0)
>          {
>             return 0;
>          }
>          // Clamp instead of casting so slices over 2 GiB do not overflow int.
>          return (int)Math.min(remaining, Integer.MAX_VALUE);
>       }
>       @Override
>       public int read()
>          throws IOException
>       {
>          // Without this override, single-byte reads would run past the slice.
>          if (this.available() <= 0)
>          {
>             return -1;
>          }
>          final int value = super.read();
>          if (value >= 0)
>          {
>             index++;
>          }
>          return value;
>       }
>       @Override
>       public int read(final byte[] b)
>          throws IOException
>       {
>          return this.read(b, 0, b.length);
>       }
>       @Override
>       public int read(final byte[] b,
>                       final int    off,
>                       final int    len)
>          throws IOException
>       {
>          if (len == 0)
>          {
>             return 0;
>          }
>          final int toRead = Math.min(this.available(), len);
>          if (toRead <= 0)
>          {
>             return -1;
>          }
>          // Honour the caller's offset and count only the bytes actually read.
>          final int read = super.read(b, off, toRead);
>          if (read > 0)
>          {
>             index += read;
>          }
>          return read;
>       }
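>       @Override
>       public long skip(final long n)
>          throws IOException
>       {
>          // Advance the index by the number of bytes actually skipped.
>          final long skipped = super.skip(n);
>          index += skipped;
>          return skipped;
>       }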
>       @Override
>       public FileChannel getChannel()
>       {
>          throw new UnsupportedOperationException("Obtaining a FileChannel is not supported because a correct offset cannot be ensured.");
>       }
>    }
> }
> {code}
> {code:java}
>    @Override
>    protected COSStream parseCOSStream(final COSDictionary dic)
>       throws IOException
>    {
>       /*
>        * This needs to be dic.getItem because when we are parsing, the
>        * underlying object might still be null.
>        */
>       final COSNumber streamLengthObj = getLength(dic.getItem(COSName.LENGTH), dic.getCOSName(COSName.TYPE));
>       COSStream       stream          = document.createCOSStream(dic);
>       // read 'stream'; this was already tested in parseObjectsDynamically()
>       readString();
>       skipWhiteSpaces();
>       if (streamLengthObj == null)
>       {
>          if (isLenient)
>          {
>             LOG.warn("The stream doesn't provide any stream length, using fallback readUntilEnd, at offset " + source.getPosition());
>          }
>          else
>          {
>             throw new IOException("Missing length for stream.");
>          }
>       }
>       if ((streamLengthObj != null) && (streamLengthObj.longValue() >= 1024))
>       {
>          final long                streamBegPos = source.getPosition();
>          final ReferencedCOSStream refStream    = ReferencedCOSStream.createFromCOSStream(stream);
>          try
>          {
>             readValidStream(null, streamLengthObj);
>          }
>          finally
>          {
>             stream.setItem(COSName.LENGTH, streamLengthObj);
>          }
>          refStream.setReference(new File(reference), streamBegPos, source.getPosition() - streamBegPos);
>          stream = refStream;
>       }
>       else
>       {
>          try(final OutputStream out = stream.createRawOutputStream())
>          {
>             if ((streamLengthObj != null) && validateStreamLength(streamLengthObj.longValue()))
>             {
>                readValidStream(out, streamLengthObj);
>             }
>             else
>             {
>                readUntilEndStream(new EndstreamOutputStream(out));
>             }
>          }
>          finally
>          {
>             stream.setItem(COSName.LENGTH, streamLengthObj);
>          }
>       }
>       final String endStream = readString();
>       if (endStream.equals("endobj") && isLenient)
>       {
>          LOG.warn("stream ends with 'endobj' instead of 'endstream' at offset " + source.getPosition());
>          // avoid follow-up warning about missing endobj
>          source.rewind(ENDOBJ.length);
>       }
>       else if ((endStream.length() > 9) && isLenient && endStream.substring(0, 9).equals(ENDSTREAM_STRING))
>       {
>          LOG.warn("stream ends with '" + endStream + "' instead of 'endstream' at offset " + source.getPosition());
>          // unread the "extra" bytes
>          source.rewind(endStream.substring(9).getBytes(ISO_8859_1).length);
>       }
>       else if (!endStream.equals(ENDSTREAM_STRING))
>       {
>          throw new IOException("Error reading stream, expected='endstream' actual='" + endStream + "' at offset " + source.getPosition());
>       }
>       return stream;
>    }
> {code}
> The class ReferencedCOSStream exposes the underlying data in exactly the same 
> way as COSStream does, but instead of keeping the data in memory, it 
> opens a FileInputStream each time the content is requested. 
> SlicedFileInputStream extends FileInputStream and restricts reads to the 
> given offset and length, so that it behaves like an InputStream over just 
> this specific chunk of data.
> I needed to expose some APIs for these classes. The method 
> ReferencedCOSStream.createFromCOSStream(COSStream) would be better placed in 
> PDDocument, where it could create the stream directly; I just didn't want to 
> modify PDDocument as well.
> Encrypted streams are currently loaded into memory by the 
> SecurityHandler directly after creation. If you want to accept this proposal, 
> it might make sense to move the decryption handling into COSStream and 
> ReferencedCOSStream as well and perform it on request.
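
For reference, the slicing idea described above can also be sketched without 
subclassing FileInputStream. The following is only an illustration under my 
own assumptions, not the code from the issue: FileSliceInputStream is a 
hypothetical name, and it uses java.io.RandomAccessFile so that the start 
offset is set with seek() rather than skip().

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;

// Hypothetical sketch: exposes bytes [offset, offset + length) of a file as an InputStream.
final class FileSliceInputStream extends InputStream
{
   private final RandomAccessFile file;
   private long                   remaining;   // bytes of the slice not yet consumed

   FileSliceInputStream(final File source, final long offset, final long length)
      throws IOException
   {
      if ((offset < 0) || (length < 0))
      {
         throw new IllegalArgumentException("offset and length must be non-negative");
      }
      this.file = new RandomAccessFile(source, "r");
      this.file.seek(offset);
      this.remaining = length;
   }

   @Override
   public int read() throws IOException
   {
      if (remaining <= 0)
      {
         return -1;
      }
      final int value = file.read();
      if (value >= 0)
      {
         remaining--;
      }
      return value;
   }

   @Override
   public int read(final byte[] b, final int off, final int len) throws IOException
   {
      if (len == 0)
      {
         return 0;
      }
      if (remaining <= 0)
      {
         return -1;
      }
      // Never read past the end of the slice; count only bytes actually read.
      final int read = file.read(b, off, (int)Math.min(len, remaining));
      if (read > 0)
      {
         remaining -= read;
      }
      return read;
   }

   @Override
   public long skip(final long n) throws IOException
   {
      if ((n <= 0) || (remaining <= 0))
      {
         return 0;
      }
      final long toSkip = Math.min(n, remaining);
      file.seek(file.getFilePointer() + toSkip);
      remaining -= toSkip;
      return toSkip;
   }

   @Override
   public int available()
   {
      return (int)Math.min(remaining, Integer.MAX_VALUE);
   }

   @Override
   public void close() throws IOException
   {
      file.close();
   }
}
```

Tracking the remaining byte count in one place keeps read(), skip() and 
available() consistent with each other. If the file is shorter than 
offset + length, reads simply hit the end of the file early and return -1.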
