[
https://issues.apache.org/jira/browse/PDFBOX-2893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
John Hewson resolved PDFBOX-2893.
---------------------------------
Resolution: Fixed
> Simplify COSStream encoding and decoding
> ----------------------------------------
>
> Key: PDFBOX-2893
> URL: https://issues.apache.org/jira/browse/PDFBOX-2893
> Project: PDFBox
> Issue Type: Improvement
> Affects Versions: 2.0.0
> Reporter: John Hewson
> Assignee: John Hewson
> Priority: Blocker
> Fix For: 2.0.0
>
> Attachments: PDFBOX-2893-2.patch
>
>
> Performance and memory usage issues surrounding streams are among the few
> things blocking the release of 2.0 (see PDFBOX-2301, PDFBOX-2882,
> PDFBOX-2883).
> Though we've managed to reduce some of the memory used by RandomAccessBuffer
> and to take advantage of buffering of scratch files, we still have problems
> with the amount of memory which COSStream holds onto. Changes introduced in
> 2.0 have resulted in COSStreams having a very complex relationship with
> classes which hold a lot of memory in complex ways (e.g. the fields:
> tempBuffer, filteredBuffer, unfilteredBuffer, filteredStream,
> unFilteredStream, scratchFile). Access to scratch file pages in particular
> does not seem to be well regulated, especially with regard to multithreading
> (an avenue we'd at least like to leave open).
> Given recent flux, I'm doubtful that we can ship the current API for
> COSStream w.r.t. RandomAccess without shipping performance issues or flaws
> which will be unfixable without breaking changes.
> One of the recent changes to COSStream is that it now exposes a RandomAccess.
> This is so that PDFStreamParser can parse content streams (as well as other
> subclasses which handle xref and object streams). However, streams are
> fundamentally not random access - stream filters are sequential. While the
> consumer of a stream may wish to buffer the data (in memory or scratch) for
> random access, COSStream itself does not need to expose such an elaborate API
> - many pieces of gymnastics are performed inside COSStream to present this
> illusion, at significant cost. We should remove that.
> But what about providing a RandomAccess for PDFStreamParser,
> PDFObjectStreamParser, and PDFXrefStreamParser? It turns out that those
> classes don't actually perform random I/O. They perform sequential I/O with a
> buffer for peek/unread.
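
To make "sequential I/O with a buffer for peek/unread" concrete, here is a minimal sketch built on the JDK's java.io.PushbackInputStream. It is not PDFBox code: the class PeekDemo and its helper methods are hypothetical names made up for illustration. It only shows that one-byte look-ahead needs no random access.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;

// Illustration only: sequential reading with look-ahead, using the JDK's
// PushbackInputStream. PeekDemo is a hypothetical name, not a PDFBox class.
class PeekDemo {
    // Returns the next byte without consuming it, or -1 at end of stream.
    static int peek(PushbackInputStream in) throws IOException {
        int b = in.read();
        if (b != -1) {
            in.unread(b); // push the byte back so the next read() sees it again
        }
        return b;
    }

    // Reads a run of ASCII digits, using peek to stop at the first non-digit.
    static String readDigits(PushbackInputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        int b;
        while ((b = peek(in)) >= '0' && b <= '9') {
            sb.append((char) in.read());
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "123 0 obj".getBytes(StandardCharsets.US_ASCII);
        try (PushbackInputStream in =
                 new PushbackInputStream(new ByteArrayInputStream(data), 8)) {
            System.out.println(readDigits(in)); // prints "123"
        }
    }
}
```

The stream is only ever read forwards; the pushback buffer is the sole state needed to support peeking.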
> We need to simplify to get 2.0 fast, lean, and maintainable. Here's what I
> think we should do:
> 1. Split the interfaces for sequential and random I/O
> - Introduce a new SequentialSource interface for sequential I/O, with thin
> wrappers for RandomAccessRead and InputStream.
> - BaseParser will use SequentialSource rather than RandomAccessRead (this
> will be inherited by PDFStreamParser, PDFObjectStreamParser, and
> PDFXrefStreamParser).
> - COSParser will use RandomAccessRead and pass a SequentialSource wrapper to
> its superclass, BaseParser.
> 2. Remove RandomAccess APIs from COSStream, expose only InputStream and
> OutputStream, as we used to do. We can pass an InputStream to PDFStreamParser
> using a wrapper which implements SequentialSource. This will remove
> tempBuffer, filteredBuffer, and unfilteredBuffer from COSStream, all of which
> hold memory.
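
As a rough sketch of what the proposal above might look like, the following shows a possible SequentialSource interface and a thin InputStream wrapper. The issue only names the interface. The method set (read, peek, unread, getPosition), the wrapper name InputStreamSequentialSource, and the demo class are all assumptions made for illustration, and may not match what was committed to PDFBox.

```java
import java.io.ByteArrayInputStream;
import java.io.Closeable;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical shape of the proposed interface; the method names are assumptions.
interface SequentialSource extends Closeable {
    int read() throws IOException;          // next byte, or -1 at end of input
    int peek() throws IOException;          // next byte without consuming it
    void unread(int b) throws IOException;  // push one byte back
    long getPosition() throws IOException;  // number of bytes consumed so far
}

// Thin wrapper adapting any InputStream; the class name is illustrative only.
final class InputStreamSequentialSource implements SequentialSource {
    private final PushbackInputStream in;
    private long position;

    InputStreamSequentialSource(InputStream input) {
        this.in = new PushbackInputStream(input, 32);
    }

    @Override public int read() throws IOException {
        int b = in.read();
        if (b != -1) position++;
        return b;
    }

    @Override public int peek() throws IOException {
        int b = in.read();
        if (b != -1) in.unread(b);
        return b;
    }

    @Override public void unread(int b) throws IOException {
        in.unread(b);
        position--;
    }

    @Override public long getPosition() { return position; }

    @Override public void close() throws IOException { in.close(); }
}

class SequentialSourceDemo {
    // Reads one space-delimited token using only sequential operations.
    static String readToken(SequentialSource src) throws IOException {
        while (src.peek() == ' ') src.read(); // skip leading spaces
        StringBuilder sb = new StringBuilder();
        int b;
        while ((b = src.peek()) != -1 && b != ' ') {
            sb.append((char) src.read());
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        byte[] content = "q 1 0 0 1 10 20 cm".getBytes(StandardCharsets.US_ASCII);
        try (SequentialSource src =
                 new InputStreamSequentialSource(new ByteArrayInputStream(content))) {
            System.out.println(readToken(src)); // prints "q"
        }
    }
}
```

With this split, a parser written against SequentialSource never needs to know whether its bytes come from a decoded stream or from a random access file, so COSStream would only have to hand out a plain InputStream.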