[jira] [Resolved] (PDFBOX-2893) Simplify COSStream encoding and decoding

John Hewson (JIRA) Sun, 09 Aug 2015 23:32:35 -0700

     [ 
https://issues.apache.org/jira/browse/PDFBOX-2893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


John Hewson resolved PDFBOX-2893.
---------------------------------
    Resolution: Fixed

> Simplify COSStream encoding and decoding
> ----------------------------------------
>
>                 Key: PDFBOX-2893
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2893
>             Project: PDFBox
>          Issue Type: Improvement
>    Affects Versions: 2.0.0
>            Reporter: John Hewson
>            Assignee: John Hewson
>            Priority: Blocker
>             Fix For: 2.0.0
>
>         Attachments: PDFBOX-2893-2.patch
>
>
> Performance issues and memory usage issues surrounding streams are one of the 
> few things blocking the release of 2.0 (see  PDFBOX-2301, PDFBOX-2882, 
> PDFBOX-2883).
> Though we've managed to reduce some of the memory used by RandomAccessBuffer 
> and to take advantage of buffering of scratch files, we still have problems 
> with the amount of memory which COSStream holds onto. Changes introduced in 
> 2.0 have resulted in COSStreams having a very complex relationship with 
> classes which hold a lot of memory in complex ways (e.g. the fields: 
> tempBuffer, filteredBuffer, unfilteredBuffer, filteredStream, 
> unFilteredStream, scratchFile). Access to scratch file pages in particular 
> does not seem to be well regulated, especially with regards to multithreading 
> (an avenue we'd at least like to leave open).
> Given recent flux, I'm doubtful that we can ship the current API for 
> COSStream w.r.t. RandomAccess without shipping performance issues or flaws 
> which will be unfixable without breaking changes.
> One of the recent changes to COSStream is that it now exposes a RandomAccess, 
> this is so that PDFStreamParser can parse content streams (as well as other 
> subclasses which handle xref and object streams). However, streams are 
> fundamentally not random access - stream filters are sequential. While the 
> consumer of a stream may wish to buffer the data (in memory or scratch) for 
> random access, COSStream itself does not need to expose such an elaborate API 
> - many pieces of gymnastics are performed inside COSStream to present this 
> illusion, at significant cost. We should remove that.
> But what about providing a RandomAccess for PDFStreamParser, 
> PDFObjectStreamParser, and PDFXrefStreamParser? It turns out that those 
> classes don't actually perform random I/O. They perform sequential I/O with a 
> buffer for peek/unread.
> We need to simplify to get 2.0 fast, lean, and maintainable. Here's what I 
> think we should do:
> 1. Split the interfaces for sequential and random I/O
> - Introduce a new SequentialSource interface for sequential I/O, with thin 
> wrappers for RandomAccessRead and InputStream.
> - BaseParser will use SequentialSource rather than RandomAccessRead (this 
> will be inherited by PDFStreamParser, PDFObjectStreamParser, and 
> PDFXrefStreamParser).
> - COSParser will use RandomAccessRead and pass a SequentialSource wrapper to 
> it's superclass, BaseParser.
> 2. Remove RandomAccess APIs from COSStream, expose only InputStream and 
> OutputStream, as we used to do. We can pass an InputStream to PDFStreamParser 
> using a wrapper which implements SequentialSource. This will remove 
> tempBuffer, filteredBuffer, and unfilteredBuffer from COSStream, all of which 
> hold memory.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Resolved] (PDFBOX-2893) Simplify COSStream encoding and decoding

Reply via email to