[ 
https://issues.apache.org/jira/browse/PDFBOX-2893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Hewson updated PDFBOX-2893:
--------------------------------
    Description: 
Performance and memory usage issues surrounding streams are among the few 
things blocking the release of 2.0 (see PDFBOX-2301, PDFBOX-2882, 
PDFBOX-2883).

Though we've managed to reduce some of the memory used by RandomAccessBuffer 
and to take advantage of buffering of scratch files, we still have problems 
with the amount of memory which COSStream holds onto. Changes introduced in 2.0 
have resulted in COSStreams having a very complex relationship with classes 
which hold a lot of memory in complex ways (e.g. the fields: tempBuffer, 
filteredBuffer, unfilteredBuffer, filteredStream, unFilteredStream, 
scratchFile). Access to scratch file pages in particular does not seem to be 
well regulated, especially with regard to multithreading (an avenue we'd at 
least like to leave open).

Given recent flux, I'm doubtful that we can ship the current API for COSStream 
w.r.t. RandomAccess without shipping performance issues or flaws which will be 
unfixable without breaking changes.

One of the recent changes to COSStream is that it now exposes a RandomAccess 
so that PDFStreamParser can parse content streams (as can the other BaseParser 
subclasses, which handle xref and object streams). However, streams are 
fundamentally not random access - stream filters are sequential. While the 
consumer of a stream may wish to buffer the data (in memory or scratch) for 
random access, COSStream itself does not need to expose such an elaborate API - 
many pieces of gymnastics are performed inside COSStream to present this 
illusion, at significant cost. We should remove that.

But what about providing a RandomAccess for PDFStreamParser, 
PDFObjectStreamParser, and PDFXrefStreamParser? It turns out that those classes 
don't actually perform random I/O. They perform sequential I/O with a buffer 
for peek/unread.
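
To illustrate what "sequential I/O with a buffer for peek/unread" means, here 
is a minimal sketch using only the JDK's PushbackInputStream. It is not PDFBox 
code; the class name PeekUnreadDemo and the method readDigits are made up for 
this example. The reader consumes bytes strictly in order and pushes back the 
one byte of lookahead that ended the token, so no seeking is needed.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;

// Illustrative only; PeekUnreadDemo and readDigits are hypothetical names.
final class PeekUnreadDemo
{
    // Reads a run of ASCII digits and stops at the first non-digit byte.
    static String readDigits(PushbackInputStream in) throws IOException
    {
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = in.read()) != -1)
        {
            if (c < '0' || c > '9')
            {
                // Push the lookahead byte back so the next token starts with it.
                in.unread(c);
                break;
            }
            sb.append((char) c);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException
    {
        byte[] data = "612 792 re".getBytes(StandardCharsets.US_ASCII);
        // A one-byte pushback buffer is enough for single-byte lookahead.
        PushbackInputStream in = new PushbackInputStream(new ByteArrayInputStream(data), 1);
        System.out.println(readDigits(in));
        // The space that terminated the number is still available to read.
        System.out.println(in.read() == ' ');
    }
}
```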

We need to simplify to get 2.0 fast, lean, and maintainable. Here's what I 
think we should do:

1. Split the interfaces for sequential and random I/O
- Introduce a new SequentialSource interface for sequential I/O, with wrappers 
for RandomAccessRead and InputStream.
- BaseParser will use SequentialSource rather than RandomAccessRead (this will 
be inherited by PDFStreamParser, PDFObjectStreamParser, and 
PDFXrefStreamParser).
- COSParser will use RandomAccessRead and pass a SequentialSource wrapper to 
its superclass, BaseParser.

2. Remove RandomAccess APIs from COSStream, expose only InputStream and 
OutputStream, as we used to do. We can pass an InputStream to PDFStreamParser 
using a wrapper which implements SequentialSource.
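
As a rough sketch of the proposal above, the interface and its InputStream 
wrapper might look something like the following. The interface name 
SequentialSource comes from this issue, but its methods (read, peek, unread), 
the wrapper class name InputStreamSource, and the pushback buffer size are 
assumptions made for illustration; the final API may well differ.

```java
import java.io.Closeable;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;

// Hypothetical shape of the proposed interface; the method set is an assumption.
interface SequentialSource extends Closeable
{
    // Returns the next byte, or -1 at end of input.
    int read() throws IOException;

    // Returns the next byte without consuming it, or -1 at end of input.
    int peek() throws IOException;

    // Pushes one byte back so that the next read() returns it.
    void unread(int b) throws IOException;
}

// Hypothetical wrapper adapting any InputStream to SequentialSource.
final class InputStreamSource implements SequentialSource
{
    // Assumed limit on how many bytes may be pushed back at once.
    private static final int PUSHBACK_SIZE = 16;

    private final PushbackInputStream in;

    InputStreamSource(InputStream input)
    {
        this.in = new PushbackInputStream(input, PUSHBACK_SIZE);
    }

    @Override
    public int read() throws IOException
    {
        return in.read();
    }

    @Override
    public int peek() throws IOException
    {
        int b = in.read();
        if (b != -1)
        {
            in.unread(b);
        }
        return b;
    }

    @Override
    public void unread(int b) throws IOException
    {
        in.unread(b);
    }

    @Override
    public void close() throws IOException
    {
        in.close();
    }
}
```

With something like this, a parser would only ever see SequentialSource, and 
the decision to buffer for random access would stay with the caller rather 
than inside COSStream.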


  was:
Performance and memory usage issues surrounding streams are among the few 
things blocking the release of 2.0 (see PDFBOX-2301, PDFBOX-2882, 
PDFBOX-2883).

Though we've managed to reduce some of the memory used by RandomAccessBuffer 
and to take advantage of buffering of scratch files, we still have problems 
with the amount of memory which COSStream holds onto. Changes introduced in 2.0 
have resulted in COSStreams having a very complex relationship with classes 
which hold a lot of memory in complex ways. Access to scratch file pages in 
particular does not seem to be well regulated, especially with regard to 
multithreading (an avenue we'd at least like to leave open).

Given recent flux, I'm doubtful that we can ship the current API for COSStream 
w.r.t. RandomAccess without shipping performance issues or flaws which will be 
unfixable without breaking changes.

One of the recent changes to COSStream is that it now exposes a RandomAccess 
so that PDFStreamParser can parse content streams (as can the other BaseParser 
subclasses, which handle xref and object streams). However, streams are 
fundamentally not random access - stream filters are sequential. While the 
consumer of a stream may wish to buffer the data (in memory or scratch) for 
random access, COSStream itself does not need to expose such an elaborate API - 
many pieces of gymnastics are performed inside COSStream to present this 
illusion, at significant cost. We should remove that.

But what about providing a RandomAccess for PDFStreamParser, 
PDFObjectStreamParser, and PDFXrefStreamParser? It turns out that those classes 
don't actually perform random I/O. They perform sequential I/O with a buffer 
for peek/unread.

We need to simplify to get 2.0 fast, lean, and maintainable. Here's what I 
think we should do:

1. Split the interfaces for sequential and random I/O
- Introduce a new SequentialSource interface for sequential I/O, with wrappers 
for RandomAccessRead and InputStream.
- BaseParser will use SequentialSource rather than RandomAccessRead (this will 
be inherited by PDFStreamParser, PDFObjectStreamParser, and 
PDFXrefStreamParser).
- COSParser will use RandomAccessRead and pass a SequentialSource wrapper to 
its superclass, BaseParser.

2. Remove RandomAccess APIs from COSStream, expose only InputStream and 
OutputStream, as we used to do. We can pass an InputStream to PDFStreamParser 
using a wrapper which implements SequentialSource.



> Simplify COSStream encoding and decoding
> ----------------------------------------
>
>                 Key: PDFBOX-2893
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2893
>             Project: PDFBox
>          Issue Type: Improvement
>    Affects Versions: 2.0.0
>            Reporter: John Hewson
>            Assignee: John Hewson
>            Priority: Blocker
>             Fix For: 2.0.0



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
