Re: Using a PositionTracker in EventDeserializers

Hari Shreedharan Thu, 04 Dec 2014 04:35:33 -0800

OK, I think I understand. Can you file a jira and post this info + a design doc 
if you have a design in mind?



Thanks,
Hari

On Tue, Dec 2, 2014 at 1:00 AM, Santiago Mola <[email protected]> wrote:

> 2014-12-02 7:04 GMT+01:00 Hari Shreedharan <[email protected]>:
>> Wouldn’t the mark and reset be enough? Do you really need access to the
>> underlying offsets? The resettable stream already provides mark and reset.
>>
> As far as I know, that is not enough. I'll explain some use cases and maybe
> you can suggest a better approach:
>
> XML deserialization
> ===================
> XPath and XQuery parsers require you to parse the whole input document
> before traversing it. XPath can actually be streamed, but resuming the
> stream from an arbitrary point is far from trivial. As far as I know,
> streaming XQuery remains a research issue without any standard solution.
>
> The solution we have implemented: read the whole document (.mark()'ing at
> the beginning of the resettable stream) and extract all the events at
> once. We have to track the index of the last event we returned, so we use a
> PositionTracker in which we store the event index, not the stream position.
> If the source is re-created after a crash, it will start reading the file
> from the beginning, but it will start returning events from the last event
> index that is stored in the PositionTracker.
> We're working with similar approaches for some internal projects such as
> PDF deserialization.
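
A minimal sketch of the event-index approach described above, not the actual implementation. It assumes a hypothetical IndexStore interface standing in for the PositionTracker, and it uses "one event per line" as a placeholder for the real XPath/XQuery extraction. All class and method names here are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: resume by event index instead of byte offset. All names are hypothetical. */
public class EventIndexResume {

    /** Stand-in for a durable position store (the role PositionTracker plays above). */
    interface IndexStore {
        long load();
        void store(long index);
    }

    /** In-memory store for the demo only; a real one would persist across restarts. */
    static class MemoryStore implements IndexStore {
        private long value;
        public long load() { return value; }
        public void store(long index) { value = index; }
    }

    /** Parses the whole document up front, then serves events from the stored index. */
    static class WholeDocumentDeserializer {
        private final List<String> events;
        private final IndexStore store;
        private int next; // index of the next event to return

        WholeDocumentDeserializer(String document, IndexStore store) {
            this.events = parseAll(document);
            this.store = store;
            // On (re)creation, skip everything that was already committed.
            this.next = (int) Math.min(store.load(), events.size());
        }

        // Placeholder for the real extraction: one event per non-empty line.
        private static List<String> parseAll(String document) {
            List<String> out = new ArrayList<>();
            for (String line : document.split("\n")) {
                if (!line.isEmpty()) {
                    out.add(line);
                }
            }
            return out;
        }

        /** Returns the next event, or null when the document is exhausted. */
        String readEvent() {
            return next < events.size() ? events.get(next++) : null;
        }

        /** Commits the current event index (not a stream offset). */
        void mark() {
            store.store(next);
        }

        /** Rolls back to the last committed event index. */
        void reset() {
            next = (int) Math.min(store.load(), events.size());
        }
    }

    public static void main(String[] args) {
        IndexStore store = new MemoryStore();
        String doc = "a\nb\nc\n";

        WholeDocumentDeserializer first = new WholeDocumentDeserializer(doc, store);
        first.readEvent();
        first.readEvent();
        first.mark(); // two events committed

        // Simulated crash: a new instance re-reads the file from the beginning.
        WholeDocumentDeserializer second = new WholeDocumentDeserializer(doc, store);
        System.out.println(second.readEvent()); // prints "c"
    }
}
```

The cost is that the whole document is parsed again after every restart; only the delivery of events is resumed.
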
>
> Decompression
> =============
> Resuming decompression of a compressed stream at an arbitrary input offset
> (the one stored by the resettable stream) is usually not possible. Also, there
> is no way to map an arbitrary offset in the decompressed stream to an
> offset in the compressed stream. So we apply the same mechanism as in the
> previous case (but at the ResettableInputStream level): we use a
> ResettableInputStream implementation that wraps another
> ResettableInputStream. The DecompressInputStream marks the underlying
> ResettableInputStream at 0 and starts decompressing, tracking the offset in
> the decompressed stream. If resuming is needed, it restarts decompression
> from the beginning and discards everything up to the last offset tracked in
> the decompressed stream. This approach is suboptimal, since it requires
> using a buffer that is always as large as the maximum batch size in bytes,
> but it works otherwise.
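
A rough sketch of the resume path only, assuming gzip via java.util.zip and an in-memory byte array in place of the wrapped ResettableInputStream. The class and method names (SkipResumeDecompress, openAt, readRest) are hypothetical, and the mark/reset buffering mentioned above is not modelled here.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Sketch: resume a gzip stream by re-decompressing from 0 and discarding output. */
public class SkipResumeDecompress {

    /**
     * Reopens the compressed data at compressed offset 0 and discards bytes
     * until the given offset in the decompressed stream is reached.
     */
    static InputStream openAt(byte[] compressed, long decompressedOffset) throws IOException {
        InputStream in = new GZIPInputStream(new ByteArrayInputStream(compressed));
        long remaining = decompressedOffset;
        byte[] scratch = new byte[8192];
        while (remaining > 0) {
            int n = in.read(scratch, 0, (int) Math.min(scratch.length, remaining));
            if (n < 0) {
                throw new IOException("offset is past the end of the decompressed data");
            }
            remaining -= n;
        }
        return in;
    }

    /** Reads the rest of the stream as UTF-8 text (demo helper). */
    static String readRest(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) >= 0) {
            out.write(buf, 0, n);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    /** Compresses text with gzip (demo helper). */
    static byte[] gzip(String text) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buf)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return buf.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] compressed = gzip("event-1\nevent-2\nevent-3\n");
        // Offset in the DECOMPRESSED stream that was committed before the crash.
        long committed = "event-1\nevent-2\n".length();
        try (InputStream in = openAt(compressed, committed)) {
            System.out.print(readRest(in)); // prints "event-3" and a newline
        }
    }
}
```

Resuming this way costs time proportional to the committed offset, since everything before it has to be decompressed again and thrown away.
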
>
> Best,
> -- 
> Santiago M. Mola
> <http://www.stratio.com/>
> Avenida de Europa, 26. Ática 5. 3ª Planta
> 28224 Pozuelo de Alarcón, Madrid
> Tel: +34 91 352 59 42 // *@stratiobd <https://twitter.com/StratioBD>*
