Re: Using a PositionTracker in EventDeserializers

Santiago Mola Tue, 02 Dec 2014 01:00:48 -0800

2014-12-02 7:04 GMT+01:00 Hari Shreedharan <[email protected]>:


> Wouldn’t the mark and reset be enough? Do you really need access to the
> underlying offsets? The resettable stream already provides mark and reset
>

As far as I know, that is not enough. I'll explain some use cases and maybe
you can suggest a better approach:

XML deserialization
===================

XPath and XQuery parsers require you to parse the whole input document
before traversing it. XPath can actually be streamed, but resuming the
stream from an arbitrary point is far from trivial. As far as I know,
streaming XQuery remains an open research problem without any standard solution.

The solution we have implemented: read the whole document (calling mark() at
the beginning of the resettable stream) and extract all the events at
once. We have to track the index of the last event we returned, so we use a
PositionTracker in which we store the event index, not the stream position.
If the source is re-created after a crash, it starts reading the file
from the beginning again, but resumes returning events from the last event
index stored in the PositionTracker.
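
To make the event-index idea concrete, here is a minimal sketch in Java. It
is not our actual code and it does not use Flume's real PositionTracker or
EventDeserializer interfaces: IndexStore, InMemoryIndexStore and
IndexedEventReader are hypothetical names, and the list of strings stands in
for the events extracted from the fully parsed document.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a durable position store (NOT Flume's API).
interface IndexStore {
    long load();
    void store(long index);
}

// In-memory version for illustration; a real store would persist to disk.
final class InMemoryIndexStore implements IndexStore {
    private long index;
    public long load() { return index; }
    public void store(long index) { this.index = index; }
}

// Serves events extracted from a whole document and resumes by event index.
final class IndexedEventReader {
    private final List<String> events;
    private final IndexStore store;
    private int next; // index of the next event to return

    IndexedEventReader(List<String> allEvents, IndexStore store) {
        this.events = new ArrayList<>(allEvents);
        this.store = store;
        reset(); // after a restart, skip events that were already committed
    }

    // Returns the next event, or null when the document is exhausted.
    String readEvent() {
        return next < events.size() ? events.get(next++) : null;
    }

    // Records the event index (not a byte offset) as the committed position.
    void mark() {
        store.store(next);
    }

    // Rewinds to the last committed event index.
    void reset() {
        next = (int) Math.min(store.load(), events.size());
    }
}
```

With this shape, a reader re-created over the same document and the same
store continues from the first uncommitted event, even though the file itself
is parsed again from the start.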

We're working with similar approaches for some internal projects such as
PDF deserialization.

Decompression
=============

Resuming decompression of a compressed stream at an arbitrary input offset
(the one stored by the resettable stream) is usually not possible. Also, there
is no way to map an arbitrary offset in the decompressed stream to an
offset in the compressed stream. So we apply the same mechanism as in the
previous case (but at the ResettableInputStream level): we use a
ResettableInputStream implementation that wraps another
ResettableInputStream. The DecompressInputStream marks the underlying
ResettableInputStream at 0 and starts decompressing, tracking the offset in
the decompressed stream. If resuming is needed, it starts decompressing
from the beginning, discarding all decompressed data up to the last offset
tracked in the decompressed stream. This approach is suboptimal, since it
requires using a buffer that is always as large as the maximum batch size in
bytes, but it works otherwise.
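
A rough sketch of the resume-by-skipping part, again in Java and again only
an illustration: ResumableGzipReader is a hypothetical name, a byte array
stands in for the wrapped ResettableInputStream, gzip is assumed as the
compression format, and reset() simply decompresses from the start again
instead of using the batch-sized buffer mentioned above.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical reader that tracks offsets in the DECOMPRESSED stream.
final class ResumableGzipReader {
    private final byte[] compressed; // stands in for the underlying stream
    private InputStream in;
    private long offset; // current decompressed offset
    private long marked; // last committed decompressed offset

    ResumableGzipReader(byte[] compressed, long resumeOffset) {
        this.compressed = compressed;
        this.marked = resumeOffset;
        reset();
    }

    // Reads one decompressed byte, or -1 at end of stream.
    int read() {
        try {
            int b = in.read();
            if (b >= 0) offset++;
            return b;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    void mark() { marked = offset; }

    long markedOffset() { return marked; }

    // Restarts decompression at compressed offset 0 and discards
    // decompressed bytes until the marked offset is reached.
    void reset() {
        try {
            in = new GZIPInputStream(new ByteArrayInputStream(compressed));
            offset = 0;
            while (offset < marked && in.read() >= 0) offset++;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Helper to build gzip test data.
    static byte[] gzip(byte[] raw) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
                gz.write(raw);
            }
            return bos.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The cost of resuming is proportional to the marked offset, since everything
before it has to be decompressed again and thrown away.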

Best,
-- 

Santiago M. Mola


<http://www.stratio.com/>
Avenida de Europa, 26. Ática 5. 3ª Planta
28224 Pozuelo de Alarcón, Madrid
Tel: +34 91 352 59 42 // *@stratiobd <https://twitter.com/StratioBD>*
