OK, I think I understand. Can you file a jira and post this info + a design doc if you have a design in mind?
Thanks,
Hari

On Tue, Dec 2, 2014 at 1:00 AM, Santiago Mola <[email protected]> wrote:

> 2014-12-02 7:04 GMT+01:00 Hari Shreedharan <[email protected]>:
>
>> Wouldn’t the mark and reset be enough? Do you really need access to the
>> underlying offsets? The resettable stream already provides mark and reset
>>
>
> As far as I know, that is not enough. I'll explain some use cases and maybe
> you can suggest a better approach:
>
> XML deserialization
> ===================
>
> XPath and XQuery parsers require you to parse the whole input document
> before traversing it. XPath can actually be streamed, but resuming the
> stream from an arbitrary point is far from trivial. As far as I know,
> streaming XQuery remains a research issue without any standard solution.
>
> The solution we have implemented: read the whole document (.mark()'ing at
> the beginning of the resettable stream) and extract all the events at
> once. We have to track the index of the last event we returned, so we use a
> PositionTracker where we actually store the event index, not the stream
> position. If the source is re-created after a crash, it will start reading
> the file from the beginning, but it will start returning events from the
> last event index that is stored in the PositionTracker.
>
> We're working with similar approaches for some internal projects such as
> PDF deserialization.
>
> Decompression
> =============
>
> Resuming decompression of a compressed stream at an arbitrary input offset
> (the one stored by the resettable stream) is usually not possible. Also,
> there is no way to map an arbitrary offset in the decompressed stream to an
> offset in the compressed stream. So we apply the same mechanism as in the
> previous case (but at the ResettableInputStream level): we use a
> ResettableInputStream implementation that wraps another
> ResettableInputStream. The DecompressInputStream marks the underlying
> ResettableInputStream at 0 and starts decompressing, tracking the offset in
> the decompressed stream. If resuming is needed, it starts decompressing
> from the beginning, but skips all input until the last offset tracked in
> the decompressed stream. This approach is suboptimal, since it requires
> using a buffer that is always as large as the maximum batch size in bytes,
> but it works otherwise.
>
> Best,
> --
> Santiago M. Mola
> <http://www.stratio.com/>
> Avenida de Europa, 26. Ática 5. 3ª Planta
> 28224 Pozuelo de Alarcón, Madrid
> Tel: +34 91 352 59 42 // *@stratiobd <https://twitter.com/StratioBD>*
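
The resume strategy from the Decompression section above can be sketched roughly as follows. This is only an illustration, not the actual DecompressInputStream: it assumes gzip via java.util.zip, keeps the compressed data in a byte array instead of wrapping a ResettableInputStream, and uses a hypothetical OffsetTracker class as a stand-in for a durable position tracker. The idea is the same, though: always reopen the compressed source at byte 0 and discard decompressed bytes up to the last tracked offset.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class ReplayResumeSketch {

    /** Hypothetical stand-in for a durable tracker; it stores a DECOMPRESSED offset. */
    static final class OffsetTracker {
        private long committed;
        long get() { return committed; }
        void commit(long offset) { committed = offset; }
    }

    /**
     * Reopens the compressed source from byte 0 and discards the decompressed
     * bytes that were already consumed, since the compressed stream cannot be
     * entered at an arbitrary offset.
     */
    static InputStream openAt(byte[] compressed, long decompressedOffset) throws IOException {
        InputStream in = new GZIPInputStream(new ByteArrayInputStream(compressed));
        byte[] scratch = new byte[4096];
        long remaining = decompressedOffset;
        while (remaining > 0) {
            int n = in.read(scratch, 0, (int) Math.min(scratch.length, remaining));
            if (n < 0) {
                throw new IOException("tracked offset is past the end of the stream");
            }
            remaining -= n;
        }
        return in;
    }

    /** Helper for the demo: gzip a string into a byte array. */
    static byte[] gzip(String text) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(bos)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return bos.toByteArray();
    }

    /** Helper for the demo: read the rest of a stream as UTF-8 text. */
    static String readFully(InputStream in) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        int n;
        while ((n = in.read(buf)) != -1) {
            bos.write(buf, 0, n);
        }
        return bos.toString("UTF-8");
    }

    public static void main(String[] args) throws IOException {
        byte[] compressed = gzip("event-1\nevent-2\nevent-3\n");

        // Pretend the first event ("event-1\n", 8 bytes) was delivered before a crash.
        OffsetTracker tracker = new OffsetTracker();
        tracker.commit(8);

        // After the restart: decompress from the start, skip to the tracked offset.
        try (InputStream resumed = openAt(compressed, tracker.get())) {
            System.out.print(readFully(resumed));
        }
    }
}
```

The XML case follows the same pattern with an event index in place of a byte offset: re-parse the whole document, then skip events until the stored index is reached. In both cases the cost of a resume grows with how far into the input the source had progressed.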
