2014-12-02 7:04 GMT+01:00 Hari Shreedharan <[email protected]>:
> Wouldn’t the mark and reset be enough? Do you really need access to the
> underlying offsets? The resettable stream already provides mark and reset
>

As far as I know, that is not enough. I'll explain some use cases and maybe
you can suggest a better approach:

XML deserialization
===================

XPath and XQuery parsers require you to parse the whole input document
before traversing it. XPath can actually be streamed, but resuming the
stream from an arbitrary point is far from trivial. As far as I know,
streaming XQuery remains a research issue without any standard solution.

The solution we have implemented: read the whole document (calling .mark()
at the beginning of the resettable stream) and extract all the events at
once. We have to track the index of the last event we returned, so we use a
PositionTracker in which we store the event index, not the stream position.
If the source is re-created after a crash, it will start reading the file
from the beginning, but it will start returning events from the last event
index stored in the PositionTracker.

We're working with similar approaches for some internal projects, such as
PDF deserialization.

Decompression
=============

Resuming decompression of a compressed stream at an arbitrary input offset
(the one stored by the resettable stream) is usually not possible. Also,
there is no way to map an arbitrary offset in the decompressed stream to an
offset in the compressed stream. So we apply the same mechanism as in the
previous case (but at the ResettableInputStream level): we use a
ResettableInputStream implementation that wraps another
ResettableInputStream. The DecompressInputStream marks the underlying
ResettableInputStream at 0 and starts decompressing, tracking the offset in
the decompressed stream. If resuming is needed, it starts decompressing from
the beginning again, but skips all output up to the last offset tracked in
the decompressed stream.
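
To make the two recovery schemes above concrete, here is a minimal sketch in
Python. It is not Flume code and does not use the ResettableInputStream or
PositionTracker APIs; the function names, the <event> element and the choice
of gzip as the codec are all assumptions made for illustration. In both cases
the persisted position is a logical one (event index or decompressed offset),
and recovery means starting over from byte 0 and skipping up to it.

```python
import gzip
import io
import xml.etree.ElementTree as ET


def events_from(xml_bytes, last_event_index):
    """Parse the whole document, then return only the events not yet delivered.

    last_event_index stands in for the value kept in the position tracker:
    it counts events already returned, not bytes read from the stream.
    """
    root = ET.fromstring(xml_bytes)  # the whole document is parsed up front
    # Hypothetical layout: every record is an <event> element.
    events = [element.text for element in root.iter("event")]
    return events[last_event_index:]


def resume_decompressed(compressed_bytes, decompressed_offset):
    """Restart decompression at byte 0 and discard output up to the offset.

    decompressed_offset is the position tracked in the *decompressed* stream;
    no offset in the compressed input is ever stored or needed.
    """
    stream = gzip.GzipFile(fileobj=io.BytesIO(compressed_bytes))
    remaining = decompressed_offset
    while remaining > 0:
        chunk = stream.read(min(remaining, 8192))
        if not chunk:  # input ended before the tracked offset was reached
            break
        remaining -= len(chunk)
    return stream  # positioned at the tracked decompressed offset


# Usage: resume after two events were delivered, or after four bytes were read.
doc = b"<log><event>a</event><event>b</event><event>c</event></log>"
remaining_events = events_from(doc, 2)
rest = resume_decompressed(gzip.compress(b"0123456789"), 4).read()
```

The cost in both cases is that recovery re-reads everything before the
tracked position.
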

This approach to decompression is suboptimal, since it requires a buffer
that is always as large as the maximum batch size in bytes, but otherwise it
works.

Best,
-- 

Santiago M. Mola
<http://www.stratio.com/>
Avenida de Europa, 26. Ática 5. 3ª Planta
28224 Pozuelo de Alarcón, Madrid
Tel: +34 91 352 59 42 // *@stratiobd <https://twitter.com/StratioBD>*
