Wouldn’t mark and reset be enough? Do you really need access to the underlying offsets? The resettable stream already provides mark and reset.
Thanks,
Hari

On Fri, Nov 28, 2014 at 1:14 AM, Santiago Mola <[email protected]> wrote:

> Hi,
>
> We have a recurring need for Flume deserializers that go beyond line or
> blob. Some examples are XML deserialization where events are generated with
> XPath/XQuery expressions, parsers for XLS, PDF, etc.
>
> There is no proper solution in Flume for these use cases. A significant
> number of our projects required workarounds for this, such as an external
> preprocessing or postprocessing step.
>
> So we have explored the following solutions to the problem:
>
> - Using BlobDeserializer and then using an interceptor (1 to N events) to
> perform the transformation. This is currently not possible since an
> interceptor must output 0 or 1 event for each input event. This was brought
> up in this mailing list a long time ago [1] but it seems no one came up with
> a viable solution.
>
> - Implementing an EventDeserializer. We have done this in some cases with
> different degrees of success. For example, with an XML deserializer with
> XPath [2]. The main limitation of this approach is the lack of a common
> method for position tracking at the deserializer level. Currently, Flume's
> core has a PositionTracker at the Source/InputStream level, which tracks
> the input offset. LineDeserializer and BlobDeserializer rely on the
> assumption that events can be mapped to an input offset (i.e. an event can
> be created by reading only from a given input offset). This assumption is
> not valid for more complex use cases (e.g. can't produce events without
> reading file headers). This can be solved by using a second PositionTracker
> at the deserializer level. Here's a commit with a possible implementation
> of this approach [3].
>
> Do you think this is a problem worth solving in Flume? If yes, what would
> be the best approach?
>
> [1]
> http://mail-archives.apache.org/mod_mbox/flume-dev/201208.mbox/%3CCABCB9rJ0-puRp1FfPfvyfO41wnMgUh=tifcpgufwxbnyv_p...@mail.gmail.com%3E
> [2]
> https://github.com/Stratio/flume-ingestion/tree/develop/stratio-deserializers/stratio-xmlxpath-deserializer
> [3]
> https://github.com/Stratio/flume/commit/a6fac7247b7fc48dec5dc3ab4c658ab4e5c0e753
>
> Best,
> --
> Santiago M. Mola
> <http://www.stratio.com/>
> Avenida de Europa, 26. Ática 5. 3ª Planta
> 28224 Pozuelo de Alarcón, Madrid
> Tel: +34 91 352 59 42 // *@stratiobd <https://twitter.com/StratioBD>*
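
To make the header-dependence point from the quoted message concrete, here is a rough sketch of what "position tracking at the deserializer level" could mean. It is not the code from commit [3] and it does not use any Flume API: the class names (EventIndexTracker, HeaderAwareDeserializer) and the CSV-like input are assumptions made up for illustration. The deserializer's position is the number of events committed, not a byte offset, so after a restart it can re-read the header and skip the events that were already delivered.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch only: none of these names are Flume classes.
class HeaderAwareDeserializerSketch {

    /** Deserializer-level position: how many events have been committed. */
    static final class EventIndexTracker {
        private long committed;
        long getCommitted() { return committed; }
        void commit(long eventsRead) { committed = eventsRead; }
    }

    /** Reads a CSV-like input whose first line is a header that every event needs. */
    static final class HeaderAwareDeserializer {
        private final BufferedReader reader;
        private final EventIndexTracker tracker;
        private final String[] header;
        private long eventsRead;

        HeaderAwareDeserializer(String input, EventIndexTracker tracker) throws IOException {
            this.reader = new BufferedReader(new StringReader(input));
            this.tracker = tracker;
            // The header is always re-read, even when resuming mid-file.
            String headerLine = reader.readLine();
            this.header = headerLine == null ? new String[0] : headerLine.split(",");
            // Skip the records that were committed before the restart.
            while (eventsRead < tracker.getCommitted() && reader.readLine() != null) {
                eventsRead++;
            }
        }

        /** Returns the next event as "col=value;col=value", or null at end of input. */
        String readEvent() throws IOException {
            String line = reader.readLine();
            if (line == null) {
                return null;
            }
            eventsRead++;
            String[] values = line.split(",");
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < header.length && i < values.length; i++) {
                if (i > 0) {
                    sb.append(';');
                }
                sb.append(header[i]).append('=').append(values[i]);
            }
            return sb.toString();
        }

        /** Records everything read so far as committed. */
        void mark() { tracker.commit(eventsRead); }
    }

    /** Simulates a crash after one committed event, then a resume with the same tracker. */
    static List<String> demo() {
        String input = "id,name\n1,a\n2,b\n3,c\n";
        EventIndexTracker tracker = new EventIndexTracker();
        List<String> delivered = new ArrayList<>();
        try {
            HeaderAwareDeserializer first = new HeaderAwareDeserializer(input, tracker);
            delivered.add(first.readEvent()); // delivered downstream
            first.mark();                     // committed
            first.readEvent();                // read but never committed: simulated crash

            HeaderAwareDeserializer second = new HeaderAwareDeserializer(input, tracker);
            String event;
            while ((event = second.readEvent()) != null) {
                delivered.add(event);
                second.mark();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return delivered;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

In this toy setup the uncommitted second record is produced again after the simulated restart, and each event still carries the column names from the header. A real implementation would have to persist the tracker and would probably want something cheaper than re-reading and skipping, which is where mark and reset on the underlying stream may still be useful.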
