I'll add onto Mike's discussion with the original set of requirements I had in mind (and apply feedback on these as necessary!). This largely overlaps with what Mike said, but I want to make sure it's clear where my proposal was coming from, so we can improve on it as needed. James and Mike are also right that I skipped over the benefits of NiFi in general a bit, so thanks for chiming in there.
- Deploy our bundled parsers without needing custom wrapping on all of them.
- Don't prevent ourselves from building custom wrapping as needed.
- Custom Java parsers with an easy way to hook in, similar to what we already do in Storm.
- One-stop (or at least one-format) configuration, for the case when we're doing some things in NiFi (parsers) and some elsewhere (enrichment and indexing). I don't think it'll always be "start in NiFi, end in Storm", especially as we build out Stellar capability, but I also don't want users learning a different set of configs and config tools for every platform we run on.
- Ability to build out parsers on other systems fairly easily, e.g. Spark.
- Support our current use cases (in particular parser chaining as a more advanced use case).

It really boils down to providing a relatively simple path for users to migrate to NiFi as needed or desired, in a very general way, while not preventing parser-by-parser enhancements.

On Wed, Aug 8, 2018 at 7:14 PM Michael Miklavcic <michael.miklav...@gmail.com> wrote:

> I think it also provides customers greater control over their architecture
> by giving them the flexibility to choose where/how to host their parsers.
>
> To Justin's point about the API, my biggest concern about the RecordReader
> approach is that it is not stable. We already have a similar problem with
> the TransportClient in Elasticsearch - they are prone to changing it in
> minor versions with the advent of their newer REST API, which is
> problematic for ensuring a stable installation.
>
> From my own perspective, our goal with NiFi, at least in part, should be
> the ability to deploy our core parsing infrastructure, i.e.
>
> - pre-built parsers
> - custom Java parsers
> - Stellar transforms
> - custom Stellar transforms
>
> and have the ability to configure it similarly to how we configure parsers
> within Storm.
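
(To ground the one-format point: the intent is that the same sensor config we keep in ZooKeeper today would drive a parser identically whether Storm or NiFi runs it. Roughly, reusing our existing parser config shape — values here are illustrative:

```json
{
  "parserClassName": "org.apache.metron.parsers.bro.BasicBroParser",
  "sensorTopic": "bro",
  "parserConfig": {},
  "fieldTransformations": []
}
```

The platform would change; the config, and the tooling that pushes it, wouldn't.)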
> Consistent with our recent parser chaining and aggregation
> feature, users should be able to construct and deploy similar constructs in
> NiFi. The core architectural shift would be that parser code should be
> platform agnostic. We provide the plumbing in Storm, NiFi, and <Spark
> Streaming?, other>, and platform architects and devops teams can choose how
> and where to deploy.
>
> Best,
> Mike
>
> On Wed, Aug 8, 2018 at 9:57 AM James Sirota <jsir...@apache.org> wrote:
>
> > Integration with NiFi would be useful for parsing low-volume telemetries
> > at the edge. This is a much more resource-friendly way to do it than
> > setting up dedicated Storm topologies. The integration would be that the
> > NiFi processor parses the data and pushes it straight into the enrichment
> > topic, saving us the resources of having multiple parsers in Storm.
> >
> > Thanks,
> > James
> >
> > 07.08.2018, 11:29, "Otto Fowler" <ottobackwa...@gmail.com>:
> > > Why do we start over? We are going back and forth on implementation,
> > > and I don't think we have the same goals or concerns.
> > >
> > > What would be the requirements or goals of Metron integration with NiFi?
> > > How many levels or options for integration do we have?
> > > What are the approaches to choose from?
> > > Who are the target users?
> > >
> > > On August 7, 2018 at 12:24:56, Justin Leet (justinjl...@gmail.com) wrote:
> > >
> > > So how does the MetronRecordReader roll into everything? It seems like
> > > it'd be more useful in the reader-per-format approach, but otherwise it
> > > doesn't really seem like we gain much, and it requires getting
> > > everything linked up properly to be used. Assuming we looked at doing
> > > it that way, is the idea that we'd set up a ControllerService with the
> > > MetronRecordReader and a MetronRecordWriter and then have the
> > > StellarTransformRecord processor configured with those ControllerServices?
> > > How do we manage the configurations of everything that way? How does
> > > the ControllerService get configured with whatever parser(s) are needed
> > > in the flow? Basically, what's your vision for how everything would tie
> > > together?
> > >
> > > I also forgot to mention this in the original writeup, but there's
> > > another reason to avoid the RecordReader: it's not considered stable. See
> > > https://github.com/apache/nifi/blob/master/nifi-commons/nifi-record/src/main/java/org/apache/nifi/serialization/RecordReader.java#L34.
> > > That alone makes me super hesitant to use it, if it can shift out from
> > > under us in even an incremental version.
> > >
> > > I'm also unclear on why the StellarTransformRecord processor matters
> > > for either approach. With the Processor approach you could simply
> > > follow it up with the Stellar processor, the same way you would in the
> > > RecordReader approach. The Stellar processor should be a parallel
> > > improvement, not a conflicting one.
> > >
> > > On Tue, Aug 7, 2018 at 11:50 AM Otto Fowler <ottobackwa...@gmail.com> wrote:
> > >
> > >> A Metron Processor itself isn't really necessary. A MetronRecordReader
> > >> (either the megalithic one or a reader per format) would be a good
> > >> approach. Then have a StellarTransformRecord processor that can do
> > >> Stellar on _any_ record, regardless of source.
> > >>
> > >> On August 7, 2018 at 11:06:22, Justin Leet (justinjl...@gmail.com) wrote:
> > >>
> > >> Thanks for the comments, Otto, this is definitely great feedback. I'd
> > >> love to respond inline, but the email's already starting to lose its
> > >> formatting, so I'll go with the classic "wall of text". Let me know if
> > >> I didn't address everything.
> > >>
> > >> Loading modules (or jars or whatever) outside of our Processor gives
> > >> us the benefit of making it incredibly easy for users to create their
> > >> own parsers.
> >> I would definitely expect our own bundled parsers to be included in our
> >> base NAR, but loading modules means users only have to learn how Metron
> >> wants things lined up and can just plug their parser in. Having said
> >> that, I could see having a wrapper for our bundled parsers that makes it
> >> really easy to just say you want a MetronAsaParser or MetronBroParser,
> >> etc. That would give us the best of both worlds, where it's easy to get
> >> set up with our bundled parsers and also trivial to pull in non-bundled
> >> parsers. What doing this gives us is an easy way to support (hopefully)
> >> every parser that gets made, right out of the box, without us needing to
> >> build a specialized version of everything until we decide to and without
> >> users having to jump through hoops.
> >>
> >> None of this prevents anyone from creating specialized parsers (for perf
> >> reasons, or to use the schema registries, or anything else). It's
> >> probably worthwhile to package up some of our built-in parsers and
> >> customize them to use more specialized features appropriately as we see
> >> things get used in the wild. Like you said, we could likely provide Avro
> >> schemas for some of this and give users a more robust experience on what
> >> we choose to support and provide guidance for other things. I'm also
> >> worried that building specialized schemas becomes problematic for things
> >> like parser chaining (where our routers wrap the underlying messages and
> >> add on their own info). Going down that road potentially requires
> >> anything wrapped to have a specialized schema for the wrapped version in
> >> addition to a vanilla version (although please correct me if I'm missing
> >> something there; I'll openly admit to some shakiness on how that would
> >> be handled).
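
(A quick illustration of the chaining worry: a router that wraps an underlying message produces a different shape than the message on its own. Field names below are purely hypothetical, not our actual envelope format:

```json
{
  "source.type": "syslog_router",
  "routed.source.type": "cisco_asa",
  "original_string": "<the raw wrapped message>"
}
```

A fixed Avro schema would then have to cover both this wrapped form and the plain cisco_asa form.)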
> >>
> >> I also disagree that this is un-NiFi-like, although I'm admittedly not
> >> as skilled there. The basis for doing this is directly inspired by the
> >> JoltTransformer, which is extremely similar to the proposed setup for
> >> our parsers: simply take a spec (in this case the configs, including the
> >> fieldTransformations), and delegate a mapping from byte[] to JSON. The
> >> Jolt library even has an Expression Language (check out
> >> https://community.hortonworks.com/articles/105965/expression-language-with-jolt-in-apache-nifi.html),
> >> so it's not a foreign concept. I believe Simon Ball has already done
> >> some experimenting with getting Stellar running in NiFi, and I'd love to
> >> see Stellar more readily available in NiFi in general.
> >>
> >> Re: the ControllerService, I see this as a way to maintain Metron's use
> >> of ZK as the source of config truth. Users could definitely be using
> >> NiFi and Storm in tandem (parse in NiFi + enrich and index from Storm,
> >> for example). Using the ControllerService gives us a ZK instance as the
> >> single source of truth. That way we aren't forcing users to go to two
> >> different places to manage configs. This also lets us leverage our
> >> existing scripts and our existing infrastructure around configs and
> >> their management and validation very easily. It also gives users a way
> >> to port from NiFi to Storm or vice versa without having to migrate
> >> configs as well. We could also provide the option to configure the
> >> Processor itself with the data (just don't set up a controller service,
> >> and provide the JSON or whatever as one of our properties).
> >>
> >> On Tue, Aug 7, 2018 at 10:12 AM Otto Fowler <ottobackwa...@gmail.com> wrote:
> >>
> >>> I think this is a good idea.
> >>> As I mentioned in the other thread, I've been doing a lot of work on
> >>> NiFi recently. I think the important thing is that what is done should
> >>> be done the NiFi way, not bolting the Metron composition onto NiFi.
> >>> Think of it like the Tao of Unix: the parsers and components should be
> >>> single-purpose and simple, allowing exceptional flexibility in
> >>> composition.
> >>>
> >>> Comments inline.
> >>>
> >>> On August 7, 2018 at 09:27:01, Justin Leet (justinjl...@gmail.com) wrote:
> >>>
> >>> Hi all,
> >>>
> >>> There's interest in being able to run Metron parsers in NiFi, rather
> >>> than inside Storm. I dug into this a bit, and have some thoughts on how
> >>> we could go about this. I'd love feedback on this, along with anything
> >>> we'd consider must-haves as well as future enhancements.
> >>>
> >>> 1. Separate metron-parsers into metron-parsers-common and metron-storm,
> >>> and create metron-parsers-nifi. For this code to be reusable across
> >>> platforms (NiFi, Storm, and anything else in the future), we'll need to
> >>> decouple our parsers and Storm.
> >>>
> >>> +1. The "parsing code" should be a library that implements an interface
> >>> (another library).
> >>>
> >>> The Processors and the Storm things can share them.
> >>>
> >>> - There are also some nice fringe benefits around refactoring our code
> >>> to be substantially more clear and understandable; something which came
> >>> up while allowing for parser aggregation.
> >>> 2. Create a MetronProcessor that can run our parsers.
> >>> - I took a look at how RecordReader could be leveraged (e.g.
> >>> CSVRecordReader), but this is pretty tightly tied into schemas and is
> >>> meant to be used by ControllerServices, which are then used by Processors.
> >>> There's friction involved there in terms of schemas, but also in terms
> >>> of access to ZK configs and things like parser chaining. We might be
> >>> able to leverage it, but it seems like it'd be fairly shoehorned in
> >>> without getting the schema and other benefits.
> >>>
> >>> We won't have to provide schemas for our 'no schema' processors (grok,
> >>> csv, json). All the remaining processors DO have schemas that we know
> >>> about. We can just provide the Avro schemas the same way we provide the
> >>> ES schemas.
> >>>
> >>> The "parsing" should not be conflated with the transform/Stellar in
> >>> NiFi. We should make that separate. Running Stellar over Records would
> >>> be the best thing.
> >>>
> >>> - This Processor would work similarly to Storm: byte[] in -> JSON out.
> >>> - There is a Processor
> >>> <https://github.com/apache/nifi/blob/master/nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/JoltTransformJSON.java>
> >>> that handles loading other JARs, which we can model a
> >>> MetronParserProcessor off of; it handles classpath/classloader issues
> >>> (basically it just sets up a classloader specific to what's being
> >>> loaded and swaps out the Thread's loader when it calls out to outside
> >>> resources).
> >>>
> >>> There should be no reason to load modules outside the NAR. Why do you
> >>> expect to? If each Metron Processor equivalent of a Metron Storm Parser
> >>> is just parsing to JSON, it shouldn't need much. And we could package
> >>> them in the NAR. I would suggest we have a Processor per Parser to
> >>> allow for specialization. It should all be in the NAR.
> >>>
> >>> The Stellar Processor, if you were to support the whole works, would
> >>> possibly need this.
> >>>
> >>> 3.
> >>> Create a MetronZkControllerService to supply our configs to our
> >>> processors.
> >>> - This is a pretty established NiFi pattern for providing access to
> >>> other services needed by a Processor (e.g. databases or large
> >>> configuration files).
> >>> - The same controller service can be used by all Processors to manage
> >>> configs in a consistent manner.
> >>>
> >>> I think controller services would make sense where needed; I'm just not
> >>> sure what you imagine them being needed for?
> >>>
> >>> If the user has NiFi, and a Registry etc., are you saying you imagine
> >>> them using Metron + ZK to manage configurations? Or to be using BOTH
> >>> Storm processors and NiFi Processors?
> >>>
> >>> At that point, we can just NAR our controller service and parser
> >>> processor up as needed, deploy them to NiFi, and let the user provide a
> >>> config for where their custom parsers can be found (i.e. their parser
> >>> jar). This would be 3 NARs (processor, controller-service, and
> >>> controller-service-api, in order to bind the other two together).
> >>>
> >>> Once deployed, our ability to use parsers should fit well into the
> >>> standard NiFi workflow:
> >>>
> >>> 1. Create a MetronZkControllerService.
> >>> 2. Configure the service to point at ZooKeeper.
> >>> 3. Create a MetronParser.
> >>> 4. Configure it to use the controller service + parser jar location +
> >>> any other needed configs.
> >>> 5. Use the outputs as needed downstream (either writing out to Kafka or
> >>> feeding into more MetronParsers, etc.)
> >>>
> >>> Chaining parsers should ideally become a matter of chaining
> >>> MetronParsers (and making sure the enveloping configs carry through
> >>> properly). For parser aggregation, I'd just avoid it entirely until we
> >>> know it's needed in NiFi.
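
(To make the intended split concrete, here's a rough sketch of the kind of platform-agnostic parser contract metron-parsers-common could expose, with a toy implementation. All names here — MessageParser, SimpleCsvParser, the "columns" config key — are hypothetical stand-ins I made up for illustration, not actual Metron or NiFi API:

```java
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical platform-agnostic contract: raw bytes in -> JSON-like maps out.
// A Storm bolt and a NiFi Processor would both just delegate to this.
interface MessageParser {
    void configure(Map<String, Object> config); // e.g. deserialized from the ZK sensor config
    List<Map<String, Object>> parse(byte[] rawMessage);
}

// Toy implementation standing in for a real parser: CSV with configurable columns.
class SimpleCsvParser implements MessageParser {
    private String[] fieldNames = {"timestamp", "message"};

    @Override
    public void configure(Map<String, Object> config) {
        Object cols = config.get("columns");
        if (cols instanceof List) {
            fieldNames = ((List<?>) cols).stream()
                    .map(Object::toString)
                    .toArray(String[]::new);
        }
    }

    @Override
    public List<Map<String, Object>> parse(byte[] rawMessage) {
        String line = new String(rawMessage, StandardCharsets.UTF_8).trim();
        String[] parts = line.split(",", fieldNames.length);
        // Map each configured column name to the corresponding field value.
        Map<String, Object> json = new LinkedHashMap<>();
        for (int i = 0; i < parts.length; i++) {
            json.put(fieldNames[i], parts[i].trim());
        }
        return Collections.singletonList(json);
    }
}
```

A MetronParser processor's onTrigger would then amount to: read the flowfile content, call parse, and route each resulting JSON message to a success relationship — the same delegation the Storm parser bolt does today when it emits to Kafka.)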
> >>>
> >>> Justin
>
> > -------------------
> > Thank you,
> >
> > James Sirota
> > PMC - Apache Metron
> > jsirota AT apache DOT org