Interesting points. I certainly agree with the difference between the two classes of queues, as well as with where output ports stand today. Whether this ends up as an extension of the output port or a whole new component, the shared references/data set case is a common one. There are a lot of options out there that provide some form of this functionality, but they can often be an issue in terms of sizing and complexity, especially in the MiNiFi scenarios Andre covered.
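As a concrete (if simplified) sketch of that shared reference data pattern, here is roughly what the edge side of the polling approach Bryan describes later in the thread could look like: an instance polls a central endpoint for the latest reference data (say, a malicious-IP list) and applies it only when the version advances. The payload shape, field names, and function name are hypothetical, not an existing NiFi/MiNiFi API.

```python
import json

def apply_reference_update(state, payload):
    """Apply a polled reference-data payload (e.g. a malicious-IP list)
    only if it is newer than what this edge instance has already seen.
    Repeated polls of the same version are harmless no-ops, which is
    what makes a simple 'latest value' endpoint workable at the edge."""
    doc = json.loads(payload)
    if doc["version"] > state.get("version", -1):
        state["version"] = doc["version"]
        state["ips"] = doc["ips"]
        return True   # new data applied
    return False      # stale or duplicate poll result, ignored

# Simulated polling cycles on one MiNiFi instance:
state = {}
apply_reference_update(state, '{"version": 1, "ips": ["203.0.113.9"]}')
apply_reference_update(state, '{"version": 1, "ips": ["203.0.113.9"]}')  # duplicate poll
```

This works well precisely because it carries one current piece of information rather than a queue of discrete messages, which is the limitation noted below.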
Another point for consideration is the scope of this information. Core NiFi flows and their components are very much about data, whereas these IPs may not necessarily be data for consumption, per se, but context that governs how the data flow operates. In that case there is a different plane of information exchange, and it may deserve a discrete logical channel. I think we also need to consider, as a logical follow-up, how data transits between the two roles: I receive some data, do some processing/enrichment, and now want to make it available for use in flow control elsewhere.

On Thu, Mar 16, 2017 at 10:08 AM, Bryan Bende <[email protected]> wrote:
> Just wanted to throw out a couple of other ideas...
>
> I ran into a similar situation and ended up creating a web service at
> the core (HandleHttpRequest/HandleHttpResponse) where the edge
> instances could poll for the latest instructions [1][2]. This works
> well when there's basically one new piece of information that you want
> to push out, like the latest result of some calculation, but it is
> obviously not a general-purpose queue.
>
> I believe Joey Frazee was also working on an approach that involved a
> PutSiteToSite processor [3] that functioned similarly to a
> RemoteProcessGroup, except the host name property could be set with
> expression language and the preceding processor could fan out a flow
> file for each host to push to. In this case you would need to know the
> list of MiNiFi hosts to push to.
>
> [1] https://github.com/bbende/nifi-streaming-examples
> [2] https://www.slideshare.net/BryanBende/integrating-nifi-and-flink/23
> [3] https://github.com/jfrazee/nifi-put-site-to-site-bundle
>
> On Thu, Mar 16, 2017 at 9:29 AM, Simon Lucy <[email protected]> wrote:
> > There are really two different classes of queues, and you can't mix
> > them semantically.
> >
> > The pub/sub model:
> >
> >   * ordering isn't guaranteed
> >   * messages appear at least once but may be duplicated
> >   * messages need to be explicitly deleted or aged off
> >   * messages may or may not be persisted
> >
> > The event model:
> >
> >   * ordering is necessary and guaranteed
> >   * messages appear only once
> >   * once read, a message is discarded from the queue
> >   * messages are probably persisted
> >
> > Kafka can be used for both models, but NiFi Output Ports are meant to
> > be event queues. What you could do, though, is have an external message
> > broker connect to the Output Port and distribute the data to
> > subscribers. It could be Kafka, RabbitMQ, ActiveMQ, or AWS's SQS/SNS,
> > whatever makes sense in the context.
> >
> > There's no need to modify or extend NiFi then.
> >
> > S
> >
> >
> >> Aldrin Piri <[email protected]>
> >> Thursday, March 16, 2017 1:07 PM
> >>
> >> Hey Andre,
> >>
> >> Interesting scenario, and I can certainly understand the need for such
> >> functionality. As a bit of background, my mind traditionally goes to
> >> custom controller services used for referencing datasets typically
> >> served up via some service. This means we don't get the Site-to-Site
> >> goodness and may be duplicating effort in terms of transiting
> >> information. Aspects of this need are emerging in some of the initial
> >> C2 efforts, where we have this 1-N dispersion of flows/updates to
> >> instances; initial approaches are certainly reminiscent of the above.
> >> I am an advocate for Site-to-Site being "the way" data transport is
> >> done amongst NiFi/MiNiFi instances, and this is interlaced, in part,
> >> with the JIRA to make this an extension point [1].
> >> Perhaps more simply, to frame this in messaging terms: we have no way
> >> of providing topic semantics between instances and only support
> >> queueing, whether that is push or pull. This new component or mode
> >> would be very compelling in conjunction with the introduction of new
> >> protocols, each with their own operational guarantees/caveats.
> >>
> >> Some of the first thoughts/questions that come to mind are:
> >>
> >> * What does buffering/age-off look like in the context of a
> >>   connection? In part, I think we have the framework functionality
> >>   already in place, but it requires a slightly different thought
> >>   process and context.
> >> * How is progress through the "queue", for lack of a better word,
> >>   managed on a given topic, and how/where does that get stored? This
> >>   would be the analog of offsets.
> >> * Is prioritization still a possibility? At first blush, it seems like
> >>   this would no longer make sense and/or be applicable.
> >> * What impact does this have on provenance? It seems like it would
> >>   still map correctly; just many child send events for a given piece
> >>   of data.
> >> * What would the scheduling of the receiving input port look like?
> >>   Just use the run schedule we have currently? Presumably this would
> >>   be used for updates, so I schedule it to check every N minutes and
> >>   get all the updates since then? (This could potentially be mitigated
> >>   with backpressure/expiration configuration on the associated
> >>   connection.)
> >>
> >> I agree there is a real need here that applies to a number of
> >> situations, and finding a way to support this general data exchange
> >> pattern at the framework level would be most excellent. I look forward
> >> to discussing and exploring this a bit more.
> >>
> >> --aldrin
> >>
> >> [1] https://issues.apache.org/jira/browse/NIFI-1820
> >>
> >>
> >> Andre <[email protected]>
> >> Thursday, March 16, 2017 12:27 PM
> >>
> >> dev,
> >>
> >> I recently created a demo environment where two remote MiNiFi
> >> instances (m1 and m2) were sending a diverse range of security
> >> telemetry (suspicious email attachments, syslog streams, individual
> >> session honeypot logs, merged honeypot session logs, etc.) from edge
> >> to DC via S2S input ports.
> >>
> >> Once some of this data was processed at the hub, I then used output
> >> ports to send content back to the spokes, where the MiNiFi instances
> >> use the flowfile contents as arguments to OS commands (called via
> >> Groovy's String.execute().text in ExecuteScript).
> >>
> >> The idea was to show how NiFi can be used for basic security
> >> orchestration (in this case, updating m1's firewall tables with
> >> malicious IPs observed on m2 and vice versa).
> >>
> >> While crafting the demo I noticed that output ports operate like
> >> queues: if one client consumed data from the port, the other was
> >> unable to obtain the same flowfiles.
> >>
> >> This is obviously not an issue when using two MiNiFi clients (where I
> >> can just create another output port and clone the content), but it
> >> wouldn't scale well to hundreds of clients.
> >>
> >> I wonder if anyone has a suggestion for how to achieve an N-to-1
> >> output port like that? And if not, should we create one?
> >>
> >> Cheers
> >
> >
> > --
> > Simon Lucy
> > Technologist
> > G30 Consultants Ltd
> > +44 77 20 29 4658
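To make the queue-vs-topic distinction in the thread concrete, here is a minimal, NiFi-agnostic sketch. The class names are hypothetical: the first models today's output port behavior that Andre observed, and the second models the proposed behavior, with per-subscriber offsets as the analog Aldrin raises.

```python
from collections import deque

class PortAsQueue:
    """Today's output port semantics: consumers compete for flowfiles;
    each flowfile is handed to exactly one consumer and then gone."""
    def __init__(self):
        self._q = deque()
    def publish(self, flowfile):
        self._q.append(flowfile)
    def consume(self):
        return self._q.popleft() if self._q else None

class PortAsTopic:
    """Proposed semantics: the port keeps an ordered log and each
    subscriber tracks its own offset, so every subscriber sees every
    flowfile in order."""
    def __init__(self):
        self._log = []
        self._offsets = {}
    def publish(self, flowfile):
        self._log.append(flowfile)
    def consume(self, subscriber):
        off = self._offsets.get(subscriber, 0)
        if off >= len(self._log):
            return None  # this subscriber is caught up
        self._offsets[subscriber] = off + 1
        return self._log[off]
```

With the queue, m1 consuming a flowfile means m2 never sees it; with the topic, both m1 and m2 receive every flowfile, which is what the firewall-update demo needs at hundreds of clients. Age-off would then amount to truncating the log below the minimum subscriber offset.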
