Hi Shu,

Thanks for taking the time to review NG and for your excellent feedback. I
would like to respond to the high-level questions you raised regarding
Source/Channel/Sink, and will leave the other questions for other folks to
jump in on.

> In NG, it seems like Sources map to upstream components of an
> OG Sink and Sinks map to downstream components of an OG
> Source. A channel maps to the combination of OG Sources and
> Sinks. That is, I see the high level modeling as:
> Sources - A component which puts events on a channel.
> (Implementation defines where those events come from).
> Sinks - A component which polls events from a channel and
> applies some processing on them. (Implementation defines
> processing)
> Channel - Transport between sources and sinks (Implementation
> defines durability, transport mechanism, etc.)
>
> Please let me know if I have the high level picture correctly.

This is very close to the real picture, but a few subtle details need to be
pointed out. The OG implementation used a channel driver which pushed
events from source to sink within a node. In the NG implementation, on the
other hand:
* There is no Node concept
* There is no Channel Driver
* Every source and sink has its own thread of execution
* A channel decouples the source(s) from sink(s).
* Together, a set of sources, channels and sinks that operate within the
same JVM is called an Agent.

Think of a Flume Agent as a broker in a traditional messaging infrastructure
which receives messages (events) and then forwards them on their way to the
intended destination. The characteristics of the sender of these messages may
be very different from those of the intended receivers, and hence the Channel
acts as a buffer that decouples the two threads of execution that interact
with these two entities.
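
To make the decoupling concrete, here is a minimal sketch, not the actual NG
code. It assumes a bounded in-memory queue as the channel; the names
AgentSketch, MemoryChannelSketch and runAgent are hypothetical and exist only
for illustration. The source and the sink each run on their own thread and
only ever touch the channel, never each other.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

class AgentSketch {
    // Hypothetical stand-in for a channel: a bounded buffer between two threads.
    static final class MemoryChannelSketch {
        private final BlockingQueue<String> queue = new ArrayBlockingQueue<>(16);

        void put(String event) throws InterruptedException {
            queue.put(event);      // blocks while the buffer is full
        }

        String take() throws InterruptedException {
            return queue.take();   // blocks while the buffer is empty
        }
    }

    // Runs one source thread and one sink thread against a shared channel and
    // returns the events the sink received, in arrival order.
    static List<String> runAgent(int eventCount) {
        MemoryChannelSketch channel = new MemoryChannelSketch();
        List<String> delivered = new ArrayList<>();

        // Source: produces events at its own pace and puts them on the channel.
        Thread source = new Thread(() -> {
            try {
                for (int i = 0; i < eventCount; i++) {
                    channel.put("event-" + i);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        // Sink: takes events from the channel and "delivers" them.
        Thread sink = new Thread(() -> {
            try {
                for (int i = 0; i < eventCount; i++) {
                    delivered.add(channel.take());
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        source.start();
        sink.start();
        try {
            source.join();
            sink.join();           // join makes the sink's writes visible here
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        return delivered;
    }

    public static void main(String[] args) {
        System.out.println(runAgent(5));
    }
}
```

If the sink is slower than the source, the source simply blocks in put() once
the buffer fills up; neither side needs to know anything about the other.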

> For the most part, I'm a big fan of the high level modeling in
> NG. One thing I want to bring up is the fact that channel has
> both put() and take() on it. I'm not seeing the case where the
> same component would want to both take() an event for
> processing and also put() an event on to the same channel,
> since that component has a good chance being the one that
> ends up take()ing the event back. Because of that, I think it
> could be a good idea to separate channel into 2 interfaces.
> I can see channel-like implementations, for which it's more
> difficult to implement both put() and take() and I don't see
> the need for both to be in every implementation (though
> most will be, and that's ok). I guess what I'm thinking is
> along the lines of

Hopefully my earlier explanation answers the question of why the channel
must have both put() and take() methods on it. If you imagine a channel
that has put() but no take(), its functionality overlaps with a terminal
sink that only consumes messages from the channel it polls.
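
As a rough illustration of that point, here is a simplified sketch. The
Channel interface below is a hypothetical two-method contract (the real NG
interface has more to it), and ChannelContractSketch, QueueChannel,
sourceEmit and sinkDrain are made-up names. The source side only ever calls
put() and the sink side only ever calls take(), but both must be backed by
the same channel object for an event to get from one to the other.

```java
import java.util.ArrayDeque;
import java.util.Queue;

class ChannelContractSketch {
    // Hypothetical, simplified channel contract for illustration only.
    interface Channel {
        void put(String event);
        String take();             // returns null when the channel is empty
    }

    // A trivial in-memory implementation of both halves of the contract.
    static final class QueueChannel implements Channel {
        private final Queue<String> events = new ArrayDeque<>();

        @Override
        public synchronized void put(String event) {
            events.add(event);
        }

        @Override
        public synchronized String take() {
            return events.poll();
        }
    }

    // Source side: only needs put().
    static void sourceEmit(Channel channel, String event) {
        channel.put(event);
    }

    // Sink side: only needs take(); the "processing" here is just a prefix.
    static String sinkDrain(Channel channel) {
        String event = channel.take();
        return event == null ? null : "processed:" + event;
    }

    public static void main(String[] args) {
        Channel channel = new QueueChannel();
        sourceEmit(channel, "a");
        System.out.println(sinkDrain(channel));
    }
}
```

Drop take() from this channel and the events put on it have nowhere to go
except whatever the channel itself does with them, which is exactly the job
of a terminal sink.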

Thanks,
Arvind




On Mon, Dec 5, 2011 at 5:15 PM, Shu Zhang <[email protected]> wrote:

> Ooops, nvm about the lack of definition, I missed the main flume ng wiki
> page :), still... some javadocs on the Source/Sink interfaces would be
> nice. Hopefully the rest of my comments are still valid.
>
> Cheers
> Shu
> ________________________________________
> From: Shu Zhang [[email protected]]
> Sent: Monday, December 05, 2011 4:30 PM
> To: [email protected]
> Cc: Basier Aziz; Robert Mahfoud; Robert Ragno
> Subject: early flume-ng feedback
>
> Hi all, I just took a first look at the alpha2 of flume-ng and thought I'd
> provide a little early high level feedback. First of all, thanks for doing
> this re-architecture, we (medio systems) have had problems with the OG and
> at first glance NG looks very promising and we're very excited to try it
> out.
>
> Ok feedback. Personally I would find it very helpful if
> Source/Sink/Channel were strongly defined. On the wiki, I see the line:
>
> "You still have sources and sinks and they still do the same thing. They
> are now connected by channels."
>
> I'm not finding that to be true. In OG, events are appended to Sinks via
> append(Event) and polled from Sources via next(). That is, some upstream
> component is responsible for getting events somehow and appending them to
> a Sink; some downstream component is responsible for getting events from a
> Source and processing it. For example, Driver is a special case, acting as
> the downstream component of a Source and an upstream component of a Sink;
> its processing is simply appending events to the downstream Sink.
>
> In NG, it seems like Sources map to upstream components of an OG Sink and
> Sinks map to downstream components of an OG Source. A channel maps to the
> combination of OG Sources and Sinks. That is, I see the high level modeling
> as:
> Sources - A component which puts events on a channel. (Implementation
> defines where those events come from).
> Sinks - A component which polls events from a channel and applies some
> processing on them. (Implementation defines processing)
> Channel - Transport between sources and sinks (Implementation defines
> durability, transport mechanism, etc.)
>
> Please let me know if I have the high level picture correctly.
>
> It seems to me, most Source/Sink/Channel do follow the above definition.
> But the avro stuff seems to be a major divergence. I'm a little confused
> about AvroSource; it doesn't seem to do much. It appears to be just a
> vanilla Source that lets you manually pass in events, which it'll then
> pass straight through to a channel. I'm guessing that's a work in progress?
> Avro transport, seems to me, best modeled as an AvroChannel, which can
> link a source and sink which in turn defines where the events come from and
> what to do with them on the other side; that is modeling avro transport as
> a channel seems like it might provide more flexibility for less
> configuration. Also, modeling avro transport as a channel seems like it
> would make it easier to configure for different reliability levels through
> composition with other channels. The way AvroSink is written can work, but
> I see advantages in modeling avro transport as a channel instead, any
> thoughts?
>
> For the most part, I'm a big fan of the high level modeling in NG. One
> thing I want to bring up is the fact that channel has both put() and take()
> on it. I'm not seeing the case where the same component would want to both
> take() an event for processing and also put() an event on to the same
> channel, since that component has a good chance being the one that ends up
> take()ing the event back. Because of that, I think it could be a good idea
> to separate channel into 2 interfaces. I can see channel-like
> implementations, for which it's more difficult to implement both put() and
> take() and I don't see the need for both to be in every implementation
> (though most will be, and that's ok). I guess what I'm thinking is along
> the lines of
> interface ChannelPoller { take() }
> interface ChannelSender { put() }
> public class FileChannel implements ChannelPoller,
> ChannelSender {...}
> public class SomeSource { ChannelSender _channel; ... }
> public class SomeSink { ChannelPoller _channel; ... }
>
> One final thing is, if I have the right idea on the high level modeling
> then it seems like a method like process() should be defined on the sinks
> interface and a method like List<Event> getNext() should be defined on the
> sources interface, thoughts? What might a sink do if it doesn't have a
> process() defined?
>
> Anyways thanks again for doing this work, I think it's very positive. I'll
> be talking to people internally about helping out, I think it could be good
> for all involved. I apologize if I've misunderstood anything or made any
> wrong assumptions. When we get around to testing it out, I'll get back to
> you guys on lower level issues.
>
> Cheers,
> Shu
>
