Oops, never mind about the lack of definition, I missed the main flume-ng wiki page :). Still, some javadocs on the Source/Sink interfaces would be nice. Hopefully the rest of my comments are still valid.
Cheers,
Shu
________________________________________
From: Shu Zhang [[email protected]]
Sent: Monday, December 05, 2011 4:30 PM
To: [email protected]
Cc: Basier Aziz; Robert Mahfoud; Robert Ragno
Subject: early flume-ng feedback

Hi all,

I just took a first look at the alpha2 of flume-ng and thought I'd provide a little early high-level feedback. First of all, thanks for doing this re-architecture. We (Medio Systems) have had problems with the OG, and at first glance NG looks very promising; we're very excited to try it out.

OK, feedback. Personally, I would find it very helpful if Source/Sink/Channel were strongly defined. On the wiki, I see the line:

"You still have sources and sinks and they still do the same thing. They are now connected by channels."

I'm not finding that to be true. In OG, events are appended to Sinks via append(Event) and polled from Sources via next(). That is, some upstream component is responsible for getting events somehow and appending them to a Sink; some downstream component is responsible for getting events from a Source and processing them. For example, Driver is a special case, acting as the downstream component of a Source and the upstream component of a Sink; its processing is simply appending events to the downstream Sink.

In NG, it seems like Sources map to the upstream components of an OG Sink and Sinks map to the downstream components of an OG Source. A channel maps to the combination of OG Sources and Sinks. That is, I see the high-level modeling as:

Sources - A component which puts events on a channel. (Implementation defines where those events come from.)
Sinks - A component which polls events from a channel and applies some processing to them. (Implementation defines the processing.)
Channel - Transport between sources and sinks. (Implementation defines durability, transport mechanism, etc.)

Please let me know if I have the high-level picture right. It seems to me most Source/Sink/Channel implementations do follow the above definition.
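
To make the three-part model above concrete, here is a minimal Java sketch. Every name in it (ModelSketch, Event, Channel, MemoryChannel, ListSource, CollectingSink) is a hypothetical illustration of the model as described in this mail, not the actual flume-ng API, and the in-memory queue is just one assumed transport.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

public class ModelSketch {
    /** Hypothetical event type; a stand-in, not the flume-ng Event. */
    static final class Event {
        final String body;
        Event(String body) { this.body = body; }
    }

    /** Channel: transport between sources and sinks. */
    interface Channel {
        void put(Event event);
        Event take(); // assumption: returns null when nothing is available
    }

    /** One possible channel: in-memory, FIFO, no durability. */
    static final class MemoryChannel implements Channel {
        private final Queue<Event> queue = new ConcurrentLinkedQueue<>();
        @Override public void put(Event event) { queue.add(event); }
        @Override public Event take() { return queue.poll(); }
    }

    /** Source: puts events on a channel; here they come from a list of strings. */
    static final class ListSource {
        private final Channel channel;
        ListSource(Channel channel) { this.channel = channel; }
        void emit(List<String> lines) {
            for (String line : lines) {
                channel.put(new Event(line));
            }
        }
    }

    /** Sink: polls events from a channel and processes them; here it records bodies. */
    static final class CollectingSink {
        private final Channel channel;
        final List<String> seen = new ArrayList<>();
        CollectingSink(Channel channel) { this.channel = channel; }

        /** Drains the channel and returns the number of events processed. */
        int drain() {
            int count = 0;
            Event event;
            while ((event = channel.take()) != null) {
                seen.add(event.body);
                count++;
            }
            return count;
        }
    }

    public static void main(String[] args) {
        Channel channel = new MemoryChannel();
        new ListSource(channel).emit(Arrays.asList("a", "b", "c"));
        CollectingSink sink = new CollectingSink(channel);
        int processed = sink.drain();
        System.out.println(processed + " events: " + sink.seen); // 3 events: [a, b, c]
    }
}
```

In this sketch the source and sink never reference each other; swapping MemoryChannel for a durable implementation would change reliability without touching either side.
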
But the avro stuff seems to be a major divergence. I'm a little confused about AvroSource; it doesn't seem to do much. It appears to be just a vanilla Source that lets you manually pass in events, which it then passes straight through to a channel. I'm guessing that's a work in progress?

Avro transport seems to me best modeled as an AvroChannel linking a source and a sink, which in turn define where the events come from and what to do with them on the other side. That is, modeling avro transport as a channel seems like it might provide more flexibility for less configuration. It also seems like it would make it easier to configure different reliability levels through composition with other channels. The way AvroSink is written can work, but I see advantages in modeling avro transport as a channel instead. Any thoughts?

For the most part, I'm a big fan of the high-level modeling in NG. One thing I want to bring up is the fact that a channel has both put() and take() on it. I'm not seeing the case where the same component would want to both take() an event for processing and put() an event onto the same channel, since that component has a good chance of being the one that ends up take()ing the event back. Because of that, I think it could be a good idea to separate channel into 2 interfaces. I can see channel-like implementations for which it's more difficult to implement both put() and take(), and I don't see the need for both to be in every implementation (though most will be, and that's ok). I guess what I'm thinking is along the lines of:

    interface ChannelPoller { Event take(); }
    interface ChannelSender { void put(Event event); }

    public class FileChannel implements ChannelPoller, ChannelSender {...}

    public class SomeSource {
        ChannelSender _channel;
        ...
    }

    public class SomeSink {
        ChannelPoller _channel;
        ...
    }

One final thing: if I have the right idea on the high-level modeling, then it seems like a method like process() should be defined on the Sink interface and a method like List<Event> getNext() should be defined on the Source interface. Thoughts? What might a sink do if it doesn't have a process() defined?

Anyway, thanks again for doing this work; I think it's very positive. I'll be talking to people internally about helping out; I think it could be good for all involved. I apologize if I've misunderstood anything or made any wrong assumptions. When we get around to testing it out, I'll get back to you guys on lower-level issues.

Cheers,
Shu
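
For reference, the split-interface idea from the message above can be fleshed out into a small runnable Java sketch. ChannelSender and ChannelPoller are the names proposed in the mail; everything else (SplitChannelSketch, Event, QueueChannel, the emit() and processNext() methods, and the null-on-empty behavior of take()) is an assumption made for illustration and is not part of flume-ng.

```java
import java.util.ArrayDeque;
import java.util.Queue;

public class SplitChannelSketch {
    /** Hypothetical event type; a stand-in, not the flume-ng Event. */
    static final class Event {
        final String body;
        Event(String body) { this.body = body; }
    }

    /** Sending side of a channel: all a source needs to see. */
    interface ChannelSender {
        void put(Event event);
    }

    /** Polling side of a channel: all a sink needs to see. */
    interface ChannelPoller {
        Event take(); // assumption: returns null when the channel is empty
    }

    /** Most channels would implement both sides, like this in-memory queue. */
    static final class QueueChannel implements ChannelSender, ChannelPoller {
        private final Queue<Event> queue = new ArrayDeque<>();
        @Override public synchronized void put(Event event) { queue.add(event); }
        @Override public synchronized Event take() { return queue.poll(); }
    }

    /** The source holds only a ChannelSender, so it cannot take() events back. */
    static final class SomeSource {
        private final ChannelSender channel;
        SomeSource(ChannelSender channel) { this.channel = channel; }
        void emit(String body) { channel.put(new Event(body)); }
    }

    /** The sink holds only a ChannelPoller, so it cannot put() events. */
    static final class SomeSink {
        private final ChannelPoller channel;
        SomeSink(ChannelPoller channel) { this.channel = channel; }

        /** Processes one event; returns null if none was available. */
        String processNext() {
            Event event = channel.take();
            return event == null ? null : "processed: " + event.body;
        }
    }

    public static void main(String[] args) {
        QueueChannel channel = new QueueChannel();
        SomeSource source = new SomeSource(channel);
        SomeSink sink = new SomeSink(channel);
        source.emit("hello");
        System.out.println(sink.processNext()); // processed: hello
    }
}
```

The compiler then enforces the separation: a channel-like component that can only send could implement ChannelSender alone and still be handed to SomeSource.
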
