Hi,

Due to design decisions made very early on in Flume NG - specifically the fact that Sink only has a simple process() method - I don't see a good way to get multiple sinks pulling from the same channel in a way that is backwards-compatible with the current implementation.
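To make the constraint concrete, here is a small sketch in plain Java. It does not use Flume's actual classes; ToyChannel and ToySink are hypothetical names invented for illustration. It assumes each sink takes events from the channel inside its own process() call, so two sinks attached to one channel split the events between them instead of each seeing all of them.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

final class SharedChannelSketch {

    /** Hypothetical stand-in for a channel: a FIFO queue of events. */
    static final class ToyChannel {
        private final Deque<Integer> events = new ArrayDeque<>();

        void put(int event) {
            events.addLast(event);
        }

        /** Removes and returns the oldest event, or null if the channel is empty. */
        Integer take() {
            return events.pollFirst();
        }
    }

    /** Hypothetical sink: process() decides by itself what to take from the channel. */
    static final class ToySink {
        final List<Integer> delivered = new ArrayList<>();
        private final ToyChannel channel;

        ToySink(ToyChannel channel) {
            this.channel = channel;
        }

        /** Takes at most one event; returns false when nothing was available. */
        boolean process() {
            Integer event = channel.take();
            if (event == null) {
                return false;
            }
            delivered.add(event);
            return true;
        }
    }

    /** Runs two sinks against one channel holding events 1..4. */
    static List<List<Integer>> run() {
        ToyChannel mc = new ToyChannel();
        for (int i = 1; i <= 4; i++) {
            mc.put(i);
        }
        ToySink hsa = new ToySink(mc);
        ToySink hsb = new ToySink(mc);
        // Non-short-circuit "|" so both sinks get a turn on every pass,
        // roughly like a round-robin scheduler.
        while (hsa.process() | hsb.process()) {
            // keep draining until both sinks find the channel empty
        }
        return List.of(hsa.delivered, hsb.delivered);
    }

    public static void main(String[] args) {
        System.out.println(run()); // prints [[1, 3], [2, 4]]
    }
}
```

With strict alternation, hsa ends up with 1,3 and hsb with 2,4. With real threads the split would depend on timing, but in neither case does each sink see every event.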
Probably the "right" way to support this would be to have an interface where the SinkRunner (or something outside of each Sink) is in control of the transaction, and then it can easily send events to each sink serially or in parallel within a single transaction. I think that is basically what you are describing.

If you look at SourceRunner and SourceProcessor you will see similar ideas to what you are describing, but they are only implemented at the Source->Channel level. The current SinkProcessor is not an analog of SourceProcessor, but if it were, then I think that's where this functionality might fit. However, once you do that you have to handle a ton of failure cases and threading models in a very general way, which might be tough to get right for all use cases. I'm not 100% sure, but I think that's why this was not pursued at the time. To me, this seems like a potential design change (it would have to be very carefully thought out) to consider for a future major Flume code line (maybe a Flume 2.x).

By the way, if one is trying to get maximum throughput, then duplicating events onto multiple channels and having different threads running the sinks (the current design) will be faster and more resilient in general than a single thread and a single channel writing to multiple sinks/destinations. The multiple-channel design pattern allows periodic downtimes or delays on a single sink to not affect the others, assuming the channel sizes are large enough for buffering during downtime and assuming that each sink is fast enough to recover from temporary delays. Without a dedicated buffer per destination, one is at the mercy of the slowest sink at every stage in the transaction.

One last thing worth noting is that the current channels are all well ordered. This means that Flume currently provides a weak ordering guarantee (across a single hop).
That is a helpful property in the context of testing and validation, and it is what many people expect if they are storing logs on a single hop. I hope we don't backpedal on that weak ordering guarantee without a really good reason.

Regards,
Mike

On Fri, Aug 10, 2012 at 9:30 PM, Wang, Yongkun | Yongkun | BDD <[email protected]> wrote:

> Hi Jhhani,
>
> Yes, we can use two (or several) channels to fan out data to different
> sinks. Then we will have two channels with same data, which may not be an
> optimized solution. So I want to use just ONE channel, creating a
> processor to pull the data once from the channel, then distributing to
> different sinks.
>
> Regards,
> Yongkun Wang
>
> On 12/08/10 18:07, "Juhani Connolly" <[email protected]> wrote:
>
> >Hi Yongkun,
> >
> >I'm curious why you need to pull the data twice from the sink? Do you
> >need all sinks to have read the same amount of data? Normally for the
> >case of splitting data into batch and analytics, we will send data from
> >the source to two separate channels and have the sinks read from
> >separate channels.
> >
> >On 08/10/2012 02:48 PM, Wang, Yongkun | Yongkun | BDD wrote:
> >> Hi Denny,
> >>
> >> I am working on the patch now, it's not difficult. I have listed the
> >> changes in that JIRA.
> >> I think you misunderstand my design, I didn't maintain the order of the
> >> events. Instead I make sure that each sink will get the same events (or
> >> different events specified by selector).
> >>
> >> Suppose Channel (mc) contains the following events: 4,3,2,1
> >>
> >> If simply enable it by configuration, it may work like this:
> >> Sink "hsa" may get 1,3;
> >> Sink "hsb" may get 2,4;
> >> So different sink will get different data. Is this what user wants?
> >>
> >>
> >> In my design, "hsa" and "hsb" will both get "4,3,2,1". This is a typical
> >> case when user want to fan-out the data into two places (eg. One for batch
> >> and and another for real-time analysis).
> >>
> >> Regards,
> >> Yongkun Wang
> >>
> >>
> >> On 12/08/10 14:29, "Denny Ye" <[email protected]> wrote:
> >>
> >>> hi Yongkun,
> >>>
> >>> JIRA can be accessed now.
> >>>
> >>> I think it might be difficult to understand the order of events from
> >>> your thought. If we don't care about the order, can discuss the value and
> >>> feasibility. In my opinion, data ingest flow is order unawareness, at
> >>> least, not such important for us. You can try to verify your proposal and
> >>> give us result. It may be some difficulties in keeping transaction with
> >>> several Sinks.
> >>>
> >>> -Regards
> >>> Denny Ye
> >>>
> >>>
> >>> 2012/8/10 Wang, Yongkun | Yongkun | BDD <[email protected]>
> >>>
> >>>> JIRA is down again? I cannot connect to it and comment there.
> >>>>
> >>>> I have a proposal in "Transactional Multiplex (fan out) Sink"):
> >>>> https://issues.apache.org/jira/browse/FLUME-1435
> >>>> Which contains the design of one channel to multiple sinks.
> >>>>
> >>>> You can search the email since JIRA cannot be accessed.
> >>>>
> >>>> I think this is more than a configuration issue. If simply enable several
> >>>> sinks on the same channel, they will take it either in a round-robin mode
> >>>> or in a unpredictable mode if the speed of sinks are different.
> >>>>
> >>>> So it's better to have a even higher level transaction control instead of
> >>>> the transaction in the process() of each sink, as I describe in
> >>>> FLUME-1435.
> >>>>
> >>>> Regards,
> >>>> Yongkun Wang
> >>>>
> >>>>
> >>>> On 12/08/10 12:30, "Denny Ye (JIRA)" <[email protected]> wrote:
> >>>>
> >>>>> Denny Ye created FLUME-1479:
> >>>>> -------------------------------
> >>>>>
> >>>>> Summary: Multiple Sinks can connect to single Channel
> >>>>> Key: FLUME-1479
> >>>>> URL: https://issues.apache.org/jira/browse/FLUME-1479
> >>>>> Project: Flume
> >>>>> Issue Type: Bug
> >>>>> Components: Configuration
> >>>>> Affects Versions: v1.2.0
> >>>>> Reporter: Denny Ye
> >>>>> Assignee: Denny Ye
> >>>>> Fix For: v1.3.0
> >>>>>
> >>>>>
> >>>>> If we has one Channel (mc) and two Sinks (hsa, hsb), then they may be
> >>>>> connected with each other with configuration example
> >>>>> {quote}
> >>>>> agent.sinks.hsa.channel = mc
> >>>>> agent.sinks.hsb.channel = mc
> >>>>> {quote}
> >>>>> It means that there have multiple Sinks can connect to single Channel.
> >>>>> Normally, one Sink only can connect to unified Channel
> >>>>>
> >>>>> --
> >>>>> This message is automatically generated by JIRA.
> >>>>> If you think it was sent incorrectly, please contact your JIRA
> >>>>> administrators:
> >>>>> https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
> >>>>> For more information on JIRA, see: http://www.atlassian.com/software/jira
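
For reference, the multiple-channel pattern discussed in this thread (one channel per sink, with the source copying every event to both channels) might be configured roughly as below. This is only a sketch: the source name src, the channel names mc1 and mc2, the memory channel type, and the capacity values are assumptions for illustration. Only the sink names hsa and hsb come from FLUME-1479. Source and sink types are omitted because the thread does not specify them.

```properties
# Hypothetical agent layout: one source, two channels, two sinks.
agent.sources = src
agent.channels = mc1 mc2
agent.sinks = hsa hsb

# The replicating selector copies each event to every listed channel.
agent.sources.src.channels = mc1 mc2
agent.sources.src.selector.type = replicating

# Capacity is an assumed value; size it to cover expected sink downtime.
agent.channels.mc1.type = memory
agent.channels.mc1.capacity = 10000
agent.channels.mc2.type = memory
agent.channels.mc2.capacity = 10000

# Each sink drains its own channel at its own pace.
agent.sinks.hsa.channel = mc1
agent.sinks.hsb.channel = mc2
```

With this layout both hsa and hsb receive every event, and a stall in one sink only fills that sink's channel.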
