Hi,

On 02/27/2012 02:44 AM, Brock Noland wrote:
> Hello,
>
> This might be something for the developer guide or it might be
> somewhere and I just missed it.  I feel like we should set down some
> expectations in regards to:
>
> 1) Source behavior when:
>    a) Channel put fails
>    b) Source started but is unable to obtain new events for some reason
> 2) Channel behavior when:
>    a) Channel capacity exceeded
>    b) take when channel is empty
> 3) Sink behavior when:
>    a) Channel take returns null
>    b) Sink cannot write to the downstream location
Totally agree. There is little consistency in implementations right now, and part of the problem is that some of the interfaces aren't documented. We should probably open a JIRA to document the sink and source interfaces, including their failure patterns.

My take on your issues:

Sources:
- Channel put failure is pretty clear-cut: the failure should be returned, and the previous agent should roll back the transaction
- Inability to obtain events should probably be logged at a high level
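To make the first point concrete, here is a sketch of the source-side contract: puts are staged in a transaction, and on ChannelException the transaction is rolled back and failure is signalled to the caller so the upstream agent can roll back its own transaction. The Channel and Transaction classes below are simplified stand-ins I wrote for illustration, not the real Flume interfaces.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

public class SourcePutSketch {
    static class ChannelException extends RuntimeException {
        ChannelException(String m) { super(m); }
    }

    /** Bounded in-memory channel with toy transaction semantics. */
    static class Channel {
        final Queue<String> committed = new ArrayDeque<>();
        final int capacity;
        Channel(int capacity) { this.capacity = capacity; }

        class Transaction {
            private final List<String> staged = new ArrayList<>();
            void put(String event) {
                if (committed.size() + staged.size() >= capacity)
                    throw new ChannelException("capacity exceeded");
                staged.add(event);
            }
            void commit()   { committed.addAll(staged); staged.clear(); }
            void rollback() { staged.clear(); }
        }
        Transaction getTransaction() { return new Transaction(); }
    }

    /**
     * Source-side batch append: true = committed; false = rolled back,
     * which should surface as a failure to the upstream agent so it can
     * roll back its own transaction and retry later.
     */
    static boolean appendBatch(Channel channel, List<String> batch) {
        Channel.Transaction tx = channel.getTransaction();
        try {
            for (String event : batch) tx.put(event);
            tx.commit();
            return true;
        } catch (ChannelException e) {
            tx.rollback();
            return false;
        }
    }

    public static void main(String[] args) {
        Channel ch = new Channel(2);
        System.out.println(appendBatch(ch, List.of("a", "b"))); // true
        System.out.println(appendBatch(ch, List.of("c")));      // false: channel full
    }
}
```

The key property is that a partial batch never reaches the channel: either every event in the batch is committed, or none are and the failure propagates upstream.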

Channels:
- Only memory channels have a capacity; when it is exceeded, throwing ChannelException seems the clear-cut reaction. For takes, blocking is certainly preferable. That said, I believe it is more important that a sink return backoff when no data was processed.
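The behaviour described here can be sketched with a bounded blocking queue (a toy stand-in, not the actual MemoryChannel code): put waits a bounded time for space and then throws ChannelException, while take blocks for a short timeout before returning null, so a polling sink does not spin at 100% CPU against an empty channel.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

public class BlockingChannelSketch {
    static class ChannelException extends RuntimeException {
        ChannelException(String m) { super(m); }
    }

    private final BlockingQueue<String> queue;
    private final long timeoutMillis;

    BlockingChannelSketch(int capacity, long timeoutMillis) {
        this.queue = new ArrayBlockingQueue<>(capacity);
        this.timeoutMillis = timeoutMillis;
    }

    void put(String event) {
        try {
            // Wait a bounded time for space, then fail loudly so the
            // source can propagate the failure upstream.
            if (!queue.offer(event, timeoutMillis, TimeUnit.MILLISECONDS))
                throw new ChannelException("capacity exceeded");
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new ChannelException("interrupted during put");
        }
    }

    /** Blocks up to the timeout; returns null only after waiting, never instantly. */
    String take() {
        try {
            return queue.poll(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return null;
        }
    }
}
```

A brief blocking take is exactly what keeps a tight sink loop from burning CPU when the channel is empty, which is the pattern behind the FLUME-998 complaint below.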

Sinks:
- Ready if data was processed.
- Backoff if no data was processed/the sink needs breathing space.
- Rollback and backoff if the downstream write failed
- Throw EventDeliveryException if the sink has a serious problem that puts it out of commission (this would result in failover or removal from load balancing). This would be for cases where the downstream is suspected to be unavailable long-term (e.g. an Avro sink has repeatedly failed X times in a row)
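One way to encode the four rules above (a sketch of the contract, not Flume's actual Sink interface): READY when an event was written, BACKOFF when the channel was empty or a single write failed, and EventDeliveryException once the downstream has failed maxFailures times in a row, which a failover or load-balancing processor would act on. The channelTake/downstream hooks and the maxFailures threshold are names I invented for this sketch.

```java
import java.util.function.Consumer;
import java.util.function.Supplier;

public class SinkContractSketch {
    enum Status { READY, BACKOFF }

    static class EventDeliveryException extends Exception {
        EventDeliveryException(String m) { super(m); }
    }

    private final Supplier<String> channelTake;   // returns null = channel empty
    private final Consumer<String> downstream;    // throws on write failure
    private final int maxFailures;
    private int consecutiveFailures = 0;

    SinkContractSketch(Supplier<String> channelTake,
                       Consumer<String> downstream, int maxFailures) {
        this.channelTake = channelTake;
        this.downstream = downstream;
        this.maxFailures = maxFailures;
    }

    Status process() throws EventDeliveryException {
        String event = channelTake.get();
        if (event == null) {
            return Status.BACKOFF;        // nothing to do: give breathing space
        }
        try {
            downstream.accept(event);
            consecutiveFailures = 0;
            return Status.READY;          // data was processed
        } catch (RuntimeException e) {
            consecutiveFailures++;        // in real code: also roll back the take
            if (consecutiveFailures >= maxFailures)
                throw new EventDeliveryException(
                    "downstream unavailable after " + consecutiveFailures + " attempts");
            return Status.BACKOFF;        // transient failure: back off and retry
        }
    }
}
```

The counter is what distinguishes a transient hiccup (back off, retry) from the long-term outage case, so the sink processor only fails over once the threshold is crossed.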

I tried to kick off some discussion about this with regard to sink failover too. When developing the failover sink processor I assumed that a failed sink would throw SinkDeliveryException; see https://issues.apache.org/jira/browse/FLUME-981
> This comes about when I noticed some inconsistencies.  For example, a
> take in MemoryChannel blocks for a few seconds by default and
> JDBCChannel does not (FLUME-998). Combined with HDFSEventSink, this
> causes tremendous amounts of CPU consumption. Also, currently if HDFS
> is unavailable for a period, Flume needs to be restarted (FLUME-985).
>
> My general thoughts are based on experience working with JMS-based services.
>
> 1) Source/Channel/Sink should not require a restart when up- or
> downstream services are restarted or become temporarily unavailable.
> 2) Channel capacity being exceeded should not lead to sources dying
> and thus requiring a Flume restart. This will happen when downstream
> destinations slow down for various reasons.
What would be a preferable alternative? Sources with an upstream should be able to signal that the transaction needs to be rolled back. Beyond that, though, throwing away data that couldn't be delivered seems to be the only possibility with a plain channel. Hopefully we can do something like the buffers in Scribe.

> Brock

