I was investigating some failure scenarios with the RecoverableMemoryChannel and the JDBC channel. The first scenario is a local Flume agent writing to a downstream Flume agent while that downstream agent is unavailable.
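For reference, the upstream agent was wired up roughly like this (trimmed and typed from memory, so treat the channel type and property names as approximate; hostnames are placeholders):

    agent.sources = src
    agent.channels = ch
    agent.sinks = snk

    agent.sources.src.type = netcat
    agent.sources.src.bind = 127.0.0.1
    agent.sources.src.port = 44444
    agent.sources.src.channels = ch

    agent.channels.ch.type = org.apache.flume.channel.recoverable.memory.RecoverableMemoryChannel
    agent.channels.ch.capacity = 5000

    agent.sinks.snk.type = avro
    agent.sinks.snk.hostname = downstream.example.com
    agent.sinks.snk.port = 4141
    agent.sinks.snk.channel = ch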
For the RecoverableMemoryChannel, I forced the in-memory queue, which was configured with a capacity of 5k events, to fill up. After writing a batch of 10k events and waiting several seconds, I re-enabled the downstream node. I was hoping that all 10k events would make it through, but I experienced data loss. This is consistent with the exception I was seeing in the logs:

    org.apache.flume.ChannelException: Space for commit to queue couldn't
    be acquired Sinks are likely not keeping up with sources, or the
    buffer size is too tight

With the JDBC channel, I ran into FLUME-1224, which causes the local agent to crash. You can see more details about my configuration in that jira.

My question is two-fold:

1) Is it a design goal of the RecoverableMemoryChannel to avoid data loss in this scenario? The documentation on it is very scant, and it wasn't clear to me from a quick look at the code.

2) Is there some other channel configuration that can avoid data loss in this scenario?

It also seems there's a need for a MemoryChannel that starts spilling to disk when it overflows, rather than writing every event to disk. Is that in the works, or should I file a jira? (A rough sketch of what I mean is below my signature.)

Thanks,
Joe
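P.S. To make the overflow idea concrete, here is a rough sketch of the put() path I have in mind. This is toy code of mine, not anything that exists in Flume; the class name is invented, and the take/replay side for spilled events is omitted entirely:

    import java.io.BufferedOutputStream;
    import java.io.DataOutputStream;
    import java.io.File;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.util.ArrayDeque;
    import java.util.Queue;

    // Toy illustration: events stay in memory until the queue is full,
    // and only the overflow is appended to a spill file, so the common
    // case never pays the disk-write cost.
    public class SpillingBuffer {
        private final int capacity;
        private final Queue<byte[]> memory = new ArrayDeque<byte[]>();
        private final DataOutputStream spill;

        public SpillingBuffer(int capacity, File spillFile) throws IOException {
            this.capacity = capacity;
            this.spill = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream(spillFile, true)));
        }

        public synchronized void put(byte[] event) throws IOException {
            if (memory.size() < capacity) {
                memory.add(event);            // fast path: no disk I/O
            } else {
                spill.writeInt(event.length); // overflow: length-prefixed record
                spill.write(event);
                spill.flush();                // make the spilled event durable
            }
        }
    }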