Hi David,

For your questions:
1) In this case Samza recovered, but the changelog message was lost. In 0.10.1, KafkaSystemProducer has a race condition: there is a small chance that a later successful send overrides a previous failure. The bug is fixed in the upcoming 0.11.0 release (SAMZA-1019). The fix allows you to catch the exception and then decide whether to ignore or rethrow it. In the latter case the container will fail, and Samza guarantees the message will be reprocessed after it restarts. (There is a small sketch of this at the bottom of this mail.)

2) There are several things that might help in your case. First, you can turn on compression for your changelog stream; that usually saves about 20-30%. Second, you can bump up max.request.size for the producer; in that case you need to make sure the broker's max message size is raised accordingly. Last, you might also try to split the key into subkeys so that each value is smaller. (See the config and chunking sketches below the quoted mail.)

Thanks,
Xinyu

On Thu, Oct 6, 2016 at 9:30 AM, David Yu <david...@optimizely.com> wrote:
> Hi,
>
> Our Samza job (0.10.1) throws RecordTooLargeExceptions when flushing KV
> store changes to the changelog topic, as well as when sending outputs to
> Kafka. We have two questions about this problem:
>
> 1. It seems that after the affected containers failed multiple times, the
> job was able to recover and move on. This is a bit hard to understand. How
> could this be recoverable? We were glad it actually did, but are
> uncomfortable not knowing the reason behind it.
> 2. What would be the best way to prevent this from happening? Since Samza
> serde happens behind the scenes, there does not seem to be a good way to
> find out the payload size in bytes before putting it into the KV store. Any
> suggestions on this?
>
> Thanks,
> David
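To make 1) concrete: a minimal sketch of what the fix enables, assuming the earlier failure surfaces as a SamzaException on a later collector.send(). The task class, the output topic, and the okToDrop policy are all hypothetical:

```java
import org.apache.samza.SamzaException;
import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.system.OutgoingMessageEnvelope;
import org.apache.samza.system.SystemStream;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskCoordinator;

public class ForwardingTask implements StreamTask {
  // Hypothetical output stream; substitute your own system and topic.
  private static final SystemStream OUTPUT = new SystemStream("kafka", "my-output-topic");

  @Override
  public void process(IncomingMessageEnvelope envelope, MessageCollector collector,
      TaskCoordinator coordinator) throws Exception {
    try {
      collector.send(new OutgoingMessageEnvelope(OUTPUT, envelope.getKey(), envelope.getMessage()));
    } catch (SamzaException e) {
      // With SAMZA-1019, an earlier async send failure is reported here instead
      // of being silently overwritten by a later success.
      if (okToDrop(envelope)) {
        return; // ignore: accept losing this record
      }
      // Rethrow: the container fails, Samza restarts it, and input since the
      // last checkpoint is reprocessed, so the record is not lost.
      throw e;
    }
  }

  // Placeholder policy; always rethrow in this sketch.
  private boolean okToDrop(IncomingMessageEnvelope envelope) {
    return false;
  }
}
```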
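For the first two suggestions in 2): Samza passes systems.<system-name>.producer.* through to the underlying Kafka producer, so both knobs live in the job config. The system name "kafka" and the 5 MB value here are just examples:

```properties
# Enable producer-side compression (snappy here; gzip and lz4 also work).
systems.kafka.producer.compression.type=snappy

# Raise the producer's per-request cap (bytes); the Kafka default is 1 MB.
systems.kafka.producer.max.request.size=5242880
```

The broker-side message.max.bytes (or the topic-level max.message.bytes on the changelog topic) has to be at least as large, otherwise the broker will reject the record even though the producer was willing to send it.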
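For the subkey idea: a sketch that chunks the serialized value across several store keys, using the Serde and KeyValueStore interfaces. The key#i / key#count naming and the 512 KB chunk size are arbitrary choices. As a side effect, serializing up front gives you the payload size in bytes that you asked about in 2:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

import org.apache.samza.serializers.Serde;
import org.apache.samza.storage.kv.KeyValueStore;

public class ChunkedStoreWriter<V> {
  private static final int CHUNK_BYTES = 512 * 1024; // stay well under max.request.size

  private final KeyValueStore<String, byte[]> store;
  private final Serde<V> valueSerde;

  public ChunkedStoreWriter(KeyValueStore<String, byte[]> store, Serde<V> valueSerde) {
    this.store = store;
    this.valueSerde = valueSerde;
  }

  public void put(String key, V value) {
    // Serializing here also tells you the exact changelog payload size.
    byte[] bytes = valueSerde.toBytes(value);
    int chunks = (bytes.length + CHUNK_BYTES - 1) / CHUNK_BYTES;
    store.put(key + "#count", Integer.toString(chunks).getBytes(StandardCharsets.UTF_8));
    for (int i = 0; i < chunks; i++) {
      int from = i * CHUNK_BYTES;
      store.put(key + "#" + i,
          Arrays.copyOfRange(bytes, from, Math.min(from + CHUNK_BYTES, bytes.length)));
    }
  }

  public V get(String key) {
    byte[] countBytes = store.get(key + "#count");
    if (countBytes == null) {
      return null;
    }
    int chunks = Integer.parseInt(new String(countBytes, StandardCharsets.UTF_8));
    // Reassemble the chunks in order before deserializing.
    byte[][] parts = new byte[chunks][];
    int total = 0;
    for (int i = 0; i < chunks; i++) {
      parts[i] = store.get(key + "#" + i);
      total += parts[i].length;
    }
    byte[] bytes = new byte[total];
    int pos = 0;
    for (byte[] part : parts) {
      System.arraycopy(part, 0, bytes, pos, part.length);
      pos += part.length;
    }
    return valueSerde.fromBytes(bytes);
  }
}
```

One caveat: if a new value needs fewer chunks than the old one, the stale trailing chunks should be deleted as well; that bookkeeping is omitted here for brevity.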