We recently had an issue that caused us to lose the contents of the checkpoint topic for one of our Samza jobs. We were not that concerned about losing the checkpointed offsets, so we simply restarted the job. We then started seeing some very strange results, and were able to trace them back to the fact that the changelog partition mapping had changed. We were unaware that this mapping was stored in the checkpoint topic.

Can someone explain why this mapping is necessary? I was under the impression that the number of changelog partitions is identical to the number of task instances. If so, can't partitions just be assigned based on the task number?

Assuming the mapping is necessary, it would be nice if it were deterministic. Looking at JobCoordinator, the assignment seems to depend on the order in which entries come back from the map produced by the SystemStreamPartitionGrouper. This non-determinism seems to have been the cause of our issues. Obviously data loss is a problem in its own right, but it seems like Samza could have recreated the original mapping. Should I file a bug on this?
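For what it's worth, here is a rough sketch of the kind of deterministic assignment I have in mind. This is not Samza code: the class name, the method, and the task names are all made up for illustration, and I am assuming that task names are unique strings. The idea is just to sort the task names before handing out changelog partitions, so the result no longer depends on map iteration order.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical helper for illustration only; this is not part of Samza.
final class DeterministicChangelogMapping {

    private DeterministicChangelogMapping() {}

    // Assigns changelog partitions 0..n-1 to task names in sorted order,
    // so the result is independent of the iteration order of the input.
    static Map<String, Integer> assign(Collection<String> taskNames) {
        Map<String, Integer> mapping = new LinkedHashMap<>();
        int nextPartition = 0;
        // TreeSet orders the names lexicographically and drops duplicates.
        for (String taskName : new TreeSet<>(taskNames)) {
            mapping.put(taskName, nextPartition++);
        }
        return mapping;
    }

    public static void main(String[] args) {
        // The same task names, arriving in two different orders.
        Map<String, Integer> first =
            assign(Arrays.asList("Partition 2", "Partition 0", "Partition 1"));
        Map<String, Integer> second =
            assign(Arrays.asList("Partition 1", "Partition 2", "Partition 0"));

        System.out.println(first);
        System.out.println(first.equals(second)); // true: input order does not matter
    }
}
```

Note that the ordering here is lexicographic, so "Partition 10" sorts before "Partition 2". That is still deterministic, but a sort-based scheme like this only reproduces the same mapping as long as the set of task names stays the same; adding or removing a task would shift the assignments of the tasks that sort after it.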
--
Tommy Becker
Senior Software Engineer
Digitalsmiths, A TiVo Company
www.digitalsmiths.com
tobec...@tivo.com