GitHub user DomenicPuzio commented on the issue:

https://github.com/apache/incubator-metron/pull/359

@cestella, I changed the PR name; thanks for the tip.

I completely agree that we don't want to miss failures caused by an enrichment source (like MySQL) or by the enrichment infrastructure. However, once a tuple has been put into the JoinBolt's cache, isn't it already past the point where those failures could occur? If an enrichment fails, Storm will time out waiting for an ack that never arrives, and the message will be re-sent; but if the message is already in the cache, isn't its journey complete?

My reasoning for placing the ack there was (1) at this stage of the topology, that tuple has been correctly processed, and (2) for simplicity's sake, so that we wouldn't have to modify `streamMessageMap`. I do agree that acking everything after the join has taken place also makes a lot of sense, and I can work on that if you would like.

We saw the duplicate data while running this in our development environment in EC2. Perhaps this is due to different ack timeout settings in the Storm topologies? We have reproduced the duplication many times on our end.
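
To make the two ack placements concrete, here is a minimal sketch that does not use Storm or Metron at all. Everything in it is hypothetical: the class `AckPlacementSketch`, the stream names in `REQUIRED`, and the representation of a tuple as a `{key, stream}` pair are assumptions for illustration, and the "acked" list merely stands in for calls to the collector's ack. It only shows which tuples each placement would ack when one stream for a message never reaches the join.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class AckPlacementSketch {

    // Hypothetical streams the join waits for; not taken from Metron.
    static final Set<String> REQUIRED =
            new HashSet<>(Arrays.asList("message", "geo", "host"));

    // Placement 1: ack each tuple as soon as it has been put into the cache.
    static List<String> ackOnCache(List<String[]> arrivals) {
        List<String> acked = new ArrayList<>();
        for (String[] t : arrivals) {
            acked.add(t[0] + "/" + t[1]); // cached, then acked right away
        }
        return acked;
    }

    // Placement 2: hold tuples per message key and ack them only once
    // every required stream for that key has arrived at the join.
    static List<String> ackAfterJoin(List<String[]> arrivals) {
        Map<String, List<String>> pending = new HashMap<>();
        Map<String, Set<String>> seen = new HashMap<>();
        List<String> acked = new ArrayList<>();
        for (String[] t : arrivals) {
            String key = t[0];
            String stream = t[1];
            pending.computeIfAbsent(key, k -> new ArrayList<>()).add(key + "/" + stream);
            seen.computeIfAbsent(key, k -> new HashSet<>()).add(stream);
            if (seen.get(key).containsAll(REQUIRED)) {
                acked.addAll(pending.remove(key)); // join complete: ack all held tuples
                seen.remove(key);
            }
        }
        return acked; // tuples of incomplete keys stay unacked
    }

    public static void main(String[] args) {
        // k1 receives all three streams; k2 never receives "host".
        List<String[]> arrivals = Arrays.asList(
                new String[] {"k1", "message"},
                new String[] {"k1", "geo"},
                new String[] {"k2", "message"},
                new String[] {"k1", "host"},
                new String[] {"k2", "geo"});
        System.out.println("ack on cache:   " + ackOnCache(arrivals));
        System.out.println("ack after join: " + ackAfterJoin(arrivals));
    }
}
```

In this toy run, the first placement acks all five tuples, while the second acks only the three belonging to `k1` and leaves the two `k2` tuples pending. Whether that difference matters in the real topology depends on how the tuples are anchored upstream, which the sketch does not model.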