GitHub user DomenicPuzio commented on the issue:
https://github.com/apache/incubator-metron/pull/359
@cestella, I made the change to the PR name; thanks for the tip there.
I completely agree that we don't want to miss out on catching failures due
to an enrichment source (like MySQL) or in the enrichment infrastructure.
However, once a tuple has been put into the JoinBolt's cache, isn't it already
past the point where these pitfalls could occur? If an enrichment fails, Storm
will time out waiting for an ack that never arrives, so the message will be
replayed; but if the message is already in the cache, isn't its journey
complete?
My thought with placing the ack there was (1) at this stage of the
topology, that tuple has been correctly processed, and (2) for simplicity's
sake so that we wouldn't have to modify `streamMessageMap`. I do agree that
acking everything after the join has taken place also makes a lot of sense, and
I can work on that if you would like.
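The two ack placements discussed above can be sketched roughly as follows. This is a toy model in plain Java, not Storm or Metron code: `AckPlacementSketch`, its `execute` method, and the `acked` list are hypothetical names made up for illustration, and the stream names are whatever the caller passes in. It only records which tuples would be acked at each point; it does not model Storm's tuple tree, timeouts, or replay.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy model only: plain Java, no Storm or Metron classes (all names are hypothetical).
class AckPlacementSketch {
    // true  = ack each tuple as soon as it is cached
    // false = ack all cached tuples for a key once the join has taken place
    private final boolean ackOnCache;
    // Streams that must all arrive for a key before the join can happen.
    private final Set<String> requiredStreams;
    // Message key -> streams cached so far for that key.
    private final Map<String, List<String>> cache = new HashMap<>();
    // "key/stream" entries, in the order they would be acked.
    final List<String> acked = new ArrayList<>();

    AckPlacementSketch(boolean ackOnCache, String... requiredStreams) {
        this.ackOnCache = ackOnCache;
        this.requiredStreams = new HashSet<>(Arrays.asList(requiredStreams));
    }

    // Handles one incoming tuple; returns true when it completes the join for its key.
    boolean execute(String key, String stream) {
        List<String> pending = cache.computeIfAbsent(key, k -> new ArrayList<>());
        pending.add(stream);
        if (ackOnCache) {
            // Ack on cache insert: nothing else needs to remember this tuple for acking.
            acked.add(key + "/" + stream);
        }
        if (!pending.containsAll(requiredStreams)) {
            return false; // still waiting on at least one stream
        }
        if (!ackOnCache) {
            // Ack after the join: every cached tuple for the key is acked together.
            for (String s : pending) {
                acked.add(key + "/" + s);
            }
        }
        cache.remove(key);
        return true;
    }
}
```

With early acking, a tuple shows up in `acked` as soon as it is cached; with late acking, nothing is acked for a key until every required stream has arrived, which is why that variant has to keep track of the cached tuples until the join.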
We saw the duplicate data while running this in our development environment
in EC2. Perhaps this is due to different ack timeout settings in the Storm
topologies? We've reproduced the data duplication many times on our end.