Github user DomenicPuzio commented on the issue:

    https://github.com/apache/incubator-metron/pull/359
  
    @cestella, I made the change to the PR name; thanks for the tip there.
    
    I completely agree that we don't want to miss failures caused by an 
enrichment source (like MySQL) or by the enrichment infrastructure. However, 
once a tuple has been put into the JoinBolt's cache, isn't it already past the 
point where those failures could occur? If an enrichment fails, Storm will time 
out waiting for an ack that never arrives, so the message will be re-sent; but 
if the message is already in the cache, isn't its journey complete?
    
    My reasoning for placing the ack there was that (1) at this stage of the 
topology the tuple has been correctly processed, and (2) it is simpler, since 
we wouldn't have to modify `streamMessageMap`. I do agree that acking 
everything after the join has taken place also makes a lot of sense, and I can 
work on that if you would like.
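    
    To make the two ack placements concrete, here is a rough sketch. It is a 
toy model, not the actual JoinBolt: `AckPlacementSketch`, `PartialTuple`, and 
the `ackOnCache` flag are hypothetical names, and "acking" is only recorded in 
a list rather than sent through Storm's collector.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model (hypothetical, no Storm dependency) of a join that waits for a
// fixed number of partial tuples per message key before emitting.
class AckPlacementSketch {
    /** Hypothetical stand-in for a Storm tuple: message key plus source stream. */
    static final class PartialTuple {
        final String key;
        final String stream;
        PartialTuple(String key, String stream) {
            this.key = key;
            this.stream = stream;
        }
        @Override
        public String toString() {
            return key + "/" + stream;
        }
    }

    private final int expectedStreams;
    private final boolean ackOnCache;
    // key -> partial tuples received so far; holding the tuples themselves is
    // what makes a deferred ack possible.
    private final Map<String, List<PartialTuple>> cache = new HashMap<>();
    final List<String> acked = new ArrayList<>();   // record of simulated acks
    final List<String> joined = new ArrayList<>();  // keys whose join completed

    AckPlacementSketch(int expectedStreams, boolean ackOnCache) {
        this.expectedStreams = expectedStreams;
        this.ackOnCache = ackOnCache;
    }

    void execute(PartialTuple t) {
        List<PartialTuple> pending =
            cache.computeIfAbsent(t.key, k -> new ArrayList<>());
        pending.add(t);
        if (ackOnCache) {
            // Option 1: ack as soon as the partial is in the cache.
            acked.add(t.toString());
        }
        // Simplification: counts arrivals and does not de-duplicate streams.
        if (pending.size() == expectedStreams) {
            joined.add(t.key);
            if (!ackOnCache) {
                // Option 2: ack every partial only once the join has happened.
                for (PartialTuple p : pending) {
                    acked.add(p.toString());
                }
            }
            cache.remove(t.key);
        }
    }

    public static void main(String[] args) {
        for (boolean ackOnCache : new boolean[] {true, false}) {
            AckPlacementSketch bolt = new AckPlacementSketch(2, ackOnCache);
            bolt.execute(new PartialTuple("m1", "geo"));
            System.out.println("ackOnCache=" + ackOnCache
                + " after 1 of 2 partials: acked=" + bolt.acked);
            bolt.execute(new PartialTuple("m1", "host"));
            System.out.println("ackOnCache=" + ackOnCache
                + " after 2 of 2 partials: acked=" + bolt.acked
                + " joined=" + bolt.joined);
        }
    }
}
```

    In this sketch the second option has to keep the tuples themselves until 
the join completes, which is roughly the extra bookkeeping that acking after 
the join would require.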
    
    We saw the duplicate data while running this in our development environment 
on EC2. Perhaps this is due to different ack timeout settings in the Storm 
topologies? We've reproduced the duplication many times on our end.


