Github user ptgoetz commented on the pull request:

    https://github.com/apache/storm/pull/644#issuecomment-124612279
  
    It took me some time to follow what's going on in this patch, so I'll 
document it here for the benefit of other reviewers.
    
    It operates by using the concept of a temporary "transaction file" that 
contains the last successfully written/committed file offset and the 
transaction ID. That file is initially read in the `prepare()` method call, and 
is initialized if it does not already exist. The POJO representing that file is 
stored in a `lastTxn` instance variable, which represents the last committed 
transaction.
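    The idea can be sketched roughly like this (class and field names here are 
illustrative, not the patch's actual identifiers):

    ```java
    import java.io.IOException;
    import java.io.UncheckedIOException;
    import java.nio.file.Files;
    import java.nio.file.Path;

    // Hypothetical POJO mirroring the "transaction file" described above.
    // It holds the last committed transaction ID and the data-file offset
    // at which that transaction's writes ended.
    class TxnRecord {
        final long txnId;
        final long offset;

        TxnRecord(long txnId, long offset) {
            this.txnId = txnId;
            this.offset = offset;
        }

        // Roughly what prepare() does: read the record back, or initialize
        // a fresh one if the transaction file does not exist yet.
        static TxnRecord readOrInit(Path txnFile) {
            try {
                if (!Files.exists(txnFile)) {
                    TxnRecord initial = new TxnRecord(0L, 0L);
                    initial.write(txnFile);
                    return initial;
                }
                String[] parts =
                    new String(Files.readAllBytes(txnFile)).trim().split(",");
                return new TxnRecord(Long.parseLong(parts[0]),
                                     Long.parseLong(parts[1]));
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }

        void write(Path txnFile) {
            try {
                Files.write(txnFile, (txnId + "," + offset).getBytes());
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
    }
    ```

    A plain-text local file is used here for readability; the patch itself 
works against HDFS.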
    
    In the `beginCommit()` method, it checks the incoming transaction ID 
against the one stored in `lastTxn`. If the incoming ID is less than or equal 
to the stored one (indicating a replay of a batch), it triggers recovery (more 
on that later); otherwise it continues normally, writing data in the 
`execute()` method.
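    The replay check boils down to a single comparison (a hypothetical 
sketch, not the patch's code):

    ```java
    // Sketch of the beginCommit() decision described above: compare the
    // incoming transaction ID with the last committed one.
    class ReplayCheck {
        // True when the batch is a replay, meaning recovery must run
        // before any new data is appended.
        static boolean isReplay(long incomingTxnId, long lastCommittedTxnId) {
            return incomingTxnId <= lastCommittedTxnId;
        }
    }
    ```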
    
    In the `commit()` method, it essentially rewrites the transaction file 
with the new transaction ID and the current file offset after the latest batch 
data was written. This offset serves as the truncation point for recovery.
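    In other words, something along these lines (a local-filesystem sketch 
with hypothetical names; the patch itself targets HDFS):

    ```java
    import java.io.IOException;
    import java.io.UncheckedIOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardCopyOption;

    // Sketch of the commit() step: record the new transaction ID together
    // with the data file's current length, which is the offset recovery
    // will preserve.
    class CommitStep {
        static void commit(Path txnFile, Path dataFile, long txnId) {
            try {
                // Everything before this point belongs to committed batches.
                long offset = Files.size(dataFile);
                Path tmp = txnFile.resolveSibling(txnFile.getFileName() + ".tmp");
                Files.write(tmp, (txnId + "," + offset).getBytes());
                // Write-then-rename so a crash never leaves a half-written
                // transaction file behind.
                Files.move(tmp, txnFile, StandardCopyOption.REPLACE_EXISTING);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
    }
    ```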
    
    **Recovery**
    When recovery is triggered, it is assumed that there was a failure during 
the `execute()` method, and the data file now contains data from tuples that 
will be replayed. If it were to simply append to that file, there would be 
duplicate records. So it effectively truncates the data file by creating a new 
file, copying everything from the old data file *up to the last successfully 
committed offset*, and discarding the remaining data.
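    The copy-and-truncate step looks roughly like this (a hypothetical 
local-filesystem sketch; the patch works against HDFS, where in-place 
truncation isn't available):

    ```java
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;
    import java.io.UncheckedIOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardCopyOption;

    // Sketch of the recovery path: copy the committed prefix of the old
    // data file into a fresh file, then swap the copy into place so the
    // uncommitted tail is discarded.
    class Recovery {
        static void truncateToOffset(Path dataFile, long committedOffset) {
            Path recovered =
                dataFile.resolveSibling(dataFile.getFileName() + ".recovered");
            try (InputStream in = Files.newInputStream(dataFile);
                 OutputStream out = Files.newOutputStream(recovered)) {
                byte[] buf = new byte[8192];
                long remaining = committedOffset;
                int n;
                // Stream only the committed prefix into the new file.
                while (remaining > 0
                       && (n = in.read(buf, 0,
                               (int) Math.min(buf.length, remaining))) != -1) {
                    out.write(buf, 0, n);
                    remaining -= n;
                }
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            try {
                // The uncommitted bytes are dropped along with the old file.
                Files.move(recovered, dataFile,
                           StandardCopyOption.REPLACE_EXISTING);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
    }
    ```

    Note that this copies the entire committed prefix, which is what the 
scaling concern below is about.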
    
    I think this is an interesting approach and is certainly more resilient 
than what we currently have, but I see a scaling problem in the recovery 
functionality. Recovery works by copying the old/orphaned data file into a new 
file, skipping any extra, uncommitted data it contains. This is fine for small 
files, but what about the case where the data file is very large and recovery 
takes longer than `topology.message.timeout.secs`? The replayed batch would 
time out mid-recovery and be replayed yet again, potentially never making 
progress.
    
    Another problem I see is the lack of support for time-based file rotation. 
I understand why it's not supported by this patch, but as I mentioned earlier, 
that's one of the most common use cases.
