ctubbsii commented on issue #535: WAL recovery enhancements and tooling
URL: https://github.com/apache/accumulo/issues/535#issuecomment-398871117
 
 
   I spoke to @keith-turner at length about this, and we (mostly Keith) came to 
the conclusion that these two errors *might* occur if you stop writing to one 
tablet but continue writing mutations to another, and the logs roll over. In 
both cases, the exception could be a false positive, thrown when there is 
no data to recover for that tablet, because the WAL containing the 
`COMPACTION_START` event could have been garbage collected.
   
   Worse, this scenario may not be tested for, because our continuous ingest 
tests don't ever stop writing to a tablet.
   
   The workaround would be to inspect the WALs, verify that there is no data 
for the tablet which produced the exception during recovery, remove the 
entries in the affected tablet, and repeat for each affected tablet. This is 
not ideal, but if somebody can verify that this is what is happening (it's 
still just speculation right now), we could proceed with a fix for 1.9.2. The 
good news is that, if this is what is happening, there shouldn't be any data 
loss: the error is raised even though there is no data that needs to be 
recovered.
   
   Some possible fixes we discussed, if the issue can be verified:
   
   1. Check that there are no data events in the WALs for that tablet, before 
throwing the exception.
   2. Don't mark a WAL inactive prematurely, even if it has only a 
`COMPACTION_START` event with no data.
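
   To make fix 1 concrete, here is a rough sketch of the idea, not the actual 
recovery code. `WalRecoveryCheck`, `Event`, `Entry`, `hasDataEvents`, and 
`throwIfDataPresent` are all hypothetical names invented for illustration, and 
the model assumes that a "data event" simply means a mutation logged for the 
tablet:

```java
import java.util.List;

/** Hypothetical, simplified model of WAL entries; not Accumulo's real classes. */
final class WalRecoveryCheck {

  /** Event kinds, loosely modeled on the events named in this discussion. */
  enum Event { COMPACTION_START, COMPACTION_FINISH, MUTATION }

  /** One log entry: the event kind and the id of the tablet it belongs to. */
  static final class Entry {
    final Event event;
    final int tabletId;

    Entry(Event event, int tabletId) {
      this.event = event;
      this.tabletId = tabletId;
    }
  }

  /** Returns true if any entry for the given tablet carries data (a mutation). */
  static boolean hasDataEvents(List<Entry> entries, int tabletId) {
    for (Entry e : entries) {
      if (e.tabletId == tabletId && e.event == Event.MUTATION) {
        return true;
      }
    }
    return false;
  }

  /**
   * Called where recovery would otherwise throw unconditionally.
   * Throws only if the tablet actually has data that could be lost;
   * with no data events there is nothing to recover, so it returns quietly.
   */
  static void throwIfDataPresent(List<Entry> entries, int tabletId, String problem) {
    if (hasDataEvents(entries, tabletId)) {
      throw new IllegalStateException(problem + " (tablet " + tabletId + ")");
    }
  }
}
```

   With something like this, a tablet whose remaining WAL entries contain no 
mutations would be treated as having nothing to recover instead of failing 
recovery. Whether that is safe in every case is exactly what still needs to be 
verified.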
   
   More investigation is needed to verify the problem, and possible fixes, 
though.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
