[GitHub] andrewglowacki opened a new issue #949: Network issue during WAL creation results in a 'missing' WAL during recovery

GitBox Fri, 08 Feb 2019 21:00:14 -0800

andrewglowacki opened a new issue #949: Network issue during WAL creation 
results in a 'missing' WAL during recovery
URL: https://github.com/apache/accumulo/issues/949
 
 
   As far as I can tell this is what's happening...
   
   When a new WAL is created, the tserver writes the header and OPEN mark and 
then writes the new WAL marker to ZooKeeper for the master. If a network 
interruption occurs after the marker has been written, so that the tserver is 
unaware the write succeeded, the tserver deletes the WAL from HDFS, leaving an 
orphaned entry in the metadata table. Accumulo then cannot proceed with ingest 
for the associated tablets without manual intervention, because it thinks it is 
missing a WAL.
   
   This was observed twice in the last three weeks on a moderately sized 
cluster. Why is the WAL deleted by the tserver? Shouldn't the GC do this? Maybe 
the tserver should only delete the WAL if the failure was not in the ZooKeeper 
step?
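
   To make the suspected sequence concrete, here is a minimal in-memory sketch 
of it. This is not Accumulo code: the class, the methods, the `AckLostException` 
type, and the two sets standing in for HDFS and the marker store are all 
hypothetical, and the real tserver logic may differ. It only illustrates how 
deleting the file after an ambiguous marker write can leave a marker with no 
file behind it, and how leaving the file in place (for the GC to clean up 
later) avoids that.

```java
import java.util.HashSet;
import java.util.Set;

public class WalCreationSketch {
    // Hypothetical in-memory stand-ins for HDFS and the WAL marker store.
    static final Set<String> hdfsFiles = new HashSet<>();
    static final Set<String> markers = new HashSet<>();

    /** Hypothetical: thrown when the client cannot tell whether a write was applied. */
    static class AckLostException extends Exception {}

    // Simulates a marker write that is applied, but whose acknowledgement
    // may be lost on the way back to the caller.
    static void writeMarker(String wal, boolean loseAck) throws AckLostException {
        markers.add(wal);
        if (loseAck) {
            throw new AckLostException();
        }
    }

    // Behavior as described in the report: any failure deletes the WAL file.
    static void createWalDeleteOnFailure(String wal, boolean loseAck) {
        hdfsFiles.add(wal); // header and OPEN mark written
        try {
            writeMarker(wal, loseAck);
        } catch (AckLostException e) {
            hdfsFiles.remove(wal); // the marker may still exist, so it is now orphaned
        }
    }

    // Suggested alternative: on an ambiguous marker failure, keep the file
    // and leave cleanup to a later garbage-collection pass.
    static void createWalKeepOnAmbiguousFailure(String wal, boolean loseAck) {
        hdfsFiles.add(wal);
        try {
            writeMarker(wal, loseAck);
        } catch (AckLostException e) {
            // Outcome unknown: do not delete the file here.
        }
    }

    // A WAL is "missing" when a marker references a file that no longer exists.
    static boolean isOrphaned(String wal) {
        return markers.contains(wal) && !hdfsFiles.contains(wal);
    }

    public static void main(String[] args) {
        createWalDeleteOnFailure("wal-1", true);
        createWalKeepOnAmbiguousFailure("wal-2", true);
        System.out.println("wal-1 orphaned: " + isOrphaned("wal-1")); // true
        System.out.println("wal-2 orphaned: " + isOrphaned("wal-2")); // false
    }
}
```

   In the sketch, the second variant trades an orphaned marker for a possibly 
unused file, which is the less harmful leftover if something like the GC can 
remove unreferenced WALs later.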
   
   Note: this only seems to happen in rare circumstances when the cluster is 
under heavy load.
   
   Version 1.9.2


