andrewglowacki opened a new issue #949: Network issue during WAL creation results in a 'missing' WAL during recovery
URL: https://github.com/apache/accumulo/issues/949

As far as I can tell, this is what's happening:

When a new WAL is being created, after the header and OPEN mark are written, the new WAL marker is written to ZooKeeper for the master. If, due to a network interruption, the marker is written but the tserver is unaware of this, the tserver deletes the WAL from HDFS, leaving an orphaned entry in the metadata table. This prevents Accumulo from proceeding with ingest for the associated tablets without manual intervention, because it thinks a WAL is missing.

This was observed twice in the last three weeks on a moderately sized cluster.

Why is the WAL deleted by the tserver? Shouldn't the GC do this? Maybe it should only delete the WAL if it doesn't fail on the ZooKeeper step?

Note: this only seems to happen in rare circumstances, when the cluster is under heavy load.

Version 1.9.2
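
The suspected sequence can be modelled with a small in-memory sketch. It is only an illustration of the hypothesis in this report, not Accumulo code: the class, enum, and method names are all hypothetical, and two plain sets stand in for HDFS and for the WAL markers. It compares a cleanup that always deletes the file with one that skips deletion when the failure happened on the ZooKeeper step.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical model of the reported race; none of these names exist in Accumulo.
class WalCreationSketch {

    /** The step at which the tserver believes WAL creation failed. */
    enum FailedStep { HEADER_WRITE, ZOOKEEPER_MARKER }

    /** Stand-in for the WAL files present in HDFS. */
    final Set<String> hdfsFiles = new HashSet<>();
    /** Stand-in for the WAL markers that were actually recorded. */
    final Set<String> zkMarkers = new HashSet<>();

    /** Simulates a creation where the marker is stored but the acknowledgement is lost. */
    FailedStep createWalWithLostAck(String wal) {
        hdfsFiles.add(wal);                  // header and OPEN mark written
        zkMarkers.add(wal);                  // marker reaches ZooKeeper...
        return FailedStep.ZOOKEEPER_MARKER;  // ...but the tserver only sees a failure
    }

    /** Cleanup as described in the report: the file is deleted on any failure. */
    void cleanupAlwaysDelete(String wal, FailedStep failed) {
        hdfsFiles.remove(wal);
    }

    /** Suggested alternative: keep the file when the ZooKeeper step failed. */
    void cleanupUnlessZooKeeperFailed(String wal, FailedStep failed) {
        if (failed != FailedStep.ZOOKEEPER_MARKER) {
            hdfsFiles.remove(wal);
        }
    }

    /** A WAL is 'missing' when a marker references a file that is not in HDFS. */
    boolean hasMissingWal() {
        for (String wal : zkMarkers) {
            if (!hdfsFiles.contains(wal)) {
                return true;
            }
        }
        return false;
    }
}
```

In this model, calling `createWalWithLostAck` followed by `cleanupAlwaysDelete` leaves a marker with no file behind it, which is the 'missing' WAL state. With `cleanupUnlessZooKeeperFailed` the file survives, and removing an unused WAL would then be left to some other process, such as the GC.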
