Hey all,
We've been working on integrating backup/restore into our stack. Some of our user tables overwrite cells, meaning they write the same row/cf/qf/timestamp more than once with different values. Normally HBase dedupes these and returns the most recently written value. This works because CellComparator uses the sequenceId carried by cells in the memstore as a tiebreaker.
When we tried an incremental restore (which uses WALPlayer) of one of these tables, we got non-deterministic values back for these cells, often not the latest one. I believe this is because we lose the sequenceId context in WALPlayer. Our WAL encoding drops sequenceIds from the individual cells, but stashes the same sequenceId in each WALEdit.
I think we could update WALPlayer (which reads WALEdit and WALEntry) to pull the sequenceId from the WALEdit and inject it into each cell that gets written to the context. The next step would be to update CellSerialization so that it passes the sequenceId along as well. At that point our existing CellSortReducer would sort by sequenceId when timestamps are equal, and the HFiles written by WALPlayer would more accurately reflect what a normal HBase write would produce. The sequenceIds would eventually be pruned out by compactions, as they usually are.
Any concerns with this approach? See the JIRA: https://issues.apache.org/jira/browse/HBASE-27649
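
To make the tiebreak concrete, here is a minimal standalone sketch of the idea. It does not use the real HBase classes: SimpleCell, WalEntry, compare and latestValue are hypothetical stand-ins I made up for illustration, and the ordering only approximates what CellComparator does (coordinates ascending, newer timestamp first, then higher sequenceId first). It assumes cells come out of the WAL with sequenceId 0 and that the entry-level sequenceId is available to copy onto them.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only: SimpleCell and WalEntry are hypothetical stand-ins,
// not the real HBase Cell / WAL classes.
class SeqIdTiebreakSketch {

    static final class SimpleCell {
        final String row, family, qualifier, value;
        final long timestamp;
        final long sequenceId; // 0 models "dropped by the WAL encoding"

        SimpleCell(String row, String family, String qualifier,
                   long timestamp, long sequenceId, String value) {
            this.row = row;
            this.family = family;
            this.qualifier = qualifier;
            this.timestamp = timestamp;
            this.sequenceId = sequenceId;
            this.value = value;
        }

        // Returns a copy of this cell carrying the given sequenceId.
        SimpleCell withSequenceId(long seqId) {
            return new SimpleCell(row, family, qualifier, timestamp, seqId, value);
        }
    }

    // One replayed WAL entry: a single sequenceId shared by all of its cells.
    static final class WalEntry {
        final long sequenceId;
        final List<SimpleCell> cells;

        WalEntry(long sequenceId, SimpleCell... cells) {
            this.sequenceId = sequenceId;
            this.cells = Arrays.asList(cells);
        }
    }

    // Approximation of the ordering described above: coordinates ascending,
    // newer timestamp first, then higher sequenceId first as the tiebreaker.
    static int compare(SimpleCell a, SimpleCell b) {
        int c = a.row.compareTo(b.row);
        if (c != 0) return c;
        c = a.family.compareTo(b.family);
        if (c != 0) return c;
        c = a.qualifier.compareTo(b.qualifier);
        if (c != 0) return c;
        c = Long.compare(b.timestamp, a.timestamp);
        if (c != 0) return c;
        return Long.compare(b.sequenceId, a.sequenceId);
    }

    // Flattens the entries into cells, optionally copying each entry's
    // sequenceId onto its cells, sorts them, and returns the winning value.
    static String latestValue(List<WalEntry> entries, boolean injectSeqId) {
        List<SimpleCell> out = new ArrayList<>();
        for (WalEntry entry : entries) {
            for (SimpleCell cell : entry.cells) {
                out.add(injectSeqId ? cell.withSequenceId(entry.sequenceId) : cell);
            }
        }
        out.sort(SeqIdTiebreakSketch::compare); // stable: ties keep arrival order
        return out.get(0).value;
    }

    // Three writes to the same row/cf/qf/timestamp, arriving out of order.
    static List<WalEntry> demoEntries() {
        return Arrays.asList(
            new WalEntry(2, new SimpleCell("r1", "cf", "q", 100L, 0L, "v2")),
            new WalEntry(3, new SimpleCell("r1", "cf", "q", 100L, 0L, "v3")),
            new WalEntry(1, new SimpleCell("r1", "cf", "q", 100L, 0L, "v1")));
    }

    public static void main(String[] args) {
        System.out.println(latestValue(demoEntries(), false)); // prints "v2": arrival order decides
        System.out.println(latestValue(demoEntries(), true));  // prints "v3": highest sequenceId wins
    }
}
```

Without the injected sequenceId every cell compares equal, so the winner depends on whatever order the cells happen to arrive in, which matches the non-determinism we saw. With it, the most recent write wins regardless of arrival order.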
