Hey all,

We've been working on integrating backup/restore into our stack. We have
some user tables which overwrite cells, meaning they write the same
row/cf/qf/timestamp but with different values. Normally HBase handles
deduping those and returns the most recently written value. This works
because CellComparator uses the sequenceId (carried on cells in the
memstore) as a tiebreaker when everything else about two cells is equal.
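
To make the tiebreak concrete, here is a toy sketch. It is not HBase code:
ToyCell and both compare methods are made up for illustration, and the
real comparator also looks at the cell type before the sequenceId, which
is left out here. It only shows why two cells with identical coordinates
need the sequenceId to get a deterministic order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class SeqIdTiebreakSketch {
    // Hypothetical stand-in for a cell; NOT the real HBase Cell interface.
    static final class ToyCell {
        final String row, family, qualifier, value;
        final long timestamp, sequenceId;

        ToyCell(String row, String family, String qualifier,
                long timestamp, long sequenceId, String value) {
            this.row = row;
            this.family = family;
            this.qualifier = qualifier;
            this.timestamp = timestamp;
            this.sequenceId = sequenceId;
            this.value = value;
        }
    }

    // Orders by row, family, qualifier, then newest timestamp first.
    static int compareCoordinates(ToyCell a, ToyCell b) {
        int c = a.row.compareTo(b.row);
        if (c != 0) return c;
        c = a.family.compareTo(b.family);
        if (c != 0) return c;
        c = a.qualifier.compareTo(b.qualifier);
        if (c != 0) return c;
        return Long.compare(b.timestamp, a.timestamp);
    }

    // Same as above, but breaks ties with the highest sequenceId first.
    static int compareWithSeqId(ToyCell a, ToyCell b) {
        int c = compareCoordinates(a, b);
        if (c != 0) return c;
        return Long.compare(b.sequenceId, a.sequenceId);
    }

    // Sorts a copy of the input and returns the cell a reader would see first.
    static ToyCell firstOf(List<ToyCell> cells, Comparator<ToyCell> cmp) {
        List<ToyCell> sorted = new ArrayList<>(cells);
        sorted.sort(cmp);
        return sorted.get(0);
    }

    public static void main(String[] args) {
        ToyCell older = new ToyCell("r1", "cf", "q", 100L, 1L, "old");
        ToyCell newer = new ToyCell("r1", "cf", "q", 100L, 2L, "new");

        // Without the sequenceId the winner depends on arrival order.
        System.out.println(firstOf(Arrays.asList(older, newer),
            SeqIdTiebreakSketch::compareCoordinates).value);
        System.out.println(firstOf(Arrays.asList(newer, older),
            SeqIdTiebreakSketch::compareCoordinates).value);

        // With the sequenceId the later write wins in both cases.
        System.out.println(firstOf(Arrays.asList(older, newer),
            SeqIdTiebreakSketch::compareWithSeqId).value);
        System.out.println(firstOf(Arrays.asList(newer, older),
            SeqIdTiebreakSketch::compareWithSeqId).value);
    }
}
```

In the toy, the coordinates-only comparator leaves the two cells tied, so
whichever arrives first stays first. That matches the arrival-order
dependence we see on restore.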

We noticed when trying to do an incremental restore (which uses WALPlayer)
of one of these tables, we'd non-deterministically get different values
returned for these cells... often not the latest. I believe this is because
we lose the sequenceId context in WALPlayer.

Our WAL encoding drops sequenceIds from the individual cells, but each WAL
entry still carries one sequenceId that applies to every cell in its
WALEdit. I think we could update WALPlayer (which reads the WALKey and
WALEdit of each entry) to pull that sequenceId from the entry and inject
it into each cell that gets written to the context.
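
Roughly what I have in mind, as a toy sketch rather than a patch.
ToyWalEntry, ToyCell and stamp() are hypothetical names, not HBase
classes. In HBase itself I'd expect the equivalent step to go through
something like PrivateCellUtil.setSequenceId, but I haven't verified that
against the WALPlayer mapper code, so treat it as an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class WalSeqIdStampSketch {
    // Hypothetical cell as decoded from the WAL: sequenceId comes back as 0.
    static final class ToyCell {
        final String key, value;
        final long timestamp, sequenceId;

        ToyCell(String key, long timestamp, long sequenceId, String value) {
            this.key = key;
            this.timestamp = timestamp;
            this.sequenceId = sequenceId;
            this.value = value;
        }

        // Returns a copy of this cell carrying the given sequenceId.
        ToyCell withSequenceId(long seqId) {
            return new ToyCell(key, timestamp, seqId, value);
        }
    }

    // Hypothetical WAL entry: one sequenceId shared by all of its cells.
    static final class ToyWalEntry {
        final long sequenceId;
        final List<ToyCell> cells;

        ToyWalEntry(long sequenceId, List<ToyCell> cells) {
            this.sequenceId = sequenceId;
            this.cells = cells;
        }
    }

    // Copies the entry-level sequenceId onto every cell before it is emitted.
    static List<ToyCell> stamp(ToyWalEntry entry) {
        List<ToyCell> out = new ArrayList<>();
        for (ToyCell cell : entry.cells) {
            out.add(cell.withSequenceId(entry.sequenceId));
        }
        return out;
    }

    public static void main(String[] args) {
        ToyWalEntry entry = new ToyWalEntry(42L, Arrays.asList(
            new ToyCell("r1/cf/q", 100L, 0L, "a"),
            new ToyCell("r2/cf/q", 100L, 0L, "b")));
        for (ToyCell cell : stamp(entry)) {
            System.out.println(cell.key + " seqId=" + cell.sequenceId);
        }
    }
}
```

Once every emitted cell carries its entry's sequenceId, two writes to the
same row/cf/qf/timestamp from different entries are no longer tied.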

The next step would be to update CellSerialization so the sequenceId
survives serialization as well. At that point our existing CellSortReducer
would sort on sequenceId when timestamps are equal, and the HFiles written
by WALPlayer would more accurately reflect what a normal HBase write would
produce. The sequenceIds would eventually be pruned out by compactions, as
they usually are.

Any concerns with this approach?

See Jira: https://issues.apache.org/jira/browse/HBASE-27649
