Bram Schuur created HBASE-29784:
-----------------------------------
Summary: DeleteFamilyVersion is not effectuated even though it is
committed to WAL
Key: HBASE-29784
URL: https://issues.apache.org/jira/browse/HBASE-29784
Project: HBase
Issue Type: Bug
Components: regionserver
Affects Versions: 2.6.3
Environment: JDK: 21.0.9
HBase: 2.6.3
Hadoop: 3.4.2
Arch: x86
OS: Containerized linux
Reporter: Bram Schuur
We run HBase 2.6.3 as a datastore and sometimes wipe data with
DeleteFamilyVersion. Intermittently and non-deterministically, HBase appears to
ignore a DeleteFamilyVersion that we issued for a row, so the data we meant to
erase becomes visible again.
We started capturing more extensive WAL logs for our regions. They show that the
DeleteFamilyVersion we issue is committed to the WAL, yet the data is still
visible through the API after a flush/compaction of the region. There are no
errors in the logs.
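The expected behavior can be modeled with a small sketch. This is a simplified illustration and not HBase's implementation: it only encodes the documented rule that a DeleteFamilyVersion marker hides every cell of the same row and family whose timestamp equals the marker's timestamp. It ignores sequence ids, MVCC, and compaction. All class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of DeleteFamilyVersion masking (hypothetical names, not HBase code).
final class FamilyVersionMaskSketch {
    enum Type { PUT, DELETE_FAMILY_VERSION }

    static final class Cell {
        final String row;
        final String family;
        final String qualifier;
        final long timestamp;
        final Type type;

        Cell(String row, String family, String qualifier, long timestamp, Type type) {
            this.row = row;
            this.family = family;
            this.qualifier = qualifier;
            this.timestamp = timestamp;
            this.type = type;
        }
    }

    // A Put is masked when a marker exists for the same row and family
    // with exactly the same timestamp.
    static boolean isMasked(Cell put, List<Cell> cells) {
        for (Cell c : cells) {
            if (c.type == Type.DELETE_FAMILY_VERSION
                    && c.row.equals(put.row)
                    && c.family.equals(put.family)
                    && c.timestamp == put.timestamp) {
                return true;
            }
        }
        return false;
    }

    // Returns the Put cells a reader would be expected to see.
    static List<Cell> visiblePuts(List<Cell> cells) {
        List<Cell> visible = new ArrayList<>();
        for (Cell c : cells) {
            if (c.type == Type.PUT && !isMasked(c, cells)) {
                visible.add(c);
            }
        }
        return visible;
    }

    public static void main(String[] args) {
        long ts = 1765693071241000000L; // timestamp taken from the trace
        List<Cell> cells = new ArrayList<>();
        cells.add(new Cell("r1", "cf", "name", ts, Type.PUT));
        cells.add(new Cell("r1", "cf", "tags", ts, Type.PUT));
        cells.add(new Cell("r1", "cf", "", ts, Type.DELETE_FAMILY_VERSION));
        System.out.println("visible puts: " + visiblePuts(cells).size());
    }
}
```

Under this model, every Put in the trace that shares the marker's row, family, and timestamp should be hidden, which is the opposite of what the API returns.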
Below is a snippet of the data we traced.
Data as queried from the HBase API:
{code}
\x00\x00\xC7\xEBs\xFB\xCA\xDA/cf:\x00/1765693071241000000/Put/vlen=1/seqid=0
\x00\x00\xC7\xEBs\xFB\xCA\xDA/cf:description/1765693071241000000/Put/vlen=4/seqid=0
\x00\x00\xC7\xEBs\xFB\xCA\xDA/cf:domainIdentifier/1765693071241000000/Put/vlen=60/seqid=0
\x00\x00\xC7\xEBs\xFB\xCA\xDA/cf:domainName/1765693071241000000/Put/vlen=8/seqid=0
\x00\x00\xC7\xEBs\xFB\xCA\xDA/cf:identifiers/1765693071241000000/Put/vlen=94/seqid=0
\x00\x00\xC7\xEBs\xFB\xCA\xDA/cf:lastUpdateTimestamp/1765693071241000000/Put/vlen=8/seqid=0
\x00\x00\xC7\xEBs\xFB\xCA\xDA/cf:layerIdentifier/1765693071241000000/Put/vlen=39/seqid=0
\x00\x00\xC7\xEBs\xFB\xCA\xDA/cf:layerName/1765693071241000000/Put/vlen=12/seqid=0
\x00\x00\xC7\xEBs\xFB\xCA\xDA/cf:name/1765693071241000000/Put/vlen=26/seqid=0
\x00\x00\xC7\xEBs\xFB\xCA\xDA/cf:order/1765693071241000000/Put/vlen=10/seqid=0
\x00\x00\xC7\xEBs\xFB\xCA\xDA/cf:properties/1765693071241000000/Put/vlen=208/seqid=0
\x00\x00\xC7\xEBs\xFB\xCA\xDA/cf:tags/1765693071241000000/Put/vlen=184/seqid=0
\x00\x00\xC7\xEBs\xFB\xCA\xDA/cf:typeIdentifier/1765693071241000000/Put/vlen=72/seqid=0
\x00\x00\xC7\xEBs\xFB\xCA\xDA/cf:typeName/1765693071241000000/Put/vlen=10/seqid=0
\x00\x00\xC7\xEBs\xFB\xCA\xDA/cf:~\x00/1765693071241000000/Put/vlen=11/seqid=0
\x00\x00\xC7\xEBs\xFB\xCA\xDA/cf:~e\x06SYNCED\x00\x00\xDA\xB7\xE2\xD3\x8FW/1765693071241000000/Put/vlen=16/seqid=0
{code}
Data in the captured WAL:
{code}
...
Sequence=10628094, table=sg__default__vertices,
region=834ed0ff02e8d7d42b88ad5666a4b1e8, at write timestamp=Sun Dec 14 06:17:51
UTC 2025
...
row=\x00\x00\xC7\xEBs\xFB\xCA\xDA, column=cf:~\x00,
timestamp=1765693071241000000, type=Put
value: \x03\x01Componen\xF4
cell total size sum: 96
row=\x00\x00\xC7\xEBs\xFB\xCA\xDA, column=cf:domainIdentifier,
timestamp=1765693071241000000, type=Put
value:
\x02\x03\x01urn:stackpack:stackstate-k8s-agent-v2:shared:domain:agen\xF4
cell total size sum: 160
row=\x00\x00\xC7\xEBs\xFB\xCA\xDA, column=cf:identifiers,
timestamp=1765693071241000000, type=Put
value:
\x02!\x01\x01\x03\x01\xD7\x01urn:process:/i-06fb48dc80ed9944b-preprod-dev.preprod.stackstate.io:68116:1765692998000
cell total size sum: 184
row=\x00\x00\xC7\xEBs\xFB\xCA\xDA, column=cf:domainName,
timestamp=1765693071241000000, type=Put
value: \x02\x03\x01Agen\xF4
cell total size sum: 96
row=\x00\x00\xC7\xEBs\xFB\xCA\xDA, column=cf:typeName,
timestamp=1765693071241000000, type=Put
value: \x02\x03\x01proces\xF3
cell total size sum: 96
row=\x00\x00\xC7\xEBs\xFB\xCA\xDA, column=cf:name,
timestamp=1765693071241000000, type=Put
value: \x02\x03\x01containerd-shim-runc-v\xB2
cell total size sum: 112
row=\x00\x00\xC7\xEBs\xFB\xCA\xDA, column=cf:description,
timestamp=1765693071241000000, type=Put
value: \x02\x03\x01\x81
cell total size sum: 96
row=\x00\x00\xC7\xEBs\xFB\xCA\xDA, column=cf:typeIdentifier,
timestamp=1765693071241000000, type=Put
value:
\x02\x03\x01\xC4\x01urn:stackpack:stackstate-k8s-agent-v2:shared:component-type:process
cell total size sum: 168
row=\x00\x00\xC7\xEBs\xFB\xCA\xDA, column=cf:layerIdentifier,
timestamp=1765693071241000000, type=Put
value: \x02\x03\x01urn:stackpack:common:layer:processe\xF3
cell total size sum: 136
row=\x00\x00\xC7\xEBs\xFB\xCA\xDA, column=cf:layerName,
timestamp=1765693071241000000, type=Put
value: \x02\x03\x01Processe\xF3
cell total size sum: 104
row=\x00\x00\xC7\xEBs\xFB\xCA\xDA, column=cf:properties,
timestamp=1765693071241000000, type=Put
value: \x02
\x01\x04\x03\x01hos\xF4\x03\x01i-06fb48dc80ed9944b-preprod-dev.preprod.stackstate.i\xEF\x03\x01external_i\xE4\x03\x01\xD7\x01urn:process:/i-06fb48dc80ed9944b-preprod-dev.preprod.stackstate.io:68116:1765692998000\x03\x01pi\xE4\x03\x016811\xB6\x03\x01create_tim\xE5\x03\x01176569299800\xB0
cell total size sum: 296
row=\x00\x00\xC7\xEBs\xFB\xCA\xDA, column=cf:tags,
timestamp=1765693071241000000, type=Put
value:
\x02!\x01\x07\x03\x01host:i-06fb48dc80ed9944b-preprod-dev.preprod.stackstate.i\xEF\x03\x01stackpack:agen\xF4\x03\x01pid:6811\xB6\x03\x01user:roo\xF4\x03\x01os:linu\xF8\x03\x01command:/usr/bin/containerd-shim-runc-v\xB2\x03\x01process_category:executabl\xE5
cell total size sum: 272
row=\x00\x00\xC7\xEBs\xFB\xCA\xDA, column=cf:order,
timestamp=1765693071241000000, type=Put
value: \x02\x0A\x00\x00\x00\x00\x00\x00\x00\x00
cell total size sum: 96
row=\x00\x00\xC7\xEBs\xFB\xCA\xDA, column=cf:lastUpdateTimestamp,
timestamp=1765693071241000000, type=Put
value: \x02\x09\x92\xFE\x90\xB8\xE3f
cell total size sum: 112
row=\x00\x00\xC7\xEBs\xFB\xCA\xDA,
column=cf:~e\x06SYNCED\x00\x00\xDA\xB7\xE2\xD3\x8FW,
timestamp=1765693071241000000, type=Put
value: \x01\x06Synced\x00\x00x\x87\xBA\xDE\xF7\xFE
cell total size sum: 112
row=\x00\x00\xC7\xEBs\xFB\xCA\xDA, column=cf:\x00,
timestamp=1765693071241000000, type=Put
value: \x01
cell total size sum: 80
...
position: 1481623
...
Sequence=10628100, table=sg__default__vertices,
region=834ed0ff02e8d7d42b88ad5666a4b1e8, at write timestamp=Sun Dec 14 06:17:51
UTC 2025
...
row=\x00\x00\xC7\xEBs\xFB\xCA\xDA, column=cf:, timestamp=1765693071241000000,
type=DeleteFamilyVersion
value:
cell total size sum: 80
...
position: 1531651
{code}
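One detail worth checking in the trace above: the cell timestamps (1765693071241000000) look like nanoseconds since the Unix epoch rather than the milliseconds HBase assigns by default. The sketch below converts the value under that assumption so it can be compared with the WAL write time. The class and method names are made up for illustration.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

// Hypothetical helper for interpreting the cell timestamps in the trace.
final class CellTimestampCheck {
    private static final long NANOS_PER_SECOND = 1_000_000_000L;

    // Assumption: the cell timestamp is nanoseconds since the Unix epoch.
    static Instant fromEpochNanos(long nanos) {
        return Instant.ofEpochSecond(
                Math.floorDiv(nanos, NANOS_PER_SECOND),
                Math.floorMod(nanos, NANOS_PER_SECOND));
    }

    // Formats an instant in the same layout as the WAL dump's write timestamp.
    static String formatLikeWalDump(Instant instant) {
        DateTimeFormatter formatter = DateTimeFormatter
                .ofPattern("EEE MMM dd HH:mm:ss 'UTC' yyyy", Locale.ENGLISH)
                .withZone(ZoneOffset.UTC);
        return formatter.format(instant);
    }

    public static void main(String[] args) {
        long cellTimestamp = 1765693071241000000L; // value from the trace
        Instant instant = fromEpochNanos(cellTimestamp);
        System.out.println(instant);
        System.out.println(formatLikeWalDump(instant));
    }
}
```

Under the nanosecond assumption the value converts to 2025-12-14T06:17:51.241Z, which falls in the same second as the "Sun Dec 14 06:17:51 UTC 2025" write time of both WAL entries.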
What could be the cause? I checked the bug tracker but found nothing
resembling our symptoms.