Re: [PR] PHOENIX-7800 : Add Phoenix 5.3.1 release and features [phoenix-site]

via GitHub Wed, 20 May 2026 00:28:17 -0700


rahulLiving commented on code in PR #21:
URL: https://github.com/apache/phoenix-site/pull/21#discussion_r3271970592



##########
app/pages/_docs/docs/_mdx/(multi-page)/features/phoenix-sync-table.mdx:
##########
@@ -0,0 +1,159 @@
+---
+title: "PhoenixSyncTable Tool"
+description: "Detect data divergence between a source and a target Phoenix table across two HBase clusters via a chunked hash comparison driven by MapReduce."
+---
+
+`PhoenixSyncTableTool` is a MapReduce-based divergence detector for Phoenix
+tables that are replicated (or migrated) between two HBase clusters. It
+compares chunks of source and target data without transferring full rows over
+the network and records any chunk whose hashes disagree to a Phoenix system
+table for later inspection. Available in Phoenix 5.3.1
+([PHOENIX-7751](https://issues.apache.org/jira/browse/PHOENIX-7751)).
+
+The tool is conceptually similar to HBase's `HashTable`/`SyncTable` pair but
+is Phoenix-aware (respects TTL, `CURRENT_SCN`, tenant id, indexes, and the
+column-encoding scheme) and runs as a **single** MapReduce job with no HDFS
+intermediate. Output is a Phoenix table, queryable with SQL.
+
+`PhoenixSyncTableTool` performs **detection only** in 5.3.1; it does not
+modify the target cluster.

Review Comment:
   I was thinking we could highlight the `coalescing` and `checkpoint` features:
   
   ```
     `PhoenixSyncTableTool` is a MapReduce-based divergence detector for Phoenix
     tables that are replicated (or migrated) between two HBase clusters. For each
     region-aligned chunk it computes an SHA-256 hash on both clusters server-side
     and compares only the hashes — full rows never leave their cluster. Chunks
     whose hashes disagree are checkpointed to a Phoenix output table
     (`PHOENIX_SYNC_TABLE_CHECKPOINT`) for later inspection. Available in Phoenix
     5.3.1 ([PHOENIX-7751](https://issues.apache.org/jira/browse/PHOENIX-7751)).
   
     The tool is conceptually similar to HBase's `HashTable`/`SyncTable` pair but
     is Phoenix-aware (honors tenant id, indexes, the column-encoding scheme, and
     a bounded time range via `--from-time`/`--to-time`) and runs as a **single**
     MapReduce job, writing results directly to a Phoenix table instead of staging
     hashes in HDFS between two jobs. The output table is queryable with SQL.
   
     Two operational properties differentiate it further from `HashTable`/`SyncTable`:
   
     - **Resumable via checkpointing.** Both mapper-region completion and
       per-chunk progress are persisted to the checkpoint table during the run.
       On a failure or re-run with the same `(table, target cluster, from-time,
       to-time)` window, completed mapper regions are filtered out of the input
       splits and finished chunks are skipped — no need to redo verified work.
     - **Optional split coalescing (`--coalesce-split`).** When enabled, adjacent
       region splits co-located on the same RegionServer are grouped into a
       single mapper, reducing mapper count (and target-cluster RPC fan-out) on
       tables with many small regions. Off by default; enable for wide tables
       where per-mapper overhead dominates.
   
     `PhoenixSyncTableTool` performs **detection only** in 5.3.1; it does not
     modify the target cluster.
   ```
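
For readers following this thread, the chunk-hash comparison and checkpoint-based resume described in the suggestion above can be illustrated with a minimal in-memory sketch. This is not the tool's implementation: `chunk_hash`, `compare_chunks`, the dict-based "tables", and the dict-based checkpoint are all hypothetical stand-ins, and the sketch assumes that only chunks already recorded as `VERIFIED` are skipped on a re-run.

```python
import hashlib


def chunk_hash(rows):
    """Return a SHA-256 hex digest over the (key, value) pairs of one chunk.

    Rows are sorted by key and each field is length-prefixed so that
    different row sets cannot collapse to the same byte stream.
    """
    digest = hashlib.sha256()
    for key, value in sorted(rows):
        digest.update(len(key).to_bytes(4, "big"))
        digest.update(key)
        digest.update(len(value).to_bytes(4, "big"))
        digest.update(value)
    return digest.hexdigest()


def compare_chunks(source, target, chunk_bounds, checkpoint):
    """Compare source and target chunk by chunk, recording status per chunk.

    source, target: dicts mapping row key (bytes) to value (bytes); these
        stand in for the two clusters' tables (hypothetical).
    chunk_bounds: list of (start_key, end_key) ranges, end exclusive.
    checkpoint: dict mapping a range to "VERIFIED" or "MISMATCHED"; it is
        updated in place and plays the role of the checkpoint table.
    Returns the list of ranges currently marked MISMATCHED.
    """
    for start, end in chunk_bounds:
        # Resume: skip work that an earlier run already verified (assumption).
        if checkpoint.get((start, end)) == "VERIFIED":
            continue
        src_rows = [(k, v) for k, v in source.items() if start <= k < end]
        tgt_rows = [(k, v) for k, v in target.items() if start <= k < end]
        # Only the two digests are compared, never the rows themselves.
        same = chunk_hash(src_rows) == chunk_hash(tgt_rows)
        checkpoint[(start, end)] = "VERIFIED" if same else "MISMATCHED"
    return [rng for rng, status in checkpoint.items() if status == "MISMATCHED"]
```

Calling `compare_chunks` a second time with the same `checkpoint` dict re-examines only the ranges that were not previously verified, which is the resume behaviour the suggested wording describes.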



##########
app/pages/_docs/docs/_mdx/(multi-page)/features/phoenix-sync-table.mdx:
##########
@@ -0,0 +1,159 @@
+## When to use it [#sync-table-when]
+
+Reach for `PhoenixSyncTableTool` to verify:
+
+- A cluster migration that used HBase snapshots, replication, or both — to
+  confirm the target is byte-for-byte identical after cutover.
+- Long-running HBase replication — to detect cases where a replication peer
+  has silently drifted.
+- DR drills — to confirm the standby is in sync before a planned failover.
+
+For ad-hoc row-count or row-key spot-checks you usually want a small SQL
+query instead; `PhoenixSyncTableTool` is the right choice when you need
+**full-data** confidence with bounded network cost.
+
+## Running the tool [#sync-table-running]
+
+The tool runs through `hbase` (or `hadoop jar`) and takes only two mandatory
+flags — the source table name and the target cluster's ZooKeeper quorum.
+
+```bash
+hbase org.apache.phoenix.mapreduce.PhoenixSyncTableTool \
+  --table-name MY_SCHEMA.MY_TABLE \
+  --target-cluster zk1,zk2,zk3:2181:/hbase \
+  --run-foreground
+```
+
+The source cluster comes from the Hadoop/HBase configuration the job is
+submitted under, so `--target-cluster` is the ZooKeeper quorum of the
+**other** cluster. Accepted quorum formats:
+
+- `host:port:/znode`
+- `h1,h2:port:/znode`
+- `h1:p1,h2:p2:/znode`
+
+### Flags
+
+| Short     | Long                  | Required | Default              | Purpose |
+| --------- | --------------------- | :------: | -------------------- | ------- |
+| `-tn`     | `--table-name`        |   yes    | —                    | Source table (physical name; index physical names are also accepted). |
+| `-tc`     | `--target-cluster`    |   yes    | —                    | ZK quorum of the target cluster. |
+| `-s`      | `--schema`            |    no    | —                    | Phoenix schema name. |
+| `-tenant` | `--tenant-id`         |    no    | —                    | Tenant id for tenant-specific sync. |
+| `-ft`     | `--from-time`         |    no    | `0`                  | Lower bound of the cell-timestamp window, in ms. |
+| `-tt`     | `--to-time`           |    no    | `now - 1 hour`       | Upper bound; also used as `CURRENT_SCN`. The 1-hour buffer gives async replication time to catch up. |
+| `-cs`     | `--chunk-size`        |    no    | `1073741824` (1 GiB) | Approximate chunk size in bytes. Smaller chunks narrow the divergence search radius at the cost of more checkpoint rows. |
+| `-rs`     | `--raw-scan`          |    no    | `false`              | Include delete markers. |
+| `-rav`    | `--read-all-versions` |    no    | `false`              | Compare every cell version, not just the latest. |
+| `-coal`   | `--coalesce-split`    |    no    | `false`              | Coalesce multiple source regions into one mapper. |
+| `-runfg`  | `--run-foreground`    |    no    | `false`              | Block until the job completes (default is fire-and-forget submit). |
+| `-dr`     | `--dry-run`           |    no    | `false`              | Marker only — reserved for a future auto-repair extension. |
+| `-h`      | `--help`              |    no    | —                    | Print help and exit. |
+
+The mapper count is implicitly the number of source-table regions (one
+mapper per region) unless `--coalesce-split` is set.
+
+## Output [#sync-table-output]
+
+### MapReduce counters
+
+When `--run-foreground` is set, the tool logs counters from the
+`PhoenixSyncTableMapper$SyncCounters` group:
+
+- `MAPPERS_VERIFIED`, `MAPPERS_MISMATCHED`
+- `CHUNKS_VERIFIED`, `CHUNKS_MISMATCHED`
+- `SOURCE_ROWS_PROCESSED`, `TARGET_ROWS_PROCESSED`
+
+### `PHOENIX_SYNC_TABLE_CHECKPOINT`
+
+The tool auto-creates a Phoenix system table on the **source** cluster (90-day

Review Comment:
   nit: the checkpoint table is not a system table.



##########
app/pages/_docs/docs/_mdx/(multi-page)/features/phoenix-sync-table.mdx:
##########
@@ -0,0 +1,159 @@
+### `PHOENIX_SYNC_TABLE_CHECKPOINT`
+
+The tool auto-creates a Phoenix system table on the **source** cluster (90-day
+TTL, Snappy compression) with one row per chunk and per region. To list
+divergences from the last run:
+
+```sql
+SELECT START_ROW_KEY, END_ROW_KEY, COUNTERS, EXECUTION_END_TIME
+FROM   PHOENIX_SYNC_TABLE_CHECKPOINT
+WHERE  TABLE_NAME = 'MY_TABLE'
+  AND  TARGET_CLUSTER = 'zk1,zk2,zk3:2181:/hbase'
+  AND  TYPE = 'CHUNK'
+  AND  STATUS = 'MISMATCHED';
+```
+
+Each row carries `STATUS` (`VERIFIED` or `MISMATCHED`), `TYPE` (`CHUNK` or
+`REGION`), the key range, and a comma-separated `COUNTERS` string with
+per-chunk source and target row counts.
+
+### Resumability
+
+A re-run of the same `(table, target, from-time, to-time, tenant)` tuple
+picks up where the previous run left off — already-verified sub-ranges are
+skipped.
+
+## Prerequisites [#sync-table-prereqs]
+
+- **Cross-cluster line of sight.** Mapper YARN nodes need ZooKeeper and RPC
+  reachability to **both** clusters' RegionServers.
+- **Both clusters must run Phoenix 5.3.1+.**
+- **Live read, not snapshot-based.** Both clusters are scanned through the
+  regular Phoenix read path.
+- **Kerberos** delegation tokens for the target cluster are acquired
+  automatically when security is enabled.
+- The submitter principal needs `READ` on the physical HBase tables on both
+  clusters, plus `WRITE` to `PHOENIX_SYNC_TABLE_CHECKPOINT` on the source.
+- Views and logical (not physical) index names are rejected. Pass the
+  physical index table name to validate an index.
+
+## Tuning [#sync-table-tuning]
+
+`--chunk-size` is the main lever:
+
+- Larger chunks (e.g. 4 GiB) reduce checkpoint rows and per-chunk overhead
+  but make every mismatch report a coarser range.
+- Smaller chunks (e.g. 64 MiB) narrow the mismatch search radius and produce
+  more checkpoint rows.
+
+The tool runs at long-scan timescales. Adjust these client-side timeouts
+(set in the Hadoop `Configuration` the job is submitted with) if you see
+scanner timeouts on very large regions:
+
+| Property                                    | Default      |
+| ------------------------------------------- | ------------ |
+| `phoenix.sync.table.query.timeout`          | ~150 minutes |
+| `phoenix.sync.table.rpc.timeout`            | 30 minutes   |
+| `phoenix.sync.table.client.scanner.timeout` | 30 minutes   |
+| `phoenix.sync.table.rpc.retries.counter`    | 5            |
+
+## Limitations [#sync-table-limitations]
+
+- **Detection only.** Mismatched chunks are recorded but not repaired in
+  5.3.1. `--dry-run` is a marker reserved for a future auto-repair pass.
+- **No views.** Only physical tables and index physical names are accepted.
+- The default `--to-time` is `now - 1 hour`. To compare data written less
+  than one hour ago, pass an explicit `--to-time`.

Review Comment:
   Would it be okay to add an "Upcoming" section as well?
   It could cover the repair phase of the tool.
    
   The Limitations section already touches on this by hinting at a future auto-repair; maybe we can be explicit?



##########
app/pages/_docs/docs/_mdx/(multi-page)/features/phoenix-sync-table.mdx:
##########
@@ -0,0 +1,159 @@
+## Limitations [#sync-table-limitations]
+
+- **Detection only.** Mismatched chunks are recorded but not repaired in
+  5.3.1. `--dry-run` is a marker reserved for a future auto-repair pass.
+- **No views.** Only physical tables and index physical names are accepted.
+- The default `--to-time` is `now - 1 hour`. To compare data written less
+  than one hour ago, pass an explicit `--to-time`.
+
+## See also [#sync-table-see-also]

Review Comment:
   These do not seem to be related; maybe we can remove them?



##########
app/pages/_docs/docs/_mdx/(multi-page)/features/phoenix-sync-table.mdx:
##########
@@ -0,0 +1,159 @@
+---
+title: "PhoenixSyncTable Tool"
+description: "Detect data divergence between a source and a target Phoenix table across two HBase clusters via a chunked hash comparison driven by MapReduce."
+---
+
+`PhoenixSyncTableTool` is a MapReduce-based divergence detector for Phoenix
+tables that are replicated (or migrated) between two HBase clusters. It
+compares chunks of source and target data without transferring full rows over
+the network and records any chunk whose hashes disagree to a Phoenix system
+table for later inspection. Available in Phoenix 5.3.1
+([PHOENIX-7751](https://issues.apache.org/jira/browse/PHOENIX-7751)).
+
+The tool is conceptually similar to HBase's `HashTable`/`SyncTable` pair but
+is Phoenix-aware (respects TTL, `CURRENT_SCN`, tenant id, indexes, and the
+column-encoding scheme) and runs as a **single** MapReduce job with no HDFS
+intermediate. Output is a Phoenix table, queryable with SQL.
+
+`PhoenixSyncTableTool` performs **detection only** in 5.3.1; it does not
+modify the target cluster.
+
+## When to use it [#sync-table-when]
+
+Reach for `PhoenixSyncTableTool` to verify:
+
+- A cluster migration that used HBase snapshots, replication, or both — to
+  confirm the target is byte-for-byte identical after cutover.
+- Long-running HBase replication — to detect cases where a replication peer
+  has silently drifted.
+- DR drills — to confirm the standby is in sync before a planned failover.
+
+For ad-hoc row-count or row-key spot-checks you usually want a small SQL
+query instead; `PhoenixSyncTableTool` is the right choice when you need
+**full-data** confidence with bounded network cost.
+
+## Running the tool [#sync-table-running]
+
+The tool runs through `hbase` (or `hadoop jar`) and takes only two mandatory
+flags — the source table name and the target cluster's ZooKeeper quorum.
+
+```bash
+hbase org.apache.phoenix.mapreduce.PhoenixSyncTableTool \
+  --table-name MY_SCHEMA.MY_TABLE \
+  --target-cluster zk1,zk2,zk3:2181:/hbase \
+  --run-foreground
+```
+
+The source cluster comes from the Hadoop/HBase configuration the job is
+submitted under, so `--target-cluster` is the ZooKeeper quorum of the
+**other** cluster. Accepted quorum formats:
+
+- `host:port:/znode`
+- `h1,h2:port:/znode`
+- `h1:p1,h2:p2:/znode`
+
+### Flags
+
+| Short     | Long                  | Required | Default              | Purpose |
+| --------- | --------------------- | :------: | -------------------- | ------- |
+| `-tn`     | `--table-name`        |   yes    | —                    | Source table (physical name; index physical names are also accepted). |
+| `-tc`     | `--target-cluster`    |   yes    | —                    | ZK quorum of the target cluster. |
+| `-s`      | `--schema`            |    no    | —                    | Phoenix schema name. |
+| `-tenant` | `--tenant-id`         |    no    | —                    | Tenant id for tenant-specific sync. |
+| `-ft`     | `--from-time`         |    no    | `0`                  | Lower bound of the cell-timestamp window, in ms. |
+| `-tt`     | `--to-time`           |    no    | `now - 1 hour`       | Upper bound; also used as `CURRENT_SCN`. The 1-hour buffer gives async replication time to catch up. |
+| `-cs`     | `--chunk-size`        |    no    | `1073741824` (1 GiB) | Approximate chunk size in bytes. Smaller chunks narrow the divergence search radius at the cost of more checkpoint rows. |
+| `-rs`     | `--raw-scan`          |    no    | `false`              | Include delete markers. |
+| `-rav`    | `--read-all-versions` |    no    | `false`              | Compare every cell version, not just the latest. |
+| `-coal`   | `--coalesce-split`    |    no    | `false`              | Coalesce multiple source regions into one mapper. |
+| `-runfg`  | `--run-foreground`    |    no    | `false`              | Block until the job completes (default is fire-and-forget submit). |
+| `-dr`     | `--dry-run`           |    no    | `false`              | Marker only — reserved for a future auto-repair extension. |
+| `-h`      | `--help`              |    no    | —                    | Print help and exit. |
+
+The mapper count is implicitly the number of source-table regions (one
+mapper per region) unless `--coalesce-split` is set.
+
+## Output [#sync-table-output]
+
+### MapReduce counters
+
+When `--run-foreground` is set, the tool logs counters from the
+`PhoenixSyncTableMapper$SyncCounters` group:
+
+- `MAPPERS_VERIFIED`, `MAPPERS_MISMATCHED`
+- `CHUNKS_VERIFIED`, `CHUNKS_MISMATCHED`
+- `SOURCE_ROWS_PROCESSED`, `TARGET_ROWS_PROCESSED`
+
+### `PHOENIX_SYNC_TABLE_CHECKPOINT`
+
+The tool auto-creates a Phoenix system table on the **source** cluster (90-day
+TTL, Snappy compression) with one row per chunk and per region. To list
+divergences from the last run:
+
+```sql
+SELECT START_ROW_KEY, END_ROW_KEY, COUNTERS, EXECUTION_END_TIME
+FROM   PHOENIX_SYNC_TABLE_CHECKPOINT
+WHERE  TABLE_NAME = 'MY_TABLE'
+  AND  TARGET_CLUSTER = 'zk1,zk2,zk3:2181:/hbase'
+  AND  TYPE = 'CHUNK'
+  AND  STATUS = 'MISMATCHED';
+```
+
+Each row carries `STATUS` (`VERIFIED` or `MISMATCHED`), `TYPE` (`CHUNK` or
+`REGION`), the key range, and a comma-separated `COUNTERS` string with
+per-chunk source and target row counts.
+
+### Resumability
+
+A re-run of the same `(table, target, from-time, to-time, tenant)` tuple
+picks up where the previous run left off — already-verified sub-ranges are
+skipped.
+
+## Prerequisites [#sync-table-prereqs]
+
+- **Cross-cluster line of sight.** Mapper YARN nodes need ZooKeeper and RPC
+  reachability to **both** clusters' RegionServers.
+- **Both clusters must run Phoenix 5.3.1+.**

Review Comment:
   This feature has been backported to 5.2 as well, so I'm not sure whether we
   should advertise it as a new 5.3 feature. Okay to go with how other
   backported features are being published.



##########
app/pages/_docs/docs/_mdx/(multi-page)/features/phoenix-sync-table.mdx:
##########
@@ -0,0 +1,159 @@
+---
+title: "PhoenixSyncTable Tool"
+description: "Detect data divergence between a source and a target Phoenix table across two HBase clusters via a chunked hash comparison driven by MapReduce."
+---
+
+`PhoenixSyncTableTool` is a MapReduce-based divergence detector for Phoenix
+tables that are replicated (or migrated) between two HBase clusters. It
+compares chunks of source and target data without transferring full rows over
+the network and records any chunk whose hashes disagree to a Phoenix system
+table for later inspection. Available in Phoenix 5.3.1
+([PHOENIX-7751](https://issues.apache.org/jira/browse/PHOENIX-7751)).
+
+The tool is conceptually similar to HBase's `HashTable`/`SyncTable` pair but
+is Phoenix-aware (respects TTL, `CURRENT_SCN`, tenant id, indexes, and the
+column-encoding scheme) and runs as a **single** MapReduce job with no HDFS
+intermediate. Output is a Phoenix table, queryable with SQL.
+
+`PhoenixSyncTableTool` performs **detection only** in 5.3.1; it does not
+modify the target cluster.
+
+## When to use it [#sync-table-when]
+
+Reach for `PhoenixSyncTableTool` to verify:
+
+- A cluster migration that used HBase snapshots, replication, or both — to
+  confirm the target is byte-for-byte identical after cutover.
+- Long-running HBase replication — to detect cases where a replication peer
+  has silently drifted.
+- DR drills — to confirm the standby is in sync before a planned failover.
+
+For ad-hoc row-count or row-key spot-checks you usually want a small SQL
+query instead; `PhoenixSyncTableTool` is the right choice when you need
+**full-data** confidence with bounded network cost.
+
+## Running the tool [#sync-table-running]
+
+The tool runs through `hbase` (or `hadoop jar`) and takes only two mandatory
+flags — the source table name and the target cluster's ZooKeeper quorum.
+
+```bash
+hbase org.apache.phoenix.mapreduce.PhoenixSyncTableTool \
+  --table-name MY_SCHEMA.MY_TABLE \
+  --target-cluster zk1,zk2,zk3:2181:/hbase \
+  --run-foreground
+```
+
+The source cluster comes from the Hadoop/HBase configuration the job is
+submitted under, so `--target-cluster` is the ZooKeeper quorum of the
+**other** cluster. Accepted quorum formats:
+
+- `host:port:/znode`
+- `h1,h2:port:/znode`
+- `h1:p1,h2:p2:/znode`
+
+### Flags
+
+| Short     | Long                  | Required | Default              | Purpose |
+| --------- | --------------------- | :------: | -------------------- | ------- |
+| `-tn`     | `--table-name`        |   yes    | —                    | Source table (physical name; index physical names are also accepted). |
+| `-tc`     | `--target-cluster`    |   yes    | —                    | ZK quorum of the target cluster. |
+| `-s`      | `--schema`            |    no    | —                    | Phoenix schema name. |
+| `-tenant` | `--tenant-id`         |    no    | —                    | Tenant id for tenant-specific sync. |
+| `-ft`     | `--from-time`         |    no    | `0`                  | Lower bound of the cell-timestamp window, in ms. |
+| `-tt`     | `--to-time`           |    no    | `now - 1 hour`       | Upper bound; also used as `CURRENT_SCN`. The 1-hour buffer gives async replication time to catch up. |
+| `-cs`     | `--chunk-size`        |    no    | `1073741824` (1 GiB) | Approximate chunk size in bytes. Smaller chunks narrow the divergence search radius at the cost of more checkpoint rows. |
+| `-rs`     | `--raw-scan`          |    no    | `false`              | Include delete markers. |
+| `-rav`    | `--read-all-versions` |    no    | `false`              | Compare every cell version, not just the latest. |
+| `-coal`   | `--coalesce-split`    |    no    | `false`              | Coalesce multiple source regions into one mapper. |
+| `-runfg`  | `--run-foreground`    |    no    | `false`              | Block until the job completes (default is fire-and-forget submit). |
+| `-dr`     | `--dry-run`           |    no    | `false`              | Marker only — reserved for a future auto-repair extension. |
+| `-h`      | `--help`              |    no    | —                    | Print help and exit. |
+
+The mapper count is implicitly the number of source-table regions (one
+mapper per region) unless `--coalesce-split` is set.
+
+## Output [#sync-table-output]
+
+### MapReduce counters
+
+When `--run-foreground` is set, the tool logs counters from the
+`PhoenixSyncTableMapper$SyncCounters` group:
+
+- `MAPPERS_VERIFIED`, `MAPPERS_MISMATCHED`
+- `CHUNKS_VERIFIED`, `CHUNKS_MISMATCHED`
+- `SOURCE_ROWS_PROCESSED`, `TARGET_ROWS_PROCESSED`
+
+### `PHOENIX_SYNC_TABLE_CHECKPOINT`
+
+The tool auto-creates a Phoenix system table on the **source** cluster (90-day
+TTL, Snappy compression) with one row per chunk and per region. To list
+divergences from the last run:
+
+```sql
+SELECT START_ROW_KEY, END_ROW_KEY, COUNTERS, EXECUTION_END_TIME
+FROM   PHOENIX_SYNC_TABLE_CHECKPOINT
+WHERE  TABLE_NAME = 'MY_TABLE'
+  AND  TARGET_CLUSTER = 'zk1,zk2,zk3:2181:/hbase'
+  AND  TYPE = 'CHUNK'
+  AND  STATUS = 'MISMATCHED';
+```
+
+Each row carries `STATUS` (`VERIFIED` or `MISMATCHED`), `TYPE` (`CHUNK` or
+`REGION`), the key range, and a comma-separated `COUNTERS` string with
+per-chunk source and target row counts.
+
+### Resumability
+
+A re-run of the same `(table, target, from-time, to-time, tenant)` tuple
+picks up where the previous run left off — already-verified sub-ranges are
+skipped.
+
+## Prerequisites [#sync-table-prereqs]
+
+- **Cross-cluster line of sight.** Mapper YARN nodes need ZooKeeper and RPC
+  reachability to **both** clusters' RegionServers.
+- **Both clusters must run Phoenix 5.3.1+.**
+- **Live read, not snapshot-based.** Both clusters are scanned through the
+  regular Phoenix read path.
+- **Kerberos** delegation tokens for the target cluster are acquired
+  automatically when security is enabled.
+- The submitter principal needs `READ` on the physical HBase tables on both
+  clusters, plus `WRITE` to `PHOENIX_SYNC_TABLE_CHECKPOINT` on the source.
+- Views and logical (not physical) index names are rejected. Pass the
+  physical index table name to validate an index.
+
+## Tuning [#sync-table-tuning]
+
+`--chunk-size` is the main lever:
+
+- Larger chunks (e.g. 4 GiB) reduce checkpoint rows and per-chunk overhead
+  but make every mismatch report a coarser range.
+- Smaller chunks (e.g. 64 MiB) narrow the mismatch search radius and produce
+  more checkpoint rows.
+
+The tool runs at long-scan timescales. Adjust these client-side timeouts
+(set in the Hadoop `Configuration` the job is submitted with) if you see
+scanner timeouts on very large regions:
+
+| Property                                    | Default      |
+| ------------------------------------------- | ------------ |
+| `phoenix.sync.table.query.timeout`          | ~150 minutes |
+| `phoenix.sync.table.rpc.timeout`            | 30 minutes   |
+| `phoenix.sync.table.client.scanner.timeout` | 30 minutes   |
+| `phoenix.sync.table.rpc.retries.counter`    | 5            |
+
+## Limitations [#sync-table-limitations]
+
+- **Detection only.** Mismatched chunks are recorded but not repaired in
+  5.3.1. `--dry-run` is a marker reserved for a future auto-repair pass.
+- **No views.** Only physical tables and index physical names are accepted.
+- The default `--to-time` is `now - 1 hour`. To compare data written less

Review Comment:
   This doesn't look like a limitation to me.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

