Viraj Jasani created PHOENIX-7106:
-------------------------------------

             Summary: Invalid rowkey returned by coproc can cause data integrity issues
                 Key: PHOENIX-7106
                 URL: https://issues.apache.org/jira/browse/PHOENIX-7106
             Project: Phoenix
          Issue Type: Improvement
            Reporter: Viraj Jasani
            Assignee: Viraj Jasani


The HBase scanner interface expects the server to scan cells from HFiles or
the block cache and return consistent data, i.e. the rowkeys of the returned
cells should stay within the scan boundaries. When a region moves and the
scanner needs a reset, or when the current row is too large and the server
returns a partial row, the subsequent scanner#next call is supposed to return
the remaining cells. When this happens, the rowkeys of the cells returned by
the server, including those returned by any coprocessor, are expected to be
within the scan boundaries so that the server can reliably perform its
validation and return the remaining cells as expected.
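
The boundary expectation described above can be illustrated with a minimal
sketch. It assumes the usual HBase convention of an inclusive start row, an
exclusive stop row, an empty byte array meaning "unbounded", and unsigned
lexicographic ordering of rowkeys. The class and method names are hypothetical
and are not part of HBase or Phoenix.

```java
import java.util.Arrays;

// Hypothetical illustration only; not an HBase or Phoenix class.
final class ScanRangeCheck {

    /**
     * Returns true if rowKey lies in [startRow, stopRow).
     * An empty startRow or stopRow is treated as unbounded on that side.
     */
    static boolean isInScanRange(byte[] rowKey, byte[] startRow, byte[] stopRow) {
        // Rowkeys are compared as unsigned bytes, lexicographically.
        boolean atOrAfterStart =
            startRow.length == 0 || Arrays.compareUnsigned(rowKey, startRow) >= 0;
        boolean beforeStop =
            stopRow.length == 0 || Arrays.compareUnsigned(rowKey, stopRow) < 0;
        return atOrAfterStart && beforeStop;
    }

    public static void main(String[] args) {
        byte[] start = {0x10};
        byte[] stop = {0x20};
        // A rowkey between the boundaries is valid.
        System.out.println(isInScanRange(new byte[] {0x15}, start, stop)); // true
        // An empty rowkey sorts before any non-empty start row, so it is out of range.
        System.out.println(isInScanRange(new byte[0], start, stop));       // false
    }
}
```

Under these assumptions, an empty or reserved rowkey returned by a coprocessor
fails the check whenever the scan has a non-empty start row.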

The Phoenix client initiates serial or parallel scans from the aggregators
based on the region boundaries, and the scan boundaries are sometimes adjusted
based on the key ranges provided by the where optimizer, to include tenant
boundaries, salt boundaries, etc. After the client opens the scanner and
performs the scan operation, some of the coprocessors return an invalid rowkey
in the following cases:
 # Grouped aggregate queries
 # Ungrouped aggregate queries (not all of them)
 # Offset queries
 # Some dummy cells returned with empty rowkey
 # Update statistics queries
 # Local indexes

Since many of these cases return reserved rowkeys, they are unlikely to match
the scan or region boundaries. This can cause data integrity issues in certain
scenarios, as explained above. An empty rowkey returned by the server can be
treated as the end of the region scan by the HBase client.
With the paging feature enabled, if the page size is kept low, there is a
higher chance of scanners returning a dummy cell, which results in more RPC
calls (the trade-off paging makes for better latency and fewer timeouts). We
should return only valid rowkeys within the scan range in all cases where we
perform the operations listed above, such as complex aggregate or offset
queries.
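
One possible way to pick a rowkey that stays inside the scan range for a
synthetic result (an aggregate row or a dummy cell) is sketched below. This is
only an illustration under stated assumptions, not the fix adopted for this
issue: the helper name, its parameters, and the fallback order (last scanned
row, then scan start row, then the smallest non-empty key) are all
hypothetical, and the start row is assumed to be inclusive.

```java
import java.util.Arrays;

// Hypothetical helper; not part of the Phoenix or HBase API.
final class ValidRowKeyChooser {

    /**
     * Chooses a non-empty rowkey inside [startRow, stopRow) for a synthetic cell.
     * lastScannedRow may be null if no row has been read yet; it is assumed to
     * be in range because it was produced by the scan itself.
     */
    static byte[] chooseRowKey(byte[] lastScannedRow, byte[] startRow, byte[] stopRow) {
        if (lastScannedRow != null && lastScannedRow.length > 0) {
            return lastScannedRow;
        }
        if (startRow.length > 0) {
            // The start row is inclusive, so it is itself a valid key.
            return startRow;
        }
        // Unbounded start: fall back to the smallest non-empty key if it fits.
        byte[] smallest = {0x00};
        if (stopRow.length == 0 || Arrays.compareUnsigned(smallest, stopRow) < 0) {
            return smallest;
        }
        throw new IllegalStateException("scan range contains no non-empty rowkey");
    }
}
```

For example, `chooseRowKey(null, new byte[] {0x10}, new byte[] {0x20})` returns
the start row `{0x10}` rather than an empty or reserved key.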



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
