Enis Soztutar created HBASE-9797:
------------------------------------

             Summary: Multi row transactions are not atomic for scanners
                 Key: HBASE-9797
                 URL: https://issues.apache.org/jira/browse/HBASE-9797
             Project: HBase
          Issue Type: Bug
            Reporter: Enis Soztutar


Multi-row atomic puts, as implemented by the coprocessor API, are atomic for 
gets and multi-gets, but not for scanners. 

The mvcc read point, as of today, is only kept in RS memory. When a client 
starts a scan, we create a new scanner object and save the mvcc read point of 
the scan there. Since the scan API is row-based, the scan results are only made 
visible to clients row by row, and the client scanner keeps track of the last 
row seen. 

So, for a multi-row atomic update, the scanner might get an mvcc read number 
that is less than the commit point of the multi-row update, so it will skip the 
rows written by that update. However, in case of RS failover, a new scanner 
will be created with an mvcc read number larger than the multi-row update's 
commit number, so it will see the remaining rows from the transaction. The 
client ends up seeing only part of the update. 

Example: 
{code}
multi put : { {row1, c1, v1}, {row100, c1, v100} } mvcc write number = 2
scan : scan from row1 to row100  mvcc read number = 1
{code}

The scanner will not see row1. If the RS fails before the scanner reaches 
row100, the new scanner will get an mvcc read number > 2, so it will see row100. 
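
To make the example concrete, here is a toy simulation of it. It does not use 
any HBase classes: the store is just a sorted map from row key to the mvcc 
write number that committed the row, and the names (MvccScanAnomaly, scan, 
scanAcrossFailover) as well as the extra pre-existing row "row10" are made up 
for illustration. 

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MvccScanAnomaly {
    /** Toy store: row key -> mvcc write number that committed the row. */
    static TreeMap<String, Long> store() {
        TreeMap<String, Long> rows = new TreeMap<>();
        rows.put("row10", 1L);   // hypothetical row committed before the scan started
        rows.put("row1", 2L);    // multi put, mvcc write number = 2
        rows.put("row100", 2L);  // same multi put
        return rows;
    }

    /**
     * Returns up to 'limit' rows after 'afterRow' (exclusive, null = from the
     * start) whose write number is visible at 'readPoint'.
     */
    static List<String> scan(TreeMap<String, Long> rows, String afterRow,
                             long readPoint, int limit) {
        List<String> out = new ArrayList<>();
        Map<String, Long> range = afterRow == null ? rows : rows.tailMap(afterRow, false);
        for (Map.Entry<String, Long> e : range.entrySet()) {
            if (out.size() == limit) {
                break;
            }
            if (e.getValue() <= readPoint) {
                out.add(e.getKey());
            }
        }
        return out;
    }

    /** First scanner reads at read point 1, then the RS "fails" and a new scanner resumes. */
    static List<String> scanAcrossFailover() {
        TreeMap<String, Long> rows = store();
        // Original scanner: read point 1, so row1 (write number 2) is skipped.
        List<String> seen = new ArrayList<>(scan(rows, null, 1L, 1));
        String lastRow = seen.get(seen.size() - 1);
        // After failover the new scanner gets a fresh read point (> 2) and
        // resumes after the last row the client saw.
        seen.addAll(scan(rows, lastRow, 3L, Integer.MAX_VALUE));
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(scanAcrossFailover()); // prints [row10, row100]
    }
}
```

The client sees row100 but never row1, i.e. half of the multi put. 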


There might be a couple of ways to fix this. The first approach (as suggested 
by Sergey) is to wrap the Scanner in an atomic scanner implementation, which 
will restart the scan in case of a socket timeout, server failure, etc. It will 
buffer the results so that no rows are visible to the caller until the whole 
scan has completed. For small scans (like meta) this might be viable. 
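
A minimal sketch of what such a wrapper could look like, to illustrate the 
idea only. RowScanner and ScanOpener are hypothetical stand-ins for the real 
client scanner and for "open a new scanner over the same range"; they are not 
HBase interfaces, and flakyOpener exists only to simulate a failure. 

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class AtomicScanSketch {
    /** Hypothetical scanner: returns the next row, or null when exhausted. */
    interface RowScanner {
        String next() throws IOException;
    }

    /** Hypothetical factory: opens a fresh scanner (with a fresh read point). */
    interface ScanOpener {
        RowScanner open() throws IOException;
    }

    /**
     * Buffers the whole scan and returns it only if a single scanner ran to
     * completion. On failure the partial buffer is discarded and the scan is
     * restarted from the beginning, so the caller never sees rows that were
     * read at two different read points.
     */
    static List<String> scanAtomically(ScanOpener opener, int maxAttempts) throws IOException {
        IOException lastFailure = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            List<String> buffer = new ArrayList<>();
            try {
                RowScanner scanner = opener.open();
                for (String row = scanner.next(); row != null; row = scanner.next()) {
                    buffer.add(row);
                }
                return buffer;
            } catch (IOException e) {
                lastFailure = e; // drop the partial buffer and retry from scratch
            }
        }
        throw new IOException("scan failed after " + maxAttempts + " attempts", lastFailure);
    }

    /** Demo opener: the first scanner dies after one row, later ones succeed. */
    static ScanOpener flakyOpener(List<String> rows) {
        int[] opens = {0};
        return () -> {
            boolean failing = opens[0]++ == 0;
            Iterator<String> it = rows.iterator();
            int[] served = {0};
            return () -> {
                if (failing && served[0]++ == 1) {
                    throw new IOException("simulated region server failure");
                }
                return it.hasNext() ? it.next() : null;
            };
        };
    }
}
```

The cost is that the whole result has to fit in client memory and a late 
failure repeats the full scan, which is why this only looks reasonable for 
small scans. 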


The second way, which fixes this properly, is to first finish up the patch at 
HBASE-8763, then change the scanner to obtain an mvcc number from the RS at 
scanner open and save it on the client side. Upon failure, the scanner will 
continue the scan where it left off, using the saved mvcc number. We would have 
to keep the low watermark (the smallest mvcc read number of the scanners 
currently open) differently. That number is already tracked today, but not 
across RS failover. I think we can use timeouts to manage the low watermark. 
This approach also enables us to implement a cell-based streaming scan instead 
of the row-based approach we have today. 
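
One possible shape for the timeout-based low watermark, as a rough sketch 
only. Nothing here is existing HBase code: ReadPointLeases, the lease length, 
and the idea that a client renews its lease on every scanner call are all 
assumptions made for illustration. 

```java
import java.util.HashMap;
import java.util.Map;

public class ReadPointLeases {
    private final long leaseMillis;
    // scannerId -> {mvcc read point, time the lease was last renewed}
    private final Map<Long, long[]> leases = new HashMap<>();

    ReadPointLeases(long leaseMillis) {
        this.leaseMillis = leaseMillis;
    }

    /** Called at scanner open, or when a client re-presents a saved read point. */
    void register(long scannerId, long readPoint, long nowMillis) {
        leases.put(scannerId, new long[] {readPoint, nowMillis});
    }

    /** Called on every scanner call so that active scanners keep their lease. */
    void renew(long scannerId, long nowMillis) {
        long[] lease = leases.get(scannerId);
        if (lease != null) {
            lease[1] = nowMillis;
        }
    }

    /**
     * Smallest read point among scanners whose lease has not expired. If there
     * are none, the current read point is returned, meaning nothing older has
     * to be retained.
     */
    long lowWatermark(long nowMillis, long currentReadPoint) {
        // Expire scanners that have not been heard from within the lease period.
        leases.values().removeIf(lease -> nowMillis - lease[1] > leaseMillis);
        long min = currentReadPoint;
        for (long[] lease : leases.values()) {
            min = Math.min(min, lease[0]);
        }
        return min;
    }
}
```

With a 1000 ms lease, a scanner opened at read point 5 holds the low watermark 
at 5 only as long as it keeps renewing. Once it goes silent for more than the 
lease period, the watermark moves on to the next-oldest live scanner. 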

Opened the issue, so that it is tracked. Feel free to pick it up if you like. 