Enis Soztutar created HBASE-9797:
------------------------------------
Summary: Multi row transactions are not atomic for scanners
Key: HBASE-9797
URL: https://issues.apache.org/jira/browse/HBASE-9797
Project: HBase
Issue Type: Bug
Reporter: Enis Soztutar
Multi-row atomic puts, as implemented by the coprocessor API, are atomic for gets
and multi-gets, but not for scanners.
The mvcc read point, as of today, is only kept in RS memory. When a client starts
a scan, we create a new scanner object on the RS and save the mvcc read point of
the scan there. Since the scan API is row-based, the scan results are only made
visible to clients row by row, and the client scanner keeps track of the last
row seen.
So, for a multi-row atomic update, the scanner might get an mvcc read number
which is less than the commit point of the multi-row update, so it will skip the
rows written by that update (it will not see them). However, in case of RS
failover, a new scanner will be created which will have an mvcc read number
larger than the multi-row update commit number. So the scanner will see the
remaining rows from the transaction, i.e. only part of the update.
Example:
{code}
multi put : { {row1, c1, v1}, {row100, c1, v100} } mvcc write number = 2
scan : scan from row1 to row100 mvcc read number = 1
{code}
The scanner will not see row1. If the RS fails before the scanner reaches row100,
the new scanner will get an mvcc read number > 2, so it will see row100.
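To make the example concrete, here is a toy Java model of it. This is only a sketch, not HBase code: the class and method names are made up for illustration, rows are modeled as plain integers, and the visibility rule is assumed to be "a cell is visible when its mvcc write number is <= the scanner's read point".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy model of the scan anomaly. Hypothetical names, not the HBase API. */
final class MvccScanAnomaly {

    /** Returns the rows in [startRow, stopRow] that are visible at readPoint. */
    static List<Integer> scan(TreeMap<Integer, Long> store, int startRow,
                              int stopRow, long readPoint) {
        List<Integer> seen = new ArrayList<>();
        for (Map.Entry<Integer, Long> e
                : store.subMap(startRow, true, stopRow, true).entrySet()) {
            // Assumed visibility rule: write number <= read point.
            if (e.getValue() <= readPoint) {
                seen.add(e.getKey());
            }
        }
        return seen;
    }

    /** Scans rows 1..50 at read point 1, then resumes 51..100 at resumeReadPoint. */
    static List<Integer> scanWithFailover(long resumeReadPoint) {
        // store maps row -> mvcc write number of the put that wrote it.
        TreeMap<Integer, Long> store = new TreeMap<>();
        store.put(1, 2L);    // {row1, c1, v1},     mvcc write number = 2
        store.put(100, 2L);  // {row100, c1, v100}, same multi put

        // Original scanner: read point 1; the RS fails after row 50.
        List<Integer> result = new ArrayList<>(scan(store, 1, 50, 1));
        // The replacement scanner continues after the last row seen.
        result.addAll(scan(store, 51, 100, resumeReadPoint));
        return result;
    }

    public static void main(String[] args) {
        // New read point after failover: half of the multi put is visible.
        System.out.println(scanWithFailover(3)); // prints [100]
        // Read point preserved across failover: none of it is visible.
        System.out.println(scanWithFailover(1)); // prints []
    }
}
```

The second call shows what a consistent scan would return: either both rows or neither, never just row100.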
There might be a couple of ways to fix this. The first approach (as suggested by
Sergey) is to wrap the Scanner in an atomic scanner implementation, which will
restart the scan in case of a socket timeout, server failure, etc. The wrapper
will buffer the results so that the rows are not visible to the caller until
the whole scan has completed. For small scans (like meta) this might be viable.
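A rough sketch of what such a wrapper could look like is below. It is written against a generic `Supplier<Iterator<T>>` rather than the real HBase scanner classes, so `scanAtomically`, the flaky source in `demo()`, and the use of `UncheckedIOException` to signal a failure are all assumptions for illustration only.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.Supplier;

/** Sketch of a restart-on-failure scan wrapper. Hypothetical, not the HBase API. */
final class AtomicScanSketch {

    /**
     * Drains a freshly opened scanner into a buffer. On failure the partial
     * buffer is thrown away and the scan restarts from the beginning, so the
     * caller only ever sees the rows of one complete attempt.
     */
    static <T> List<T> scanAtomically(Supplier<Iterator<T>> openScanner, int maxAttempts) {
        UncheckedIOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            List<T> buffer = new ArrayList<>();
            try {
                Iterator<T> it = openScanner.get();
                while (it.hasNext()) {
                    buffer.add(it.next());
                }
                return buffer; // complete scan: now safe to expose
            } catch (UncheckedIOException e) {
                last = e; // drop the partial results and retry
            }
        }
        throw last != null ? last : new IllegalArgumentException("maxAttempts must be positive");
    }

    /** Demo: the first attempt fails after one row, the second one succeeds. */
    static List<String> demo() {
        final int[] opens = {0};
        Supplier<Iterator<String>> flaky = () -> {
            final boolean failThisTime = opens[0]++ == 0;
            final Iterator<String> rows = Arrays.asList("row1", "row100").iterator();
            return new Iterator<String>() {
                private int served = 0;
                @Override public boolean hasNext() { return rows.hasNext(); }
                @Override public String next() {
                    if (failThisTime && served == 1) {
                        throw new UncheckedIOException(new IOException("simulated RS failure"));
                    }
                    served++;
                    return rows.next();
                }
            };
        };
        return scanAtomically(flaky, 3);
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [row1, row100]
    }
}
```

The whole result set has to fit in client memory and a failure near the end repeats all the work, which is why this only looks reasonable for small scans.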
The second way, which fixes this properly, is to first finish up the patch at
HBASE-8763, then change the scanner to obtain an mvcc number from the RS at
scanner open and save that mvcc number on the client side. Upon failure, the
scanner will continue the scan where it left off, using the saved mvcc number.
We have to keep the low watermark (the smallest mvcc read number of the scanners
currently open) differently. Currently that number is already tracked, but not
across RS failover. We can use timeouts to manage the low watermark, I think.
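One possible shape for the timeout idea is a lease per open scanner, sketched below. Nothing here exists in HBase: `ReadPointLeases`, its methods, and the rule that a lease expires once `expiresAt <= now` are assumptions made for the sketch, and the clock is passed in explicitly to keep it deterministic.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch of lease-based low watermark tracking. Hypothetical, not HBase code. */
final class ReadPointLeases {

    private static final class Lease {
        final long readPoint;
        final long expiresAt;
        Lease(long readPoint, long expiresAt) {
            this.readPoint = readPoint;
            this.expiresAt = expiresAt;
        }
    }

    private final long ttlMillis;
    private final Map<Long, Lease> leases = new HashMap<>(); // scannerId -> lease

    ReadPointLeases(long ttlMillis) {
        this.ttlMillis = ttlMillis;
    }

    /** Called on scanner open, on each next() call, and on resume after failover. */
    void openOrRenew(long scannerId, long readPoint, long nowMillis) {
        leases.put(scannerId, new Lease(readPoint, nowMillis + ttlMillis));
    }

    /** Called when the client closes the scanner cleanly. */
    void close(long scannerId) {
        leases.remove(scannerId);
    }

    /**
     * Smallest read point among scanners with a live lease, or
     * currentReadPoint if none are open. Expired leases are dropped first,
     * so an abandoned scanner cannot hold the watermark back forever.
     */
    long lowWatermark(long currentReadPoint, long nowMillis) {
        leases.values().removeIf(l -> l.expiresAt <= nowMillis);
        long min = currentReadPoint;
        for (Lease l : leases.values()) {
            min = Math.min(min, l.readPoint);
        }
        return min;
    }

    public static void main(String[] args) {
        ReadPointLeases tracker = new ReadPointLeases(1000);
        tracker.openOrRenew(1, 5, 0);    // scanner 1, read point 5, lease until t=1000
        tracker.openOrRenew(2, 8, 500);  // scanner 2, read point 8, lease until t=1500
        System.out.println(tracker.lowWatermark(10, 600));  // prints 5
        System.out.println(tracker.lowWatermark(10, 1200)); // prints 8
        System.out.println(tracker.lowWatermark(10, 2000)); // prints 10
    }
}
```

A scanner that resumes on a new RS after its lease has expired could no longer be served at its saved read point, so it would presumably have to fail or restart the scan.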
This approach also enables us to implement a cell-based streaming scan instead of
the row-based approach we have today.
Opened the issue, so that it is tracked. Feel free to pick it up if you like.