[ https://issues.apache.org/jira/browse/HBASE-8701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13679885#comment-13679885 ]

Sergey Shelukhin commented on HBASE-8701:
-----------------------------------------

bq. Where? Thanks.
page 8: "We avoid duplicating log reads by first sorting the commit log 
entries in order of the keys ⟨table, row name, log sequence number⟩. In the 
sorted output, all mutations for a particular tablet are contiguous and can 
therefore be read efficiently with one disk seek followed by a sequential read. 
To parallelize the sorting, we partition the log file into 64 MB segments, 
and sort each segment in parallel on different tablet servers."

It's not clear whether these sorted outputs are ready to load directly or 
still need to be replayed, but it should be ok to do the former.
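
To illustrate the sort order (a rough sketch; LogEntry and LogSorter are 
made-up names, not Bigtable or HBase code): sorting each 64 MB segment by 
⟨table, row, log sequence number⟩ makes all of a region's edits contiguous in 
the output.
{code:java}
// Rough sketch only: hypothetical classes, not Bigtable or HBase code.
import java.util.Comparator;
import java.util.List;

final class LogEntry {
    final String table;           // table the mutation belongs to
    final byte[] row;             // row key of the mutation
    final long logSequenceNumber; // original WAL sequence number
    final byte[] payload;         // serialized edit

    LogEntry(String table, byte[] row, long seq, byte[] payload) {
        this.table = table;
        this.row = row;
        this.logSequenceNumber = seq;
        this.payload = payload;
    }
}

final class LogSorter {
    // Sort one ~64 MB segment's entries by (table, row, log sequence number)
    // so all mutations for a particular region become contiguous.
    static void sortSegment(List<LogEntry> segment) {
        segment.sort(Comparator
            .comparing((LogEntry e) -> e.table)
            .thenComparing((LogEntry e) -> e.row, LogSorter::compareBytes)
            .thenComparingLong(e -> e.logSequenceNumber));
    }

    // Lexicographic unsigned byte comparison, the usual ordering for row keys.
    private static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) {
                return d;
            }
        }
        return a.length - b.length;
    }
}
{code}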

bq. Because I do not understand how it works. I give an outline above of my 
understanding and going by it, I conclude it too complex. Help me understand 
better. Thanks.
How HalfStoreFileReader works right now is by having a reference and a split 
key. This thing could either keep references, extended to support multiple 
split keys; or, for a more involved solution w/o references, carry an index 
(at the beginning, or the tail) that points to the precise location where each 
file's data starts. However, in the latter case it's not clear where to store 
the file, as you said.
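
For the first variant, roughly something like this (MultiReference is a 
made-up name, just a sketch of generalizing the Reference + split key idea 
behind HalfStoreFileReader to arbitrary key ranges):
{code:java}
// Rough sketch only: hypothetical generalization, not an existing HBase class.
import java.util.Comparator;

final class MultiReference {
    private final byte[] startKey;       // inclusive lower bound, null = file start
    private final byte[] endKey;         // exclusive upper bound, null = file end
    private final String referencedFile; // path of the real HFile being referenced

    MultiReference(byte[] startKey, byte[] endKey, String referencedFile) {
        this.startKey = startKey;
        this.endKey = endKey;
        this.referencedFile = referencedFile;
    }

    // A reader over this reference would seek to startKey in the referenced file
    // and stop once it reaches endKey, much like HalfStoreFileReader does today
    // with a single split key and a top/bottom flag.
    boolean contains(byte[] rowKey, Comparator<byte[]> cmp) {
        boolean aboveStart = startKey == null || cmp.compare(rowKey, startKey) >= 0;
        boolean belowEnd = endKey == null || cmp.compare(rowKey, endKey) < 0;
        return aboveStart && belowEnd;
    }
}
{code}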

bq. We'd associate seqid w/ kvs throughout the system so we could do 
distributed log replay?
That would also allow things like picking files arbitrarily for compactions.
bq. Associating seqid w/ keyvalue is a radical reworking of internals that has 
been punted on in the past because the size of the work involved was thought 
too large; the base KV would have to change as would how we carry edits in 
memstore, our policy incrementing/upserting when thousands received a second, 
and we'd then have to redo how we currently persist seqid in hfiles on way in 
and out. If a seqid, do we need a mvcc or should they be related and if not, 
how should they be related? And so on.
I might be missing something... can you elaborate why? Right now we use seqid, 
taken from the file metadata, as a last-ditch conflict resolution mechanism 
(see KeyValueHeap::KVScannerComparator) after timestamps and the rest of the 
key. We can do the same but take the seqid from the KV instead of the file. 
Granted, if it's only there to resolve identical keys, that's a lot of bytes 
to store...
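
What I have in mind, as a rough sketch (CellWithSeqId and SeqIdTieBreaker are 
made-up names, not HBase classes, and this is not the actual 
KVScannerComparator):
{code:java}
// Rough sketch only: if each cell carried its own seqid, ties on the full key
// (row, family, qualifier, timestamp, type) could be broken by the cell's
// seqid instead of the containing file's seqid.
import java.util.Comparator;

final class CellWithSeqId {
    final byte[] key;  // encoded (row, family, qualifier, timestamp, type)
    final long seqId;  // per-cell sequence id (the hypothetical addition)

    CellWithSeqId(byte[] key, long seqId) {
        this.key = key;
        this.seqId = seqId;
    }
}

final class SeqIdTieBreaker implements Comparator<CellWithSeqId> {
    private final Comparator<byte[]> keyComparator;

    SeqIdTieBreaker(Comparator<byte[]> keyComparator) {
        this.keyComparator = keyComparator;
    }

    @Override
    public int compare(CellWithSeqId a, CellWithSeqId b) {
        int byKey = keyComparator.compare(a.key, b.key);
        if (byKey != 0) {
            return byKey;  // normal key ordering, timestamps included
        }
        // Identical keys: the newer edit (higher seqid) sorts first, mirroring
        // how the file-level seqid is used today as the last-ditch tiebreaker.
        return Long.compare(b.seqId, a.seqId);
    }
}
{code}
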
The behavior where the memstore overwrites the value appears plain incorrect 
to me when VERSIONS is more than 1 (whether you get a version or not depends 
on when the memstore flush happens), so fixing that would be an additional 
(and automatic?) advantage.
We don't necessarily have to change anything else; what do you have in mind?


                
> distributedLogReplay need to apply wal edits in the receiving order of those 
> edits
> ----------------------------------------------------------------------------------
>
>                 Key: HBASE-8701
>                 URL: https://issues.apache.org/jira/browse/HBASE-8701
>             Project: HBase
>          Issue Type: Bug
>          Components: MTTR
>            Reporter: Jeffrey Zhong
>            Assignee: Jeffrey Zhong
>             Fix For: 0.98.0, 0.95.2
>
>
> This issue happens in distributedLogReplay mode when recovering multiple puts 
> of the same key + version (timestamp). After replay, the resulting value of 
> the key is nondeterministic.
> h5. The original concern situation raised by [~eclark]:
> For all edits the rowkey is the same.
> There's a log with: [ A (ts = 0), B (ts = 0) ]
> Replay the first half of the log.
> A user puts in C (ts = 0)
> Memstore has to flush
> A new Hfile will be created with [ C, A ] and MaxSequenceId = C's seqid.
> Replay the rest of the Log.
> Flush
> The issue will happen in similar situations, e.g. Put(key, t=T) in WAL1 and 
> Put(key, t=T) in WAL2
> h5. Below is the option I'd like to use:
> a) During replay, we pass the wal file name hash in each replay batch and the 
> original wal sequence id of each edit to the receiving RS
> b) Once a wal is recovered, the replaying RS sends a signal to the receiving 
> RS so the receiving RS can flush
> c) In the receiving RS, edits from different WAL files of a region go to 
> different memstores. (At a high level, we can visualize this as sending 
> changes to a new region object named (origin region name + wal name hash) and 
> using the original sequence ids.)
> d) writes from normal traffic (writes are allowed during recovery) go into 
> normal memstores as today and flush normally with new sequence ids.
> h5. The other alternative options are listed below for reference:
> Option one
> a) disallow writes during recovery
> b) during replay, we pass original wal sequence ids
> c) hold flushes until all wals of a recovering region are replayed. The 
> memstore should be able to hold the edits because we only recover unflushed 
> wal edits. For edits with the same key + version, whichever has the larger 
> sequence id wins.
> Option two
> a) During replay, we pass original wal sequence ids
> b) for each wal edit, we store the edit's original sequence id along with 
> its key
> c) during scanning, we use the original sequence id if it's present, 
> otherwise the store file's sequence id
> d) compaction can just keep the put with the max sequence id
> Please let me know if you have better ideas.
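
As an illustration of the preferred option (a)-(d) above, a rough sketch with 
made-up class names (not the actual distributedLogReplay code): edits are 
routed to a recovery memstore keyed by (region, wal name hash), keep their 
original sequence ids, and are flushed when the replaying RS signals that the 
wal is done; normal writes would keep going to the regular memstore with new 
sequence ids.
{code:java}
// Rough sketch only: ReplayMemstores and Memstore are hypothetical stand-ins.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class ReplayMemstores {
    // One recovery memstore per (recovering region, originating wal name hash),
    // so edits from different WALs never collide inside a single memstore.
    private final Map<String, Memstore> recoveryMemstores = new ConcurrentHashMap<>();

    // Replayed edits keep their original wal sequence ids.
    void replayEdit(String regionName, int walNameHash, long originalSeqId, byte[] edit) {
        String key = regionName + "#" + walNameHash;
        recoveryMemstores
            .computeIfAbsent(key, k -> new Memstore())
            .add(originalSeqId, edit);
    }

    // When the replaying RS signals that a wal is fully recovered, flush the
    // corresponding recovery memstore; the flushed file keeps the original seqids.
    void onWalRecovered(String regionName, int walNameHash) {
        Memstore ms = recoveryMemstores.remove(regionName + "#" + walNameHash);
        if (ms != null) {
            ms.flush();
        }
    }

    // Minimal stand-in; real memstores are far richer than this.
    static final class Memstore {
        void add(long seqId, byte[] edit) { /* buffer the edit keyed by seqId */ }
        void flush() { /* write buffered edits out as a store file */ }
    }
}
{code}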
