[ 
https://issues.apache.org/jira/browse/HBASE-6774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13645599#comment-13645599
 ] 

Himanshu Vashishtha commented on HBASE-6774:
--------------------------------------------

Hey Enis,

Thanks for asking these questions.

The attached doc defines a *max_completeSequenceId* field per regionserver, 
which the master updates each time it receives a heartbeat from that 
regionserver. When the master processes the server shutdown event, it uses the 
regionserver's max_completeSequenceId to determine how much of the WAL it has 
missed and needs to read before finalizing allWALEntriesFlushed. The goal is to 
process all WALEdits with walEdit#key#logSequenceId > max_completeSequenceId. 
If that means reading the second-to-last WAL as well, it will process that too. 
The rule is to read the latest WAL files first, until we reach a WAL that 
contains some WALEdits with walEdit#key#logSequenceId < max_completeSequenceId. 
Older WALs no longer need to be read at that point. 
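
To make the newest-first rule concrete, here is a minimal sketch of the file 
selection. It is not HBase code: WalSelectionSketch, WalFile and filesToRead 
are hypothetical names, each WAL is reduced to the sequence ids of its edits, 
and it assumes the ids only increase within a file and from older files to 
newer ones. It also treats an edit equal to max_completeSequenceId as already 
known to the master, which is an assumption on my side.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; these names are not HBase APIs.
final class WalSelectionSketch {

    /** One WAL file, reduced to the log sequence ids of its edits in write order. */
    static final class WalFile {
        final String name;
        final long[] sequenceIds;

        WalFile(String name, long... sequenceIds) {
            this.name = name;
            this.sequenceIds = sequenceIds;
        }
    }

    /**
     * Walks the WAL files from newest to oldest and returns the names of the
     * files that must be read. The walk stops after the first file whose head
     * edit is at or below maxCompleteSequenceId.
     */
    static List<String> filesToRead(List<WalFile> newestFirst, long maxCompleteSequenceId) {
        List<String> selected = new ArrayList<>();
        for (WalFile wal : newestFirst) {
            selected.add(wal.name);
            // Files are read from the head, so the first edit carries the
            // lowest sequence id in the file (assumed monotonic ids).
            if (wal.sequenceIds.length > 0 && wal.sequenceIds[0] <= maxCompleteSequenceId) {
                break; // every older file holds only edits the master already knows about
            }
        }
        return selected;
    }
}
```

With files f1 (ids 121 to 130), f2 (ids 90 to 120) and f3 (ids 10 to 89) and 
max_completeSequenceId = 100, the walk selects f1 and f2 and never opens f3.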

bq. If a region has not got any update for some time, its 
latestCompleteFlushSeqId wont be updated at all, since there will be no 
flushes. To reassign this region, we have to ensure that all wals are read. 

The master uses max_completeSequenceId to read the remaining WAL. Once it has 
read all the WALEdits after max_completeSequenceId, allWALEntriesFlushed will 
have the correct information and can be used to decide whether a region can be 
assigned. 


bq. The only reliable way is to read up the wal backwards, 

I am not sure whether a SequenceFile can be read backwards, or how efficient it 
would be. That's why I propose to read a WAL file from its head and re-use the 
existing WALReader code.

As soon as any region is flushed, the master will have the most up-to-date 
information for all regions on that regionserver once it receives the next 
heartbeat.

Consider a rogue scenario: a regionserver sends a report with 
max_completeSequenceId = 100. Under a write-heavy workload the WAL is then 
rolled and the server aborts, and the master misses all of its heartbeats 
before the abort. Based on max_completeSequenceId, we need to read the last 2 
WAL files (1 + 1): the new one, and the one that was current when the master 
got the last heartbeat (it has some entries > 100). Since we read the most 
recent files first, it is easy to determine whether we need to read older WALs 
or not. Let's call those files f1 and f2, where f1 is the latest. 
The master reads f1 first and sees that the first walEdit#key#logSequenceId > 
100, so it enqueues f2 as well, since there might be some entries at f2's tail 
that were missed.
Once it has read f1 and f2 and updated allWALEntriesFlushed for the regions, 
the master can decide which regions can be assigned right away.
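
The last step of that scenario can be sketched as below. This is only an 
illustration under assumptions, not the design from the attached doc: 
ImmediateAssignSketch, Edit and assignableNow are made-up names, the per-region 
allWALEntriesFlushed state is modelled as a plain map seeded from the last 
heartbeat, and any edit with a sequence id above max_completeSequenceId is 
assumed to mark its region as not flushed.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical sketch; none of these names are HBase APIs.
final class ImmediateAssignSketch {

    /** A WAL edit reduced to the two fields the decision needs. */
    static final class Edit {
        final String region;
        final long logSequenceId;

        Edit(String region, long logSequenceId) {
            this.region = region;
            this.logSequenceId = logSequenceId;
        }
    }

    /**
     * Starts from the per-region state known at the last heartbeat, marks every
     * region that has an edit newer than maxCompleteSequenceId as not flushed,
     * and returns the regions that can be assigned without waiting for the split.
     */
    static Set<String> assignableNow(Map<String, Boolean> flushedAtLastHeartbeat,
                                     List<List<Edit>> walsToRead,
                                     long maxCompleteSequenceId) {
        Map<String, Boolean> allWalEntriesFlushed = new TreeMap<>(flushedAtLastHeartbeat);
        for (List<Edit> wal : walsToRead) {
            for (Edit edit : wal) {
                // Only edits the master has not heard about can change the state.
                if (edit.logSequenceId > maxCompleteSequenceId) {
                    allWalEntriesFlushed.put(edit.region, false);
                }
            }
        }
        Set<String> assignable = new TreeSet<>();
        for (Map.Entry<String, Boolean> entry : allWalEntriesFlushed.entrySet()) {
            if (entry.getValue()) {
                assignable.add(entry.getKey());
            }
        }
        return assignable;
    }
}
```

For example, suppose the heartbeat reported regions A, B and C as flushed and D 
as not flushed, f1 holds edits for A only, and f2's tail holds an edit for B 
above 100. Then only C comes back as assignable right away.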

Hope this helps.
                
> Immediate assignment of regions that don't have entries in HLog
> ---------------------------------------------------------------
>
>                 Key: HBASE-6774
>                 URL: https://issues.apache.org/jira/browse/HBASE-6774
>             Project: HBase
>          Issue Type: Improvement
>          Components: master, regionserver
>    Affects Versions: 0.95.2
>            Reporter: Nicolas Liochon
>            Assignee: Himanshu Vashishtha
>         Attachments: HBase-6774-approach.pdf
>
>
> The algo is today, after a failure detection:
> - split the logs
> - when all the logs are split, assign the regions
> But some regions can have no entries at all in the HLog. There are many 
> reasons for this:
> - kind of reference or historical tables. Bulk written sometimes then read 
> only.
> - sequential rowkeys. In this case, most of the regions will be read only. 
> But they can be in a regionserver with a lot of writes.
> - tables flushed often for safety reasons. I'm thinking about meta here.
> For meta; we can imagine flushing very often. Hence, the recovery for meta, 
> in many cases, will be the failure detection time.
> There are different possible algos:
> Option 1)
>  A new task is added, in parallel of the split. This task reads all the HLog. 
> If there is no entry for a region, this region is assigned.
>  Pro: simple
>  Cons: We will need to read all the files. Add a read.
> Option 2)
>  The master writes in ZK the number of log files, per region.
>  When the regionserver starts the split, it reads the full block (64M) and 
> decreases the log file counter of the region. If it reaches 0, the assign 
> starts. At the end of its split, the region server decreases the counter as 
> well. This allows the assign to start even if not all the HLogs are finished. 
> It would make some regions available even if we have an issue in one 
> of the log files.
>  Pro: parallel
>  Cons: adds work for the region server. Requires reading the whole 
> file before starting to write. 
> Option 3)
>  Add some metadata at the end of the log file. The last log file won't have 
> meta data, as if we are recovering, it's because the server crashed. But the 
> others will. And last log file should be smaller (half a block on average).  
> Option 4) Still some metadata, but in a different file. Cons: writes are 
> increased (but not that much, we just need to write the region once). Pros: 
> if we lose the HLog files (major failure, no replica available) we can still 
> continue with the regions that were not written at this stage.
> I think it should be done, even if none of the algorithms above is totally 
> convincing yet. It's linked as well to locality and short circuit reads: with 
> these two points, reading the file twice becomes much less of an issue, for 
> example. My current preference would be to open the file twice in the region 
> server, once for splitting as of today, once for a quick read looking for 
> unused regions. Who knows, maybe it would even be faster this way, since the 
> quick read thread would warm up the different caches for the splitting thread.

