[ 
https://issues.apache.org/jira/browse/HBASE-10201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14215113#comment-14215113
 ] 

stack commented on HBASE-10201:
-------------------------------

bq. We need to change protobuf definition

We could add extra fields in pb and write to two places for the life of an 
hbase version to support rolling upgrade.

I hope you do not mind me surfacing here questions asked off-list -- it's best 
to keep the discussion up here rather than off-list so others can participate 
too. 

You described off-list how the distributed log replay opens a region and puts 
the highest *sequenceid* found up in zk and then uses this to figure which 
edits to replay. You also talk of how regionServerReport includes the last 
flush id of each region we carry and that the master keeps this around so on 
log replay we can skip edits already flushed. You then ask:

bq. I think I need to change all these places to use a map which stored 
familyName->maxSeqId instead of a single SeqId. Am I right?

The sequenceid is *region-scoped*: i.e. we keep a running sequenceid per 
region. For the above to work out, we'd need to change the sequenceid scope 
from region to column family. Since our memstore is by column family, and since 
the memstore now uses the region sequenceid as its MVCC, this might be a good 
direction to go in but it is not what we have now.

You cannot have discontinuities in the progress of the flush sequenceid. If 
there are four column families, the edits can go into any of the four families 
in any order, so flushing one family can leave older edits behind in the 
memstores of the others. 

You could do something like what [~gaurav.menghani] suggests above (see 
https://issues.apache.org/jira/browse/HBASE-10201?focusedCommentId=14191203&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14191203):
rather than report, on successful flush, the highest sequenceid of all of a 
region's memstores involved in the flush, when you flush only some column 
families you'd have to report one less than the oldest outstanding edit still 
alive up in a column family memstore.
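
Roughly, as I read the suggestion, the accounting could look like the sketch 
below. It is illustration only: the class, the method, and the shape of the 
map are made up here and are not HBase APIs. It assumes each store can tell us 
the lowest sequenceid still sitting in its memstore.

```java
import java.util.Map;
import java.util.OptionalLong;

// Hypothetical helper, NOT part of HBase: picks the sequenceid that is safe
// to report as "flushed" for a region after a per-column-family flush.
final class FlushedSeqIdSketch {

  /**
   * @param oldestUnflushed family name -> lowest sequenceid still held in that
   *        family's memstore; families with an empty memstore are absent
   * @param regionMaxSeqId  highest sequenceid handed out in the region so far
   */
  static long reportableFlushedSeqId(Map<String, Long> oldestUnflushed,
                                     long regionMaxSeqId) {
    OptionalLong oldest = oldestUnflushed.values().stream()
        .mapToLong(Long::longValue)
        .min();
    // Nothing left in memory: everything up to the region max is on disk.
    // Otherwise report one less than the oldest edit still in a memstore,
    // so replay never skips an edit that was not persisted.
    return oldest.isPresent() ? oldest.getAsLong() - 1 : regionMaxSeqId;
  }
}
```

With cf1 holding edits from sequenceid 42 and cf2 from 17, the region would 
report 16 no matter how high the sequenceid of the family just flushed was.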

What if you did something much less involved; when there is pressure to flush, 
flush the stores with the oldest edits until you've freed enough memory?
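
A minimal sketch of that selection, to make the idea concrete. The StoreInfo 
holder and the method name are invented for this example, and it assumes we 
can cheaply get each store's oldest sequenceid and memstore size; the real 
flush path would need more care than this.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch, NOT HBase code: choose which stores to flush when the
// region is under memory pressure, oldest edits first.
final class OldestFirstFlushSketch {

  // Made-up view of one store's memstore state.
  static final class StoreInfo {
    final String family;
    final long oldestSeqId;    // lowest sequenceid still in the memstore
    final long memstoreBytes;  // bytes a flush of this store would free

    StoreInfo(String family, long oldestSeqId, long memstoreBytes) {
      this.family = family;
      this.oldestSeqId = oldestSeqId;
      this.memstoreBytes = memstoreBytes;
    }
  }

  /** Returns family names to flush, in flush order, until bytesToFree is met. */
  static List<String> selectStoresToFlush(List<StoreInfo> stores, long bytesToFree) {
    List<StoreInfo> byAge = new ArrayList<>(stores);
    byAge.sort(Comparator.comparingLong((StoreInfo s) -> s.oldestSeqId));
    List<String> chosen = new ArrayList<>();
    long freed = 0;
    for (StoreInfo s : byAge) {
      if (freed >= bytesToFree) {
        break; // enough memory reclaimed; leave the younger stores alone
      }
      chosen.add(s.family);
      freed += s.memstoreBytes;
    }
    return chosen;
  }
}
```

Note the loop picks stores strictly by age, not by size, which is where the 
small-storefile downside comes from.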

Upsides are that you'd clear out old edits from memory and we might let go of 
WALs a little faster.  Also, you might not flush all of the content in a region 
-- because flushing just a few stores might be enough to get you back under the 
threshold -- so we might make fewer small storefiles?

Downsides are we'd make some small storefiles (e.g. for those stores that have 
a few old edits in them and little else) and we'd do the flush in series rather 
than in parallel.  Because of sequenceid accounting, we might replay more edits 
than we have to.

> Port 'Make flush decisions per column family' to trunk
> ------------------------------------------------------
>
>                 Key: HBASE-10201
>                 URL: https://issues.apache.org/jira/browse/HBASE-10201
>             Project: HBase
>          Issue Type: Improvement
>          Components: wal
>            Reporter: Ted Yu
>            Assignee: zhangduo
>            Priority: Critical
>             Fix For: 2.0.0, 0.98.9, 0.99.2
>
>         Attachments: 3149-trunk-v1.txt, HBASE-10201-0.98.patch, 
> HBASE-10201-0.98_1.patch, HBASE-10201-0.98_2.patch, HBASE-10201-0.99.patch, 
> HBASE-10201.patch, HBASE-10201_1.patch, HBASE-10201_2.patch, 
> HBASE-10201_3.patch, HBASE-10201_4.patch, HBASE-10201_5.patch, 
> HBASE-10201_6.patch, HBASE-10201_7.patch
>
>
> Currently the flush decision is made using the aggregate size of all column 
> families. When large and small column families co-exist, this causes many 
> small flushes of the smaller CF. We need to make per-CF flush decisions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
