[
https://issues.apache.org/jira/browse/HBASE-11861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14241422#comment-14241422
]
Jonathan Hsieh commented on HBASE-11861:
----------------------------------------
This is a good discussion -- I'll spend some time today designing the options
in more detail and come back with alternatives to consider (single region or
global, locations for delmobs, and potential race conditions). I would like to
note that some pieces could be implemented while this is going on (e.g. the
code for modifying the compaction scan to write del mob lists will be the same
regardless of where the different actions end up happening).
I was initially thinking of a global mob compaction that would scan the set of
del mob files once and only rewrite some regions. The HM would keep a summary
of the delmob files in memory and figure out which mobs need to be rewritten,
and for a first cut the HM would do the IO-heavy portions, since that is
simpler and might be sufficient. If that isn't the case, then we'd have the HM
farm the compaction work out to make the process distributed, using either the
procedure infrastructure (my vote, since [~mbertozzi] is doing work over there)
or the distributed log splitting infrastructure to coordinate.
{quote}
But it's better to compact the mob files in each region since we have to
synchronize the major compaction and mob compaction to avoid the race
condition. The way we do in the sweep tool is to use zookeeper. If we do the
mob compaction in region, we could do it in locks.
{quote}
With the del mob hfile approach, I'm not sure there is a race condition
between major compactions and mob compactions. In the 141030 attachment,
illustrated on slide #23, we use bulk loading of new mobs and new references,
and then use hfile links (or something like them) when reading mobs so that we
point to the original mobs or archived mobs (similar to snapshots). This
avoids the need for zk locks and only uses the region locks already hardened
in the bulk load process.
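As a rough illustration of the "hfile links (or something like them)" idea, the sketch below resolves a mob file name by trying the live location first and the archive second. The function name `resolve_mob_file`, the directory layout, and the file names are hypothetical and only stand in for whatever link mechanism ends up being used; this is not HBase code.

```python
import tempfile
from pathlib import Path


def resolve_mob_file(name, search_dirs):
    """Return the first existing path for `name`, trying each directory in order.

    Hypothetical helper: a reader that falls back from the original location
    to the archive never needs a lock to find a mob that was moved.
    """
    for directory in search_dirs:
        candidate = Path(directory) / name
        if candidate.exists():
            return candidate
    raise FileNotFoundError(name)


# Small demo with an assumed layout: the mob was already moved to the archive.
with tempfile.TemporaryDirectory() as root:
    mob_dir = Path(root) / "mobdir"
    archive_dir = Path(root) / "archive"
    mob_dir.mkdir()
    archive_dir.mkdir()
    (archive_dir / "mobfile-1").write_bytes(b"mob data")

    found = resolve_mob_file("mobfile-1", [mob_dir, archive_dir])
    print(found.parent.name)
```

The point of the design is that references stay valid whether the mob file is still in place or has been archived, which is why no extra zk coordination is needed on the read path.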
{quote}
You're right, if the regions are merged, we could not find the related mob
files at all only by the md5 of the start key.
Currently we have the start key and stop key in the metadata of hfiles. It
means we could not get them only by the file names, but need to open readers to
the files.
Do you have ideas on this to track the start and stop key besides reading the
metadata, to revise the pattern of a mob file name? Please advise. Thanks.
{quote}
I think we'd need to revise the pattern on the del mob file name -- it would
likely need a tuple of (start key, end key, start key, # unique mobs). These
cells would have pointers to the particular files so we could gather counts of
how many cells are being deleted. We might be able to get away with not
changing the format / name of the mob files themselves.
{quote}
Is that possible there are too many delmob files? If not, we could directly
open scanners to these delmob files.
Jon, do you have comments for the way to map the file names to deleted cells?
{quote}
This I don't really know -- let's do a back-of-the-envelope estimate.
We create a del mob file per region compaction (major, and potentially minor
due to ttl age-offs). Worst case, we delete exactly one mob per compaction.
Assuming 1MB / mob, we might need 500 del mobs to meet a 50% threshold
on a 1GB mob file (and this is per region). That is a lot of files.
So I agree, this sounds like a potential problem.
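The estimate above can be reproduced with a few lines of arithmetic. The constants are the assumptions from the comment (1GB mob file, 1MB per mob, 50% threshold, one delete per del mob file); 1GB is taken as 1000MB here to match the round number, and the variable names are illustrative only.

```python
# Back-of-the-envelope estimate; every figure is an assumption from the comment.
MOB_FILE_SIZE_MB = 1000        # 1GB mob file, decimal units for round numbers
MOB_SIZE_MB = 1                # assumed size of each mob
DELETE_THRESHOLD = 0.5         # fraction of deleted mobs that triggers a rewrite
DELETES_PER_DEL_MOB_FILE = 1   # worst case: one deleted mob per compaction

mobs_per_file = MOB_FILE_SIZE_MB // MOB_SIZE_MB
deletes_needed = int(mobs_per_file * DELETE_THRESHOLD)
del_mob_files_needed = deletes_needed // DELETES_PER_DEL_MOB_FILE

print(mobs_per_file, deletes_needed, del_mob_files_needed)  # 1000 500 500
```

Raising `DELETES_PER_DEL_MOB_FILE` shows how quickly the file count drops once compactions batch more than one delete per del mob file.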
{quote}
Sorry, I missed the merge case. In order to get the start/stop keys
information, we have to read the mob files instead of file names in each region
now.
The region split and merge case will be handled in mob compaction by regions.
For split, if the start key of a mob file is between the start and stop keys of
a region, this mob file is handled by this region. Whether this mob file
crosses regions can be determined by checking its stop key. If it does, two or
more ref files will be created, one for each daughter region. Each ref file is
handled in the mob compaction of its daughter region.
For merges, the files are not across regions, we directly select the mob files
if they're qualified (small or invalid) owned by the current region.
In the mob compaction of a b, if a mob file file#1 is selected we need to create
two ref files, one for a b named ref-ab-file#1, the other for c d named
ref-cd-file#1 (if a mob file is not selected, we don't need to create them at
all). The ref file ref-ab-file#1 is handled in the mob compaction of a b to
generate a new mob file file#1ab, and ref-cd-file#1 is handled in the mob
compaction of c d to generate the mob file file#1cd.
After the region ab is split, if (and only if) the file file#1ab is selected in
the mob compaction of region a, the new ref files are created and handled by
region a and region b.
For merge, it's easier than the split, directly select the small or invalid mob
files whose start/stop keys are between the key range of the current region.
{quote}
I think we can have something simpler if we use a different approach. We know
these invariants:
* The del mobs have the names of the mob files.
* Splits or merges do not affect the mob files at all (using del mobs should
decouple major compactions from mob compactions).
If we do a scan on the del mobs instead of the mob files, we could get delete
counts for specific mob files and figure out which mob files to rewrite/compact
with other mob files. Using the reference bulk load mentioned earlier, we don't
even have to worry about splits or merges of the normal regions.
This has me leaning more and more towards a global delmob scan on the
master to identify mob hfiles to compact, as opposed to a per-region approach.
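A minimal sketch of that global scan, assuming each del mob entry can be read as a (mob file name, row key) pair and that the total cell count per mob file is available from somewhere such as file metadata. Both assumptions, along with the function names and the 50% threshold, are hypothetical and not taken from the design doc.

```python
from collections import Counter


def count_deletes_per_mob_file(del_mob_entries):
    """Tally deleted cells per mob file from (mob_file_name, row_key) pairs."""
    return Counter(mob_file for mob_file, _row_key in del_mob_entries)


def select_mob_files_to_rewrite(delete_counts, cells_per_mob_file, threshold=0.5):
    """Pick mob files whose deleted fraction meets or exceeds the threshold."""
    selected = []
    for mob_file, deleted in delete_counts.items():
        total = cells_per_mob_file.get(mob_file)
        if total and deleted / total >= threshold:
            selected.append(mob_file)
    return sorted(selected)


# Hypothetical scan result: two deletes point at mobfile-1, one at mobfile-2.
entries = [("mobfile-1", "row1"), ("mobfile-1", "row2"), ("mobfile-2", "row9")]
totals = {"mobfile-1": 4, "mobfile-2": 10}

counts = count_deletes_per_mob_file(entries)
print(select_mob_files_to_rewrite(counts, totals))  # ['mobfile-1']
```

Because the tally is keyed by mob file name only, region boundaries never enter the calculation, which is what makes splits and merges irrelevant to the selection step.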
{quote}
Currently we track the start/stop keys in the metadata of mob files. But it's
hard to track the counts in each mob file since we have threshold for the mob
cells.
In this design doc, the mob compaction is handled in each region, it means only
part of mob files (owned by the current region) could be handled each time.
Instead, we could also do the mob compaction globally (in one single place) for
all the mob files. But how to avoid the race condition between the major
compaction and mob compaction for this? Still use the zookeeper?
Since the major compaction and mob compaction are not frequent, and deletion is
rare in the mob cases, could we ignore the race condition directly? Please
advise. Thanks.
{quote}
I think the bulk load approach avoids the potential race between mob compaction
and normal compaction. There might be a case where a new delmob shows up while
a mob compaction is happening, but we'd just need to keep the list of del mobs
we read when we do the del mob scan, so that we don't accidentally remove
new del mobs that a normal compaction creates while the mob compaction is
happening.
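The bookkeeping described here can be sketched as follows: take a snapshot of the del mob file list when the scan starts, and afterwards remove only the files in that snapshot. The callbacks `list_del_mob_files`, `compact`, and `remove_file` are hypothetical placeholders for the real listing, compaction, and cleanup steps.

```python
def run_mob_compaction(list_del_mob_files, compact, remove_file):
    """Compact against a fixed snapshot of del mob files, then remove only those."""
    snapshot = frozenset(list_del_mob_files())  # the del mobs this run will read
    compact(snapshot)
    for name in sorted(snapshot):
        remove_file(name)  # del mobs created after the snapshot are left alone
    return snapshot


# Simulated store: a normal compaction adds delmob-3 while the mob compaction runs.
store = {"delmob-1", "delmob-2"}


def compact(snapshot):
    store.add("delmob-3")


processed = run_mob_compaction(lambda: set(store), compact, store.discard)
print(sorted(processed), sorted(store))  # ['delmob-1', 'delmob-2'] ['delmob-3']
```

The late-arriving del mob file survives and is picked up by the next mob compaction run.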
> Native MOB Compaction mechanisms.
> ---------------------------------
>
> Key: HBASE-11861
> URL: https://issues.apache.org/jira/browse/HBASE-11861
> Project: HBase
> Issue Type: Sub-task
> Components: regionserver, Scanners
> Affects Versions: 2.0.0
> Reporter: Jonathan Hsieh
> Attachments: 141030-mob-compaction.pdf, mob compaction.pdf
>
>
> Currently, the first cut of mob will have external processes to age off old
> mob data (the ttl cleaner), and to compact away deleted or overwritten data
> (the sweep tool).
> From an operational point of view, having two external tools, especially one
> that relies on MapReduce, is undesirable. In this issue we'll tackle
> integrating these into hbase without requiring external processes.