busbey commented on a change in pull request #1232: HBASE-23198 Update ref 
guide for distributed MOB compaction.
URL: https://github.com/apache/hbase/pull/1232#discussion_r387262356
 
 

 ##########
 File path: src/main/asciidoc/_chapters/hbase_mob.adoc
 ##########
 @@ -181,84 +320,51 @@ suit your environment, and restart or rolling restart 
the RegionServer.
 ----
 ====
 
-=== MOB Optimization Tasks
-
 ==== Manually Compacting MOB Files
 
 To manually compact MOB files, rather than waiting for the
-<<mob.cache.configure,configuration>> to trigger compaction, use the
-`compact` or `major_compact` HBase shell commands. These commands
+periodic chore to trigger compaction, use the
+`major_compact` HBase shell commands. These commands
 require the first argument to be the table name, and take a column
-family as the second argument. and take a compaction type as the third 
argument.
+family as the second argument. If used with a column family that includes MOB 
data, then
+these operator requests will result in the MOB data being compacted.
 
 ----
-hbase> compact 't1', 'c1’, ‘MOB’
-hbase> major_compact 't1', 'c1’, ‘MOB’
+hbase> major_compact 't1'
+hbase> major_compact 't2', 'c1'
 ----
 
-These commands are also available via `Admin.compact` and
-`Admin.majorCompact` methods.
-
-=== MOB architecture
-
-This section is derived from information found in
-link:https://issues.apache.org/jira/browse/HBASE-11339[HBASE-11339]. For more 
information see
-the attachment on that issue
-"link:https://issues.apache.org/jira/secure/attachment/12724468/HBase%20MOB%20Design-v5.pdf[Base
 MOB Design-v5.pdf]".
-
-==== Overview
-The MOB feature reduces the overall IO load for configured column families by 
storing values that
-are larger than the configured threshold outside of the normal regions to 
avoid splits, merges, and
-most importantly normal compactions.
-
-When a cell is first written to a region it is stored in the WAL and memstore 
regardless of value
-size. When memstores from a column family configured to use MOB are eventually 
flushed two hfiles
-are written simultaneously. Cells with a value smaller than the threshold size 
are written to a
-normal region hfile. Cells with a value larger than the threshold are written 
into a special MOB
-hfile and also have a MOB reference cell written into the normal region HFile.
-
-MOB reference cells have the same key as the cell they are based on. The value 
of the reference cell
-is made up of two pieces of metadata: the size of the actual value and the MOB 
hfile that contains
-the original cell. In addition to any tags originally written to HBase, the 
reference cell prepends
-two additional tags. The first is a marker tag that says the cell is a MOB 
reference. This can be
-used later to scan specifically just for reference cells. The second stores 
the namespace and table
-at the time the MOB hfile is written out. This tag is used to optimize how the 
MOB system finds
-the underlying value in MOB hfiles after a series of HBase snapshot operations 
(ref HBASE-12332).
-Note that tags are only available within HBase servers and by default are not 
sent over RPCs.
+This same request can be made via the `Admin.majorCompact` Java API.
 
-All MOB hfiles for a given table are managed within a logical region that does 
not directly serve
-requests. When these MOB hfiles are created from a flush or MOB compaction 
they are placed in a
-dedicated mob data area under the hbase root directory specific to the 
namespace, table, mob
-logical region, and column family. In general that means a path structured 
like:
+=== MOB Troubleshooting
 
-----
-%HBase Root Dir%/mobdir/data/%namespace%/%table%/%logical region%/%column 
family%/
-----
+==== Adjusting the MOB cleaner's tolerance for new hfiles
 
-With default configs, an example table named 'some_table' in the
-default namespace with a MOB enabled column family named 'foo' this HDFS 
directory would be
+The MOB cleaner chore ignores all MOB hfiles that were created more recently 
than an hour prior to
+the start of the chore to ensure we don't miss the reference metadata from the 
corresponding regular
+hfile. Without this safety check it would be possible for the cleaner chore to 
see a MOB hfile for
+an in progress flush or compaction and prematurely archive the MOB data. This 
default buffer should
+be sufficient for normal use.
 
-----
-/hbase/mobdir/data/default/some_table/372c1b27e3dc0b56c3a031926e5efbe9/foo/
-----
-
-These MOB hfiles are maintained by special chores in the HBase Master rather 
than by any individual
-Region Server. Specifically those chores take care of enforcing TTLs and 
compacting them. Note that
-this compaction is primarily a matter of controlling the total number of files 
in HDFS because our
-operational assumptions for MOB data is that it will seldom update or delete.
+You will need to adjust the tolerance if it takes longer than an hour for the 
two HDFS move
 
 Review comment:
   The paragraph isn't about the time the master spends in the cleaning 
chore. It's about how long the region server doing a flush or compaction 
takes.
   
   For example, consider a case where something is *very* wrong with HDFS, such 
that the RS doing a flush commits the MOB file and then pauses for > 1 hour 
before committing the reference hfile. If the master did its "min age" 
calculation an hour after the MOB file commit and finished iterating over the 
reference files prior to the commit of the reference hfile, then the cleaner 
chore would believe it is fine to delete the MOB hfile.
   
   Similarly, if the compaction process commits a MOB hfile (due to the size 
limitation) and then spends another 3 hours writing other MOB hfiles before 
finally closing out and committing the reference hfile, then the master cleaner 
chore could have the timing work out so that some of the early MOB hfiles are 
flagged as fine to delete.
   
   It's definitely edge-case stuff. That's why it's in the troubleshooting 
section.
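   
   To make the timing concrete, here is a rough Python sketch of the race. It 
is only a toy model, not the actual HBase cleaner code: the function name, its 
parameters, and the timestamps are all made up for illustration, and it assumes 
the cleaner fixes its "min age" cutoff when the chore starts and sees only the 
reference hfiles committed before it finishes scanning.

```python
from datetime import datetime, timedelta

# Default tolerance described in the doc text: ignore MOB hfiles newer than 1 hour.
MIN_AGE = timedelta(hours=1)


def cleaner_would_archive(mob_commit, ref_commit, chore_start, scan_done,
                          min_age=MIN_AGE):
    """Toy model (hypothetical, not HBase code) of the cleaner's decision.

    mob_commit:  when the region server committed the MOB hfile
    ref_commit:  when it committed the referencing hfile (None if never)
    chore_start: when the master computed its "min age" cutoff
    scan_done:   when the master finished iterating over reference hfiles
    """
    # The MOB hfile is only considered if it is older than the cutoff.
    old_enough = mob_commit <= chore_start - min_age
    # The reference only protects the MOB hfile if the scan could see it.
    reference_seen = ref_commit is not None and ref_commit <= scan_done
    return old_enough and not reference_seen


t0 = datetime(2020, 3, 1, 0, 0)

# Healthy flush: reference hfile lands seconds after the MOB hfile.
healthy = cleaner_would_archive(
    t0, t0 + timedelta(seconds=5),
    chore_start=t0 + timedelta(hours=2), scan_done=t0 + timedelta(hours=2, minutes=5))

# Stalled flush: reference hfile lands 90 minutes after the MOB hfile,
# but the chore started at 61 minutes and finished scanning at 65 minutes.
stalled = cleaner_would_archive(
    t0, t0 + timedelta(minutes=90),
    chore_start=t0 + timedelta(minutes=61), scan_done=t0 + timedelta(minutes=65))

# Same stalled flush with the tolerance raised to 2 hours.
stalled_with_longer_tolerance = cleaner_would_archive(
    t0, t0 + timedelta(minutes=90),
    chore_start=t0 + timedelta(minutes=61), scan_done=t0 + timedelta(minutes=65),
    min_age=timedelta(hours=2))
```

   In this model only the stalled flush with the default tolerance ends up 
flagged for archiving; raising the tolerance above the stall duration closes 
the window.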

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
