[ 
https://issues.apache.org/jira/browse/HBASE-11339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14118659#comment-14118659
 ] 

Jonathan Hsieh commented on HBASE-11339:
----------------------------------------

Lars and I chatted back on Thursday, and we agree that a solution for truly 
large objects (larger than the default hdfs block) would require a new 
streaming API.  We also talked about the importance of keeping configuration 
and operations simple.  

[~lhofhansl], can you describe the scale and the load for the hybrid MOB 
storage system you have or are working on?  It is new to me, and I'd be very 
curious how things like backups and bulk loads are handled in that system.  

Details below:

----

The fundamental problem this MOB solution addresses is the balance between 
the hdfs small file problem on one side and, on the other, write 
amplification and the performance variability it causes.  Objects with 
values larger than 64MB are out of scope, and we are in agreement that those 
need a streaming api and that an hdfs+hbase solution seems more reasonable 
there.  The goal with the MOB mechanism is to show demonstrable improvements 
in predictability and scalability for objects that are too small for hdfs to 
make sense and large enough that hbase is non-optimal due to splits and 
compactions.

In some workloads I've been seeing (timeseries with large sensor dumps, mini 
indexes, or binary documents or images as blob cells), this feature would 
potentially be very helpful.


bq. We still cannot stream the mobs. They have to be materialized at both the 
server and the client (going by the documentation here)

This is true; however this is *not* the design point we are trying to solve.

bq. As I state above this can be achieved with a HBase/HDFS client alone and 
better: Store mobs up to a certain size by value in HBase (say 5 or 10mb or 
so), everything larger goes straight into HDFS with a reference only in HBase. 
This addresses both the many small files issue in HDFS (only files larger than 
5-10mb would end up in HDFS) and the streaming problem for large files in 
HBase. Also as outlined by me in June we can still make this "transactional" in 
the HBase sense with a three step protocol: (1) write reference row, (2) stream 
blob to HDFS, (3) record location in HDFS (that's the commit). This solution is 
also missing from the initial PDF in the "Existing Solutions" section.

Back in June, JingCheng responded to your comments but never got feedback on 
how you'd manage the small files problem.

Also, two HDFS blob + HBase metadata solutions are explicitly mentioned in 
section 4.1.2 (v4 design doc) with pros and cons.  The solution you propose 
is actually the first described hdfs+hbase approach -- though its pros and 
cons don't go into the particulars of the commit protocol (the two-phase 
prepare-then-commit you outline makes sense).  The largest concern is in the 
doc as well -- the HDFS small files problem.
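
To make the comparison concrete, here is a rough sketch of what that 
client-side three-step protocol could look like.  This is not code from 
either design; the table name, column family, qualifiers, and paths are 
invented for illustration, and error/cleanup handling is omitted.

{code:java}
// Hypothetical client-side hdfs+hbase blob write following the quoted
// three-step protocol: (1) write a reference row, (2) stream the blob to
// HDFS, (3) record the HDFS location in HBase (the commit).
import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class HdfsHBaseBlobWriter {
  private static final byte[] FAMILY = Bytes.toBytes("f");

  public static void writeBlob(Configuration conf, String blobId,
      InputStream blobData) throws Exception {
    HTable table = new HTable(conf, "blobs");   // invented table name
    FileSystem fs = FileSystem.get(conf);
    try {
      // (1) Prepare: write the reference row first; a row without a
      //     committed location is treated as in-flight by readers.
      Put prepare = new Put(Bytes.toBytes(blobId));
      prepare.add(FAMILY, Bytes.toBytes("state"), Bytes.toBytes("PENDING"));
      table.put(prepare);

      // (2) Stream the blob into its own HDFS file -- this is exactly
      //     where the one-file-per-mob small-files concern comes from.
      Path blobPath = new Path("/blobs/" + blobId);
      FSDataOutputStream out = fs.create(blobPath);
      byte[] buf = new byte[64 * 1024];
      int n;
      while ((n = blobData.read(buf)) != -1) {
        out.write(buf, 0, n);
      }
      out.close();

      // (3) Commit: record the final HDFS location in the reference row.
      Put commit = new Put(Bytes.toBytes(blobId));
      commit.add(FAMILY, Bytes.toBytes("location"),
          Bytes.toBytes(blobPath.toString()));
      commit.add(FAMILY, Bytes.toBytes("state"), Bytes.toBytes("COMMITTED"));
      table.put(commit);
    } finally {
      table.close();
    }
  }
}
{code}

Even with that protocol working, every mob still lands in its own HDFS file, 
which is what the numbers below are about.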

Having a separate hdfs file per 100KB-10MB value is not a scalable or 
long-term solution.  Let's do an example -- let's say we wrote 200M 500KB 
blobs.  This ends up being 100TB of data.

* Using hbase as is, we end up with objects potentially in 10,000 10GB files.
** Along the way, we'd end up splitting and compacting every 20,000 objects, 
rewriting large chunks of the 100TB over and over.
* Using hdfs+hbase, we'd end up with 200M files -- a lot more files than the 
optimal ~10,000 files that the vanilla hbase approach could eventually 
compact down to.
** 200M files would consume ~200GB of ram for block records in the NN (200M 
files * 3 block replicas per file * ~300 bytes per hdfs inode + blockinfo 
[1][2] -> ~200GB), which is definitely uncharted territory for NNs -- a 
place where there would likely be GC problems and other negative effects.
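
For anyone who wants to redo the back-of-envelope math (the ~300 bytes per 
inode/blockinfo record is the rough figure from [1][2], not a measured 
number):

{code:java}
// Back-of-envelope NameNode heap estimate for one-file-per-mob storage.
// ~300 bytes per inode/blockinfo record is the rough figure from [1][2];
// real overhead varies with HDFS version and blocks per file.
public class NnHeapEstimate {
  public static void main(String[] args) {
    long blobs = 200_000_000L;          // 200M blobs, one HDFS file each
    long blobSizeBytes = 500L * 1024;   // 500KB per blob
    long replicas = 3;                  // block replicas per file
    long bytesPerRecord = 300;          // ~300 bytes per inode/blockinfo

    long totalData = blobs * blobSizeBytes;           // ~100TB of user data
    long nnHeap = blobs * replicas * bytesPerRecord;  // ~180GB of NN heap

    System.out.printf("data ~%d TiB, NN heap ~%d GiB%n",
        totalData >> 40, nnHeap >> 30);
  }
}
{code}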

bq. "Replication" here can still happen by the client, after all, each file 
successfully stored in HDFS has a reference in HBase.

The design doc approach actually minimizes the operational changes required 
to store MOBs.  From an operational point of view, a user could just enable 
the optional feature and take advantage of its benefits.
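
As a sketch of what "just enable the optional feature" could look like for 
an operator -- the exact attribute names and API are whatever the final 
patch settles on, and the table and family names here are made up:

{code:java}
// Hypothetical example of enabling MOB on a column family at table-creation
// time.  The IS_MOB / MOB_THRESHOLD attribute names are illustrative and
// depend on the final patch.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class EnableMobExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);

    HTableDescriptor table = new HTableDescriptor(TableName.valueOf("images"));
    HColumnDescriptor cf = new HColumnDescriptor("d");
    // Cells larger than the threshold go to the MOB path; smaller cells
    // stay inline in the region as usual.
    cf.setValue("IS_MOB", "true");
    cf.setValue("MOB_THRESHOLD", String.valueOf(100 * 1024));  // 100KB
    table.addFamily(cf);

    admin.createTable(table);
    admin.close();
  }
}
{code}

Everything else (replication, bulk load, snapshots, security) stays the 
standard HBase workflow, which is the point made below.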

The proposed hdfs+hbase approach actually pushes more complexity into the 
replication mechanism.  For replication to work, the source cluster would 
now need new mechanisms to open the mobs on hdfs and ship them off to the 
other cluster.  The MOB approach is simpler operationally and in code 
because it can use the normal replication mechanism.

The proposed hdfs+hbase approach would need updates and a new bulk load 
mechanism.  The MOB approach is simpler operationally and in code because it 
would use normal bulk loads, and compactions would eventually push the mobs 
out (same IO cost).

The proposed hdfs+hbase approach would need updates to properly handle table 
snapshots, restores of table snapshots, and backups.  Naively it seems like 
we'd have to do a lot of NN operations to back up the mobs (1 per mob).  
We'd also need to build new tools to manage export and copy table 
operations.  The MOB approach is simpler because the copy/export table 
mechanisms remain the same, and we can use the same archiver mechanism to 
manage mob file snapshots (mobs are essentially stored in a special region).

bq. We should use the tools what they were intended for. HBase for key value 
storage, HDFS for streaming large blobs.

We agree here.

The case for HDFS is weak: 1MB-5MB blobs are not large enough for HDFS -- in 
fact for HDFS this is nearly pathological.

The case for HBase is ok: We are writing key-values that tend to be larger 
than normal.  However, constant continuous ingest of 1MB-5MB MOBs will 
likely trigger splits more frequently, which will in turn trigger 
unavoidable major compactions.  This would occur even if bulk load 
mechanisms were used.

The case for HBase + MOB is stronger: We are writing key-values that tend to 
be large.  The bulk of the 1MB-5MB MOB data is written off to the MOB path.  
The metadata for the mob (let's say 100 bytes per mob) is relatively small, 
and thus compactions and splits will be much rarer (10000x less frequent) 
than with the hbase-only approach.  If bulk loads are done, an initial 
compaction would separate out the mob data and keep the region relatively 
small.
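
The rough arithmetic behind that "10000x" claim, with illustrative numbers 
only:

{code:java}
// Illustrative only: a ~1MB mob value vs the ~100 byte reference cell that
// stays in the region.  Splits and compactions see roughly this much less
// data per cell.
public class MobRatioEstimate {
  public static void main(String[] args) {
    long mobValueBytes = 1024L * 1024;  // ~1MB value pushed to the MOB path
    long refCellBytes = 100;            // ~100 byte reference kept in-region

    System.out.println("in-region data per cell shrinks ~"
        + (mobValueBytes / refCellBytes) + "x");  // ~10,000x
  }
}
{code}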

bq. Just saying using one client API for client convenience is not a reason to 
put all of this into HBase. A client can easily speak both HBase and HDFS 
protocols.

The tradeoff being made by the hdfs+hbase approach is opting for more 
operational complexity in exchange for implementation simplicity.  With the 
hdfs+hbase approach we'd also introduce new security issues -- users in 
hbase would now have the ability to modify the file system directly, and 
we'd have to keep users and credentials in sync on both hdfs and hbase.  
With the MOB approach we just rely on HBase's security mechanisms, and 
should be able to have per-cell ACLs, vis tags, and all the rest.

Given the choice of 1) making hbase simpler to use by adding some internal 
complexity or 2) making hbase more operationally difficult by adding new 
external processes and requiring external integrations to manage parts of its 
data, we should opt for 1.  Making HBase easier to use by removing knobs or 
making knobs as simple as possible should be the priority.

bq. (Subjectively) I do not like the complexity of this as seen by the various 
discussions here. That part is just my $0.02 of course.

We agree about not liking complexity.  However, the discussion process was 
public and we described and knocked down several strawmen.  I actually 
initially took the side against adding this feature, but have been convinced 
that when complete, this would have light operator impact and actually be 
less complex than a full solution that uses the hybrid hdfs+hbase approach.

[1] https://issues.apache.org/jira/browse/HDFS-6658
[2] 
https://issues.apache.org/jira/secure/attachment/12651408/Block-Manager-as-a-Service.pdf
 (see RAM Explosion section)

> HBase MOB
> ---------
>
>                 Key: HBASE-11339
>                 URL: https://issues.apache.org/jira/browse/HBASE-11339
>             Project: HBase
>          Issue Type: Umbrella
>          Components: regionserver, Scanners
>            Reporter: Jingcheng Du
>            Assignee: Jingcheng Du
>         Attachments: HBase MOB Design-v2.pdf, HBase MOB Design-v3.pdf, HBase 
> MOB Design-v4.pdf, HBase MOB Design.pdf, MOB user guide.docx, MOB user 
> guide_v2.docx, hbase-11339-in-dev.patch
>
>
>   It's quite useful to save medium binary data like images and documents 
> into Apache HBase. Unfortunately, directly saving binary MOBs (medium 
> objects) to HBase leads to worse performance because of the frequent 
> splits and compactions.
>   In this design, the MOB data is stored in a more efficient way, which 
> keeps a high write/read performance and guarantees data consistency in 
> Apache HBase.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
