[
https://issues.apache.org/jira/browse/HBASE-11339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14118659#comment-14118659
]
Jonathan Hsieh commented on HBASE-11339:
----------------------------------------
Lars and I chatted back on Thursday, and we agree that the solution for truly
large objects (larger than the default hdfs block size) would require a new
streaming API. We also talked about the importance of keeping configuration and
operations simple.
[~lhofhansl], can you describe the scale and the load for the hybrid MOB
storage system you have or are working on? It is new to me, and I'd be very
curious how things like backups and bulk loads are handled in that system.
Details below:
----
The fundamental problem this MOB solution addresses is the balance between the
hdfs small-file problem on one side and write amplification (and the
performance variability it causes) on the other. Objects with values larger
than ~64MB are out of scope; we are in agreement that those need a streaming
api and that an hdfs+hbase solution seems more reasonable there. The goal of
the MOB mechanism is to show demonstrable improvements in predictability and
scalability for objects that are too small for hdfs to make sense but for which
hbase is non-optimal due to splits and compactions.
In some workloads I've been seeing (timeseries large sensor dumps, mini
indexes, or binary documents or images stored as blob cells), this feature
would potentially be very helpful.
bq. We still cannot stream the mobs. They have to be materialized at both the
server and the client (going by the documentation here)
This is true; however, it is *not* the design point we are trying to address.
bq. As I state above this can be achieved with a HBase/HDFS client alone and
better: Store mobs up to a certain size by value in HBase (say 5 or 10mb or
so), everything larger goes straight into HDFS with a reference only in HBase.
This addresses both the many small files issue in HDFS (only files larger than
5-10mb would end up in HDFS) and the streaming problem for large files in
HBase. Also as outlined by me in June we can still make this "transactional" in
the HBase sense with a three step protocol: (1) write reference row, (2) stream
blob to HDFS, (3) record location in HDFS (that's the commit). This solution is
also missing from the initial PDF in the "Existing Solutions" section.
Back in June, JingCheng responded to your comments, but that response never got
feedback on how you'd manage the small-files problem.
Also, two HDFS blob + HBase metadata solutions are explicitly discussed, with
pros and cons, in section 4.1.2 of the v4 design doc. The solution you propose
is actually the first hdfs+hbase approach described there -- though its pros
and cons don't go into the particulars of the commit protocol (the two-phase
prepare-then-commit you outline does make sense). The largest concern is in the
doc as well -- the HDFS small-files problem.
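For concreteness, here is a minimal sketch of the three-step protocol quoted
above, written against the current HBase and HDFS client APIs. The table name,
column family, qualifiers, and HDFS path layout are hypothetical, and failure
handling/cleanup of orphaned blobs is omitted:
{code:java}
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class BlobCommitSketch {
  private static final byte[] META_CF = Bytes.toBytes("m");

  public static void writeBlob(Connection conn, FileSystem fs,
                               String rowKey, byte[] blob) throws Exception {
    try (Table table = conn.getTable(TableName.valueOf("blobs"))) {
      // (1) write the reference row, marking the blob as pending
      Put prepare = new Put(Bytes.toBytes(rowKey));
      prepare.addColumn(META_CF, Bytes.toBytes("state"), Bytes.toBytes("PENDING"));
      table.put(prepare);

      // (2) stream the blob into HDFS
      Path blobPath = new Path("/blobs/" + rowKey);
      try (FSDataOutputStream out = fs.create(blobPath)) {
        out.write(blob);
      }

      // (3) record the HDFS location in the reference row -- this is the commit
      Put commit = new Put(Bytes.toBytes(rowKey));
      commit.addColumn(META_CF, Bytes.toBytes("location"),
          Bytes.toBytes(blobPath.toString()));
      commit.addColumn(META_CF, Bytes.toBytes("state"), Bytes.toBytes("COMMITTED"));
      table.put(commit);
    }
  }
}
{code}
Note that every blob written this way creates at least one new inode and block
record in the NN, which is exactly the small-files concern discussed next.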
Having a separate hdfs file per 100KB-10MB value is not a scalable or long-term
solution. Let's do an example (a quick back-of-envelope check of the numbers is
sketched after the list) -- say we write 200M 500KB blobs. That ends up being
~100TB of data.
* Using hbase as is, we end up with the objects in potentially 10,000 10GB files.
** Along the way, we'd end up splitting and compacting every ~20,000 objects,
rewriting large chunks of the 100TB over and over.
* Using hdfs+hbase, we'd end up with 200M files -- a lot more files than the
optimal ~10,000 files the vanilla hbase approach could eventually compact down to.
** 200M files would consume ~200GB of RAM for file and block records in the NN
(200M files * 3 block replicas per file * ~300 bytes per hdfs inode + blockinfo
[1][2] -> ~200GB), which is definitely uncharted territory for NNs -- a place
where there would likely be GC problems and other negative effects.
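The arithmetic behind those numbers, as a rough order-of-magnitude sketch (all
constants are the estimates used above, not measurements):
{code:java}
public class BlobStoreEstimate {
  public static void main(String[] args) {
    double blobs = 200e6;                  // 200M blobs
    double blobBytes = 500e3;              // ~500KB each
    double totalBytes = blobs * blobBytes; // ~100TB of data

    double hfileBytes = 10e9;                    // ~10GB HFiles
    double hbaseFiles = totalBytes / hfileBytes; // ~10,000 files for vanilla hbase

    double replicas = 3;            // block replicas per file
    double nnBytesPerRecord = 300;  // ~300 bytes of NN heap per inode/block record
    double nnHeap = blobs * replicas * nnBytesPerRecord; // ~200GB of NN heap

    System.out.printf("total data:   ~%.0f TB%n", totalBytes / 1e12);
    System.out.printf("hbase hfiles: ~%.0f%n", hbaseFiles);
    System.out.printf("NN heap (one hdfs file per blob): ~%.0f GB%n", nnHeap / 1e9);
  }
}
{code}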
bq. "Replication" here can still happen by the client, after all, each file
successfully stored in HDFS has a reference in HBase.
The design doc approach actually minimizes the operational changes required to
store MOBs. From an operational point of view, users could just enable the
optional feature on a column family and take advantage of its benefits.
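A sketch of how enabling the feature might look from the admin API; the exact
attribute and method names are defined by the in-progress patch, so treat
setMobEnabled / setMobThreshold and the threshold value as illustrative
assumptions:
{code:java}
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;

public class MobTableSketch {
  public static HTableDescriptor mobTable() {
    HTableDescriptor table = new HTableDescriptor(TableName.valueOf("documents"));
    HColumnDescriptor cf = new HColumnDescriptor("d");
    cf.setMobEnabled(true);           // route large cells through the MOB path
    cf.setMobThreshold(100 * 1024L);  // cells above ~100KB are treated as MOBs
    table.addFamily(cf);
    return table;                     // pass to Admin.createTable(...)
  }
}
{code}
Everything else (puts, gets, scans, replication, bulk load) stays on the normal
client API, which is the operational simplicity being argued for here.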
The proposed hdfs+hbase approach actually pushes more complexity into the
replication mechanism. For replication to work, the source cluster would now
need to add mechanisms to open the mobs on hdfs and ship them off to the
other cluster. The MOB approach is simpler operationally and in code because
it can use the normal replication mechanism.
The proposed hdfs+hbase approach would need updates and a new bulk load
mechanism. The MOB approach is simpler operationally and in code because it
would use normal bulk loads, and compactions would eventually push the mobs
out (at the same IO cost).
The proposed hdfs+hbase approach would need updates to properly handle table
snapshots, restoring table snapshots, and backups. Naively it seems we'd have
to do a lot of NN operations to back up the mobs (1 per mob). Also, we'd need
to build new tools to manage export and copy table operations as well. The MOB
approach is simpler because the copy/export table mechanisms remain the same,
and we can use the same archiver mechanism to manage mob file snapshots (mobs
are essentially stored in a special region).
bq. We should use the tools what they were intended for. HBase for key value
storage, HDFS for streaming large blobs.
We agree here.
The case for HDFS is weak: 1MB-5MB blobs are not large enough for HDFS -- in
fact, for HDFS this is nearly pathological.
The case for HBase is ok: we are writing key-values that tend to be larger than
normal. However, constant continuous ingest of 1MB-5MB MOBs will likely trigger
splits more frequently, which will in turn trigger unavoidable major
compactions. This would occur even if bulk load mechanisms were used.
The case for HBase + MOB is stronger: we are writing key-values that tend to
be large, but the bulk of the 1MB-5MB MOB data is written off to the MOB path.
The metadata for a mob (let's say ~100 bytes per value) is relatively small,
so compactions and splits will be much rarer than in the hbase-only approach
(roughly 10,000x less frequent, since a 1MB value versus ~100 bytes of
reference metadata is about a 10,000:1 reduction in the data flushed through
the normal write path). If bulk loads are done, an initial compaction would
separate out the mob data and keep the region relatively small.
bq. Just saying using one client API for client convenience is not a reason to
put all of this into HBase. A client can easily speak both HBase and HDFS
protocols.
The tradeoff being made by the hdfs+hbase approach is opting for more
operational complexity in exchange for implementation simplicity. With the
hdfs+hbase approach we'd also introduce new security issues -- users in hbase
would now have the ability to modify the file system directly, and we'd have to
manage users and credentials on both hdfs and hbase and keep them in sync. With
the MOB approach we just rely on HBase's security mechanisms and should be able
to have per-cell ACLs, visibility tags, and all the rest.
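Illustrative only: because a MOB value goes through the normal HBase write
path, the usual cell-level security options still apply to it (this assumes
the ACL and visibility-label coprocessors are enabled on the cluster; the
user, label expression, and column names below are made up):
{code:java}
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.security.access.Permission;
import org.apache.hadoop.hbase.security.visibility.CellVisibility;
import org.apache.hadoop.hbase.util.Bytes;

public class MobSecuritySketch {
  public static Put secureMobPut(byte[] row, byte[] imageBytes) {
    Put put = new Put(row);
    put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("img"), imageBytes);
    // per-cell ACL: only user "alice" may read this cell
    put.setACL("alice", new Permission(Permission.Action.READ));
    // visibility expression checked against the reader's authorizations
    put.setCellVisibility(new CellVisibility("PII&INTERNAL"));
    return put;
  }
}
{code}
With blobs stored directly in hdfs, equivalent protection would have to be
re-implemented with hdfs permissions and kept consistent with the hbase ACLs.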
Given the choice of 1) making hbase simpler to use by adding some internal
complexity or 2) making hbase more operationally difficult by adding new
external processes and requiring external integrations to manage parts of its
data, we should opt for 1. Making HBase easier to use by removing knobs or
making knobs as simple as possible should be the priority.
bq. (Subjectively) I do not like the complexity of this as seen by the various
discussions here. That part is just my $0.02 of course.
We agree about not liking complexity. However, the discussion process was
public and we described and knocked down several strawmen. I actually initially
took the side against adding this feature, but I have been convinced that, when
complete, it would have light operator impact and actually be less complex than
a full solution using the hybrid hdfs+hbase approach.
[1] https://issues.apache.org/jira/browse/HDFS-6658
[2]
https://issues.apache.org/jira/secure/attachment/12651408/Block-Manager-as-a-Service.pdf
(see RAM Explosion section)
> HBase MOB
> ---------
>
> Key: HBASE-11339
> URL: https://issues.apache.org/jira/browse/HBASE-11339
> Project: HBase
> Issue Type: Umbrella
> Components: regionserver, Scanners
> Reporter: Jingcheng Du
> Assignee: Jingcheng Du
> Attachments: HBase MOB Design-v2.pdf, HBase MOB Design-v3.pdf, HBase
> MOB Design-v4.pdf, HBase MOB Design.pdf, MOB user guide.docx, MOB user
> guide_v2.docx, hbase-11339-in-dev.patch
>
>
> It's quite useful to save medium-sized binary data like images and documents
> into Apache HBase. Unfortunately, directly saving binary MOBs (medium
> objects) to HBase leads to worse performance because of the frequent splits
> and compactions.
> In this design, the MOB data is stored in a more efficient way, which
> keeps high write/read performance and guarantees data consistency in
> Apache HBase.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)