[
https://issues.apache.org/jira/browse/HBASE-11339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14119741#comment-14119741
]
Jingcheng Du commented on HBASE-11339:
--------------------------------------
Thanks for the comments, [~lhofhansl].
bq. Going by the comments the use case is only 1-5mb files (definitely less
than 64mb), correct? That changes the discussion, but it looks to me that now
the use case is limited to a single scenario and carefully constructed (200m x
500k files) so that this change might be useful. I.e. pick a blob size just
right, and pick the size distribution of the files just right and this makes
sense.
The client solution could work well too in certain cases with bigger blobs, and
we could try leveraging the current MOB design approach for smaller KV values.
In some usage scenarios the value size is almost fixed, for example pictures
taken by traffic-bureau cameras, contracts between banks and their customers,
CT (Computed Tomography) records in hospitals, etc. The set of scenarios might
be limited, but it's really useful.
As mentioned, the client solution saves records larger than 10MB to HDFS and
writes the others to HBase directly. Turning that threshold down would use HDFS
inefficiently (many small files), so in this case the records are better saved
directly in HBase instead.
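To make the comparison concrete, here is a minimal sketch of that client-side
thresholding (the helper class, family/qualifier names, and the /blobs path
layout are all made up for illustration):

{code:java}
import java.io.IOException;

import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class ClientSideBlobWriter {
  private static final long THRESHOLD = 10L * 1024 * 1024; // the 10MB cut-off discussed above
  private static final byte[] FAMILY = Bytes.toBytes("f");
  private static final byte[] DATA = Bytes.toBytes("data");
  private static final byte[] REF = Bytes.toBytes("ref");

  // Large blobs go to a separate HDFS file; HBase keeps only the path as a reference.
  public static void write(Table table, FileSystem fs, byte[] row, byte[] blob)
      throws IOException {
    Put put = new Put(row);
    if (blob.length > THRESHOLD) {
      // Hypothetical layout; assumes printable row keys.
      Path path = new Path("/blobs/" + Bytes.toString(row));
      try (FSDataOutputStream out = fs.create(path)) {
        out.write(blob);
      }
      put.addColumn(FAMILY, REF, Bytes.toBytes(path.toString()));
    } else {
      // Small enough: store the value in HBase directly.
      put.addColumn(FAMILY, DATA, blob);
    }
    table.put(put);
  }
}
{code}

Note that all the consistency work discussed below (snapshots, replication,
cleaning up orphan HDFS files) would fall on this client code.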
And even for values smaller than 10MB, the MOB implementation performs much
better than saving those records into HBase directly.
The MOB path has a threshold as well; by this threshold a cell is saved either
as a plain value or as a reference into a MOB file. We have a default value of
100KB for it now. Users can change it, and we also have a compactor to handle
such changes (move cells from MOB files back into HBase, and vice versa).
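For illustration, declaring a MOB-enabled family with that threshold could look
like the following (the setMobEnabled/setMobThreshold names follow the in-dev
patches; the table and family names are just examples):

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class CreateMobTable {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Admin admin = conn.getAdmin()) {
      HColumnDescriptor family = new HColumnDescriptor("f");
      family.setMobEnabled(true);         // cells in this family may be stored as MOBs
      family.setMobThreshold(100 * 1024); // the 100KB default mentioned above
      HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("pictures"));
      desc.addFamily(family);
      admin.createTable(desc);
    }
  }
}
{code}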
As Jon said, we'll revamp the MOB compaction and get rid of the MR dependency.
bq. Yet, all that is possible to do with a client only solution and could be
abstracted there.
Implementing snapshot and replication in a client solution is harder, and it
brings complexity to the client solution as well. Keeping HBase and the HDFS
files consistent during replication is a problem.
Implementing this on the server side is a little easier: the MOB work includes
a snapshot implementation, and it supports replication naturally because the
MOB data are saved in the WAL.
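As a small sketch of why replication comes for free (names illustrative):
because MOB cells travel through the normal write path, marking a MOB family
for replication is the same setScope call as for any other family.

{code:java}
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HConstants;

public class MobReplicationExample {
  // MOB cells go through the WAL like ordinary cells, so replication
  // needs no MOB-specific configuration on the column family.
  static HColumnDescriptor mobFamilyWithReplication(String name) {
    HColumnDescriptor family = new HColumnDescriptor(name);
    family.setMobEnabled(true);
    family.setMobThreshold(100 * 1024);
    family.setScope(HConstants.REPLICATION_SCOPE_GLOBAL);
    return family;
  }
}
{code}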
bq. (Subjectively) I do not like the complexity of this as seen by the various
discussions here. That part is just my $0.02 of course.
Yes, it's complex, but the features are meaningful and valuable.
The patches provide read/write, compaction, snapshot, and sweep features for
MOB files. Even if HBase decides to implement a streaming feature in the
future, the read, compaction, and snapshot parts would probably still be
useful.
Thanks!
> HBase MOB
> ---------
>
> Key: HBASE-11339
> URL: https://issues.apache.org/jira/browse/HBASE-11339
> Project: HBase
> Issue Type: Umbrella
> Components: regionserver, Scanners
> Reporter: Jingcheng Du
> Assignee: Jingcheng Du
> Attachments: HBase MOB Design-v2.pdf, HBase MOB Design-v3.pdf, HBase
> MOB Design-v4.pdf, HBase MOB Design.pdf, MOB user guide.docx, MOB user
> guide_v2.docx, hbase-11339-in-dev.patch
>
>
> It's quite useful to save medium-sized binary data like images and documents
> in Apache HBase. Unfortunately, directly saving such binary MOBs (medium
> objects) in HBase leads to worse performance because of the frequent splits
> and compactions they cause.
> In this design, the MOB data are stored in a more efficient way, which keeps
> high write/read performance and guarantees data consistency in Apache HBase.