[ 
https://issues.apache.org/jira/browse/HBASE-11339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14119741#comment-14119741
 ] 

Jingcheng Du commented on HBASE-11339:
--------------------------------------

Thanks Lars for the comments. [~lhofhansl]

bq. Going by the comments the use case is only 1-5mb files (definitely less 
than 64mb), correct? That changes the discussion, but it looks to me that now 
the use case is limited to a single scenario and carefully constructed (200m x 
500k files) so that this change might be useful. I.e. pick a blob size just 
right, and pick the size distribution of the files just right and this makes 
sense.
The client solution could work well too in certain cases with larger blobs, 
and we could try leveraging the current MOB design approach for smaller KV 
values.
In some usage scenarios the value size is almost fixed, for example pictures 
taken by traffic-bureau cameras, contracts between banks and their customers, 
CT (Computed Tomography) records in hospitals, etc. The scope might be 
limited, but it's really useful.
As mentioned, the client solution saves records larger than 10MB to HDFS and 
writes the others directly to HBase. Turning that threshold down would make 
inefficient use of HDFS (lots of small files) in the client solution, so such 
records end up being saved directly in HBase in this case.
And even for values smaller than 10MB, the MOB implementation shows big 
performance improvements over saving those records directly into HBase.
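
For comparison, below is a minimal sketch of that client-side alternative, 
written against the current HBase client API; the 10MB cut-off, table/family 
names, and HDFS path layout are illustrative only, not part of any patch here:

{code:java}
import java.io.IOException;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class ClientSideBlobWriter {
  // Hypothetical cut-off mirroring the 10MB threshold discussed above.
  private static final long THRESHOLD = 10L * 1024 * 1024;

  public void write(Connection conn, FileSystem fs,
                    String rowKey, byte[] blob) throws IOException {
    try (Table table = conn.getTable(TableName.valueOf("blobs"))) {
      Put put = new Put(Bytes.toBytes(rowKey));
      if (blob.length > THRESHOLD) {
        // Big value: write the bytes to an HDFS file and keep only a
        // reference in HBase. Keeping the two stores consistent (deletes,
        // replication, snapshots) is now the client's problem.
        Path path = new Path("/blobstore/" + rowKey);
        try (FSDataOutputStream out = fs.create(path)) {
          out.write(blob);
        }
        put.addColumn(Bytes.toBytes("f"), Bytes.toBytes("ref"),
            Bytes.toBytes(path.toString()));
      } else {
        // Small value: store it inline as a normal HBase cell.
        put.addColumn(Bytes.toBytes("f"), Bytes.toBytes("val"), blob);
      }
      table.put(put);
    }
  }
}
{code}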

The MOB path has a threshold as well: by this threshold a cell is stored 
either as a plain value or as a reference into a MOB file. The default is 
100KB now. Users can change it, and we also have a compactor to handle cells 
that cross the threshold (moving data from MOB files back into HBase, and 
vice versa).
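
For illustration, enabling MOB on a column family might look like the 
following; the setMobEnabled/setMobThreshold setters follow the descriptor 
API proposed in the design docs, so treat this as a sketch rather than the 
final API:

{code:java}
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class CreateMobTable {
  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Admin admin = conn.getAdmin()) {
      HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("blobs"));
      HColumnDescriptor family = new HColumnDescriptor("f");
      family.setMobEnabled(true);          // store oversized cells as MOB refs
      family.setMobThreshold(100L * 1024); // 100KB, the default mentioned above
      desc.addFamily(family);
      admin.createTable(desc);
    }
  }
}
{code}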

As Jon said, we'll revamp the MOB compaction and get rid of the MR dependency.

bq. Yet, all that is possible to do with a client only solution and could be 
abstracted there.
Implementing snapshots and replication in a client solution is harder and 
adds complexity on the client side; keeping HBase and the HDFS files 
consistent during replication is a real problem.
Implementing this on the server side is a little easier: the MOB work 
includes a snapshot implementation, and it supports replication naturally 
because the MOB data are saved in the WAL.

bq. (Subjectively) I do not like the complexity of this as seen by the various 
discussions here. That part is just my $0.02 of course.
Yes, it's complex, but the pieces are meaningful and valuable.
The patches provide read/write, compaction, snapshot, and sweep support for 
MOB files. Even if HBase decides to implement a streaming feature in the 
future, the read, compaction, and snapshot parts would probably still be 
useful.

Thanks!


> HBase MOB
> ---------
>
>                 Key: HBASE-11339
>                 URL: https://issues.apache.org/jira/browse/HBASE-11339
>             Project: HBase
>          Issue Type: Umbrella
>          Components: regionserver, Scanners
>            Reporter: Jingcheng Du
>            Assignee: Jingcheng Du
>         Attachments: HBase MOB Design-v2.pdf, HBase MOB Design-v3.pdf, HBase 
> MOB Design-v4.pdf, HBase MOB Design.pdf, MOB user guide.docx, MOB user 
> guide_v2.docx, hbase-11339-in-dev.patch
>
>
>   It's quite useful to save medium-sized binary data like images and 
> documents into Apache HBase. Unfortunately, directly saving binary MOBs 
> (medium objects) to HBase leads to worse performance because of the 
> frequent splits and compactions they trigger.
>   In this design, the MOB data are stored in a more efficient way, which 
> keeps write/read performance high and guarantees data consistency in 
> Apache HBase.


