[
https://issues.apache.org/jira/browse/HBASE-11339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14119489#comment-14119489
]
Lars Hofhansl commented on HBASE-11339:
---------------------------------------
bq. Back in June, JingCheng's response to your comments never got feedback on
how you'd manage the small files problem.
To be fair, my comment itself addressed that by saying small blobs are stored
by *value* in HBase, and only large blobs in HDFS. We can store a lot of 10MB
files in HDFS (in the worst case scenario that's 200m x 10mb = 2pb), and if
that's not enough, we can dial up the threshold.
It seems nobody understood what I was suggesting. Depending on the use case
and data distribution you pick a threshold X. Blobs with a size < X are stored
directly in HBase as a column value. Blobs >= X are stored in HDFS with a
reference in HBase, using the 3-phase approach.
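To make this concrete, here is a minimal client-side sketch of the write path.
Everything in it is hypothetical: the table/column names, the 10MB threshold,
and the HDFS path scheme are made up, and it collapses the 3-phase write into
a single put for brevity (so it ignores the orphaned-file cleanup the real
3-phase sequence is there for).
{code:java}
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class BlobClient {
  static final long THRESHOLD = 10L * 1024 * 1024; // X, tuned per use case
  static final byte[] CF  = Bytes.toBytes("b");
  static final byte[] VAL = Bytes.toBytes("value"); // by-value column
  static final byte[] REF = Bytes.toBytes("ref");   // by-reference column

  public static void putBlob(Connection conn, FileSystem fs,
                             byte[] row, byte[] blob) throws Exception {
    try (Table table = conn.getTable(TableName.valueOf("blobs"))) {
      Put put = new Put(row);
      if (blob.length < THRESHOLD) {
        // Small blob: stored directly in HBase as a column value.
        put.addColumn(CF, VAL, blob);
      } else {
        // Large blob: write the bytes to an HDFS file first...
        Path p = new Path("/blobs/" + Bytes.toStringBinary(row));
        try (FSDataOutputStream out = fs.create(p)) {
          out.write(blob);
        }
        // ...then store only the reference (the path) in HBase.
        put.addColumn(CF, REF, Bytes.toBytes(p.toString()));
      }
      table.put(put);
    }
  }
}
{code}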
bq. there are two HDFS blob + HBase metadata solutions explicitly mentioned
in section 4.1.2 (v4 design doc) with pros and cons
True, but as I stated, the "store small blobs by value and only large ones by
reference" solution is not mentioned in there.
bq. The solution you propose is actually the first described hdfs+hbase approach
No, it's not... That approach says either all blobs go into HBase or all
blobs go into HDFS... See above. Small blobs would be stored directly in
HBase, not in HDFS. That's key; nobody wants to store 100k or 1mb files
directly in HDFS.
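The read side abstracts just as easily in the client. Again a sketch only,
under the same hypothetical table/column layout as above: return the by-value
column if present, otherwise follow the HDFS reference.
{code:java}
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class BlobReader {
  static final byte[] CF  = Bytes.toBytes("b");
  static final byte[] VAL = Bytes.toBytes("value");
  static final byte[] REF = Bytes.toBytes("ref");

  public static byte[] getBlob(Connection conn, FileSystem fs, byte[] row)
      throws Exception {
    try (Table table = conn.getTable(TableName.valueOf("blobs"))) {
      Result r = table.get(new Get(row));
      byte[] value = r.getValue(CF, VAL);
      if (value != null) {
        return value; // small blob, stored by value in HBase
      }
      byte[] ref = r.getValue(CF, REF);
      if (ref == null) {
        return null;  // no blob for this row
      }
      // Large blob: follow the reference into HDFS (assumes it fits in memory).
      Path p = new Path(Bytes.toString(ref));
      byte[] buf = new byte[(int) fs.getFileStatus(p).getLen()];
      try (FSDataInputStream in = fs.open(p)) {
        in.readFully(buf);
      }
      return buf;
    }
  }
}
{code}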
bq. We have total 3 +1s for that Jira after many rounds of review rework. Can
get it committed tomorrow IST unless objections...?
We won't get this committed until we finish this discussion. So consider this
my -1 until then.
Going by the comments, the use case is only 1-5mb files (definitely less than
64mb), correct? That changes the discussion, but it looks to me like the use
case is now limited to a single, carefully constructed scenario (200m x 500k
files) in which this change might be useful. I.e., pick the blob size just
right, and pick the size distribution of the files just right, and this makes
sense.
In my approach one can dial the by-value/by-reference threshold up or down as
needed. And I do not even see the need for M/R.
I do agree with all of the following:
* snapshots are harder
* bulk load is harder
* backup/restore/replication is harder
Yet, all of that is possible with a client-only solution and could be
abstracted there.
I'll also admit that our blob storage tool is not finished yet, and that for
its use case we don't need replication or backup, as it will itself be the
backup solution for another very large data store.
Are you guys absolutely... 100%... positive that this cannot be done in any
other way and has to be done this way? That we cannot store files up to a
certain size as values in HBase and larger files in HDFS? And that there is no
good threshold value for this?
> HBase MOB
> ---------
>
> Key: HBASE-11339
> URL: https://issues.apache.org/jira/browse/HBASE-11339
> Project: HBase
> Issue Type: Umbrella
> Components: regionserver, Scanners
> Reporter: Jingcheng Du
> Assignee: Jingcheng Du
> Attachments: HBase MOB Design-v2.pdf, HBase MOB Design-v3.pdf, HBase
> MOB Design-v4.pdf, HBase MOB Design.pdf, MOB user guide.docx, MOB user
> guide_v2.docx, hbase-11339-in-dev.patch
>
>
> It's quite useful to save medium-sized binary data like images and
> documents into Apache HBase. Unfortunately, directly saving binary MOBs
> (medium objects) in HBase leads to worse performance because of the
> frequent splits and compactions.
> In this design, the MOB data is stored in a more efficient way, which
> keeps high write/read performance and guarantees data consistency in
> Apache HBase.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)