[
https://issues.apache.org/jira/browse/HBASE-11339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117774#comment-14117774
]
Lars Hofhansl commented on HBASE-11339:
---------------------------------------
[[email protected]] and I talked about this at the HBase meetup...
I'm sorry to be the party pooper here, but this complexity and functionality
really does not belong in HBase, IMHO.
I still do not get the motivation for this. Here's why:
# We still cannot stream the mobs; they have to be materialized at both the
server and the client (going by the documentation here).
# As I state above, this can be achieved with an HBase/HDFS client alone, and
better: store mobs up to a certain size (say 5 or 10mb or so) by value in
HBase; everything larger goes straight into HDFS with only a reference in HBase.
This addresses both the many-small-files issue in HDFS (only files larger than
5-10mb would end up in HDFS) and the streaming problem for large files in
HBase. Also, as outlined by me in June, we can still make this "transactional" in
the HBase sense with a three-step protocol: (1) write the reference row, (2) stream
the blob to HDFS, (3) record the HDFS location in the row (that's the commit). This
solution is also missing from the initial PDF's "Existing Solutions" section.
# "Replication" here can still be handled by the client; after all, each file
successfully stored in HDFS has a reference in HBase.
# We should use the tools for what they were intended: HBase for key-value
storage, HDFS for streaming large blobs.
# Using one client API for convenience alone is *not* a reason to put all of
this into HBase. A client can easily speak both the HBase and HDFS protocols.
# (Subjectively) I do not like the complexity of this, as evidenced by the
various discussions here. That part is just my $0.02, of course.
This looks to me like a solution to a problem that we do not have.
Again I am sorry about being negative here, but we have to be careful what we
put into HBase and for what reasons.
Especially when there seems to be a *better* client-only solution (in the sense
that it can deal with larger files and allows streaming them).
If we need a solution for this, let's build one on top of HBase/HDFS. We
(Salesforce) are actually building a client-only solution for this; it's not
that difficult (I will see whether we can open source it - it might be too
entangled with our internals). With a simple protocol we can still allow data
locality for all blob reads (as much as the block distribution allows, at
least), etc.
[~jesse_yates], maybe you want to add here?
If we cannot store 10mb Cells in HBase, then that's something to address. The
fact that we cannot stream into and out of HBase needs to be addressed; that is
the real problem anyway.
> HBase MOB
> ---------
>
> Key: HBASE-11339
> URL: https://issues.apache.org/jira/browse/HBASE-11339
> Project: HBase
> Issue Type: Umbrella
> Components: regionserver, Scanners
> Reporter: Jingcheng Du
> Assignee: Jingcheng Du
> Attachments: HBase MOB Design-v2.pdf, HBase MOB Design-v3.pdf, HBase
> MOB Design-v4.pdf, HBase MOB Design.pdf, MOB user guide.docx, MOB user
> guide_v2.docx, hbase-11339-in-dev.patch
>
>
> It's quite useful to save medium-sized binary data like images and documents
> into Apache HBase. Unfortunately, directly saving the binary MOB (medium
> object) to HBase leads to worse performance because of the frequent splits
> and compactions.
> In this design, the MOB data are stored in a more efficient way, which
> keeps high write/read performance and guarantees data consistency in
> Apache HBase.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)