[ 
https://issues.apache.org/jira/browse/HBASE-11339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14037015#comment-14037015
 ] 

Jingcheng Du commented on HBASE-11339:
--------------------------------------

[~jmhsieh], and [[email protected]], thanks for the comments.
I have thought about the suggestions carefully and have some ideas. I'd like to
share them with all of you, so please kindly provide comments. As suggested,
I'll refer to the LOB as MOB from now on.

We don't use the MemStore to save the mob data; we write it directly to the
mob file, and only once.

In the prePut of the coprocessor, each KV is split into two KVs: one (KV0)
holds the offset+path, and the other (KV1) is the mob KV. KV0 is written to the
HLog and MemStore, and KV1 is written to the mob file.
Before the mob data are synced to disk, they are held in the buffer of the mob
writer, and they are not seekable until the buffer is full or has been synced
to disk.
To avoid this, we have to sync the mob data to disk on each put (is it OK to
sync the mob on each put? The mob data are usually pictures, around 1-5MB in
size).
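
To make the KV0 value concrete, here is a minimal sketch of one way the
offset+path could be packed into a byte array. The class name MobRef and the
layout (8-byte offset followed by the UTF-8 path) are assumptions made for
illustration only; they are not taken from the design doc.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical value stored in KV0: the offset of the mob cell inside its
// mob file, plus the path of that mob file.
final class MobRef {
    final long offset;
    final String path;

    MobRef(long offset, String path) {
        this.offset = offset;
        this.path = path;
    }

    // Assumed layout: 8-byte big-endian offset, then the UTF-8 path bytes.
    byte[] encode() {
        byte[] p = path.getBytes(StandardCharsets.UTF_8);
        return ByteBuffer.allocate(Long.BYTES + p.length)
                .putLong(offset)
                .put(p)
                .array();
    }

    // Inverse of encode(): reads the offset, then treats the rest as the path.
    static MobRef decode(byte[] value) {
        ByteBuffer buf = ByteBuffer.wrap(value);
        long offset = buf.getLong();
        byte[] p = new byte[buf.remaining()];
        buf.get(p);
        return new MobRef(offset, new String(p, StandardCharsets.UTF_8));
    }
}
```

A reader would decode the KV0 value, open the file at the path, and seek to
the offset, which is why the mob data has to be synced before KV0 becomes
visible.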

By design, each store has a single mob file for writing. We have to serialize
the operation that increases the offset of KVs within a single mob file, so we
need a synchronized block in the prePut method (two operations in the block:
one syncs the mob data to disk, the other increases the offset). Consequently
all the puts are serialized here, which is not efficient.
We could improve this by using a different mob file for each thread. Then we
would not need synchronization, but we would have too many open files in the
region server (handler*regionNum). This is a problem.
We have a solution for this too: we could define a SynchronousQueue with
limited size so that each region has a limited number of open files. All of
this occurs in prePut, and the prePut method would still need a synchronization
block in each thread. It's improved, but still not efficient IMO.
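
The limited-writers idea could be sketched roughly as below. Note that
java.util.concurrent.SynchronousQueue has no capacity of its own, so this
sketch swaps in an ArrayBlockingQueue as the bounded queue. MobWriter and
MobWriterPool are hypothetical names, and the writer appends to an in-memory
buffer instead of a real mob file.

```java
import java.io.ByteArrayOutputStream;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Stand-in for a mob file writer; a real one would append to a file and sync.
final class MobWriter {
    final String path;
    private final ByteArrayOutputStream out = new ByteArrayOutputStream();

    MobWriter(String path) {
        this.path = path;
    }

    // Appends one mob value and returns the offset at which it starts.
    long append(byte[] value) {
        long offset = out.size();
        out.write(value, 0, value.length);
        return offset;
    }
}

// Holds at most `size` writers (open files); callers block while all are busy.
final class MobWriterPool {
    private final BlockingQueue<MobWriter> idle;

    MobWriterPool(String prefix, int size) {
        idle = new ArrayBlockingQueue<>(size);
        for (int i = 0; i < size; i++) {
            idle.add(new MobWriter(prefix + "-" + i));
        }
    }

    // Borrows a writer exclusively, appends the value, returns "path@offset".
    String write(byte[] value) {
        MobWriter w;
        try {
            w = idle.take();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted waiting for a mob writer", e);
        }
        try {
            return w.path + "@" + w.append(value);
        } finally {
            idle.add(w); // never fails: the queue has room for every writer
        }
    }
}
```

Because a borrowed writer belongs to one thread at a time, the offset update
needs no extra lock of its own; the only contention left is on the queue.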

Before the MemStore flushes (we do this in the preFlush of the coprocessor), we
roll the mob writers and reset the KV offset to 0 for the new writers. This
will block the prePut.

Customers usually require this, so using the TTL to clean expired mob files is
very important; it is a more efficient way to clean the mob files than the
sweep tool (mob files are hardly ever updated, but have a fixed lifetime).
We need a way to rename the mob files in the store flusher before the MemStore
flushes, and to save these mob files by date.
The following situation can happen: the MemStore flush fails while the mob file
renaming succeeds. When the WALEdits are replayed, the connection between the
edits and the mob files is lost. To avoid this, we need to add a
rename-transaction znode to zk. Each renaming transaction has a znode that
contains several child znodes (they're the mapping from nameBeforeRename to
nameAfterRename). The txn znode will be deleted after every successful MemStore
flush, and all the txns for each store are mutually exclusive.
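
A rough sketch of how the rename mapping would be used during replay follows.
It keeps the mapping in an in-memory map as a stand-in for the child znodes
and does not use the ZooKeeper client API; the class name RenameTxn and its
methods are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// In-memory stand-in for one rename-transaction znode. Each map entry plays
// the role of a child znode (nameBeforeRename -> nameAfterRename).
final class RenameTxn {
    private final Map<String, String> renames = new LinkedHashMap<>();

    // Recorded before the mob file is actually renamed.
    void record(String nameBeforeRename, String nameAfterRename) {
        renames.put(nameBeforeRename, nameAfterRename);
    }

    // Called after a successful MemStore flush: the txn is deleted.
    void commit() {
        renames.clear();
    }

    // During WAL replay, maps a path found in a replayed edit to the current
    // file name; paths without a recorded rename are returned unchanged.
    String resolve(String pathInEdit) {
        return renames.getOrDefault(pathInEdit, pathInEdit);
    }
}
```

If the flush fails after the rename, the mapping is still present, so a
replayed KV0 that points at the old name can be redirected to the new one.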

How about this?

> HBase MOB
> ---------
>
>                 Key: HBASE-11339
>                 URL: https://issues.apache.org/jira/browse/HBASE-11339
>             Project: HBase
>          Issue Type: New Feature
>          Components: regionserver, Scanners
>            Reporter: Jingcheng Du
>            Assignee: Jingcheng Du
>         Attachments: HBase LOB Design.pdf
>
>
>   It's quite useful to save massive binary data like images and documents
> into Apache HBase. Unfortunately, directly saving the binary LOB (large
> object) to HBase leads to worse performance because of the frequent splits
> and compactions.
>   In this design, the LOB data are stored in a more efficient way, which
> keeps a high write/read performance and guarantees the data consistency in
> Apache HBase.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
