[
https://issues.apache.org/jira/browse/HBASE-10216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14313284#comment-14313284
]
Andrew Purtell edited comment on HBASE-10216 at 2/10/15 1:01 AM:
-----------------------------------------------------------------
We could propose a new HDFS API that "would merge files so that the merging and
deleting can be performed on local data nodes with no file contents moving over
the network", but does this not only push something implemented today in the
HBase regionserver down into the HDFS datanodes? Could a merge as described be
safely executed in parallel on multiple datanodes without coordination? No,
because the result is not a 1:1 map of input block to output block. Therefore
in a realistic implementation (IMHO) a single datanode would handle the merge
procedure. From a block device and network perspective nothing would change.
Set the above aside. We can't push something as critical to HBase as compaction
down into HDFS. First, the HDFS project is unlikely to accept the idea or
implement it in the first place. Even in the unlikely event that happens, we
would need to reimplement compaction using the new HDFS facility to take
advantage of it, yet we would still need to support older versions of HDFS
without the new API for a while, and if the new HDFS API ever failed to
perfectly address the minutiae of HBase compaction, then or at any point going
forward, we would be back where we started.
Let's look at the read and write aspects with an eye toward what we have today,
and assuming no new HDFS API.
Reads: With short circuit reads enabled, recommended for all deployments, if
file blocks are available on the local datanode then block reads are fully
local via a file descriptor passed over a Unix domain socket; we never touch a
TCP/IP socket. The probability that a block read for an HFile is local can be
made very high by taking care to align region placement with block placement,
and/or by fixing up placement where block locality has dropped below a
threshold using an existing HDFS API; see HBASE-4755 and HDFS-4606.
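To make the locality point concrete, here is a rough sketch, not HBase code and with illustrative names only, of computing a per-file locality index with the stock FileSystem#getFileBlockLocations API; this is the kind of information the HBASE-4755 tooling builds on:
{code:java}
import java.io.IOException;
import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HFileLocalitySketch {

  /**
   * Fraction of the file's blocks that have a replica on the given host.
   * Illustrative only; HBase computes a similar locality index internally.
   */
  static float localityIndex(FileSystem fs, Path hfile, String localHost)
      throws IOException {
    FileStatus status = fs.getFileStatus(hfile);
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    if (blocks.length == 0) {
      return 0f;
    }
    int local = 0;
    for (BlockLocation block : blocks) {
      if (Arrays.asList(block.getHosts()).contains(localHost)) {
        local++;
      }
    }
    return (float) local / blocks.length;
  }

  public static void main(String[] args) throws IOException {
    FileSystem fs = FileSystem.get(new Configuration());
    // Hypothetical HFile path and hostname, purely for illustration
    System.out.println("locality index: "
        + localityIndex(fs, new Path(args[0]), args[1]));
  }
}
{code}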
Writes: Writers like regionservers always contact the local datanode, assuming
colocation of datanode and regionserver, as the first hop in the write
pipeline. The datanode will then pipeline the write over the network to
replicas, but only the second hop in the pipeline (from local datanode to first
remote replica) will add contention on the local NIC; the third (from remote
replica to another remote replica) will be pipelined from the remote node. It's true
we can initially avoid second-replica network IO by writing to a local file. Or
we can have the equivalent in HDFS by setting the initial replication factor of
the new file to 1. In either case after closing the file, to make the result
robust against node loss, we need to replicate all blocks of the newly written
file immediately afterward. So now we are waiting for network IO and contending
for the NIC anyway; we have just deferred the network IO until the file was
completely written. We are not saving a single byte in transmission on the local
NIC. We would also have to add housekeeping that ensures we don't delete older
HFiles until the new/merged HFile is completely replicated; this makes something
our business that today is transparent: because we don't defer writes, when
close() on a file we are writing directly to HDFS completes, we know it has
already been fully replicated.
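To illustrate why deferring replication doesn't save any NIC bandwidth, here is a rough sketch using only standard FileSystem calls (the waiting loop and names are illustrative, not proposed code): write with a replication factor of 1, close, then raise the replication factor and wait until every block actually has enough replicas. Every byte still crosses the local NIC, just later, and the waiting and housekeeping become our problem:
{code:java}
import java.io.IOException;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DeferredReplicationSketch {

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path out = new Path(args[0]); // hypothetical compaction output path

    // Write with a single replica: only the local datanode is touched,
    // so no network IO happens during the write itself.
    try (FSDataOutputStream stream = fs.create(out, true,
        conf.getInt("io.file.buffer.size", 4096),
        (short) 1, fs.getDefaultBlockSize(out))) {
      stream.write("example contents".getBytes(StandardCharsets.UTF_8));
    }

    // Now ask HDFS to bring the file up to the normal replication factor.
    // The NameNode schedules block copies; all of that data still leaves
    // through the local NIC, just later than in a pipelined write.
    fs.setReplication(out, (short) 3);

    // The file is only durable against node loss once every block actually
    // has the requested number of replicas, so we have to wait for that.
    boolean fullyReplicated = false;
    while (!fullyReplicated) {
      FileStatus status = fs.getFileStatus(out);
      BlockLocation[] blocks =
          fs.getFileBlockLocations(status, 0, status.getLen());
      fullyReplicated = true;
      for (BlockLocation block : blocks) {
        if (block.getHosts().length < 3) {
          fullyReplicated = false;
          break;
        }
      }
      if (!fullyReplicated) {
        Thread.sleep(1000);
      }
    }
  }
}
{code}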
For us to see any significant impact, I think the proposal on this issue must
be replaced with one where we flush from memstore to local files and then at
some point merge locally flushed files to a compacted file on HDFS. Only then
are we really saving on IO. All of those locally flushed files represent data
that never leaves the local node, never crosses the network, never causes reads
or writes beyond the local node. This is the benefit *and* the nature of the
data availability problem that follows: We can't consider locally flushed files
as persisted data. If a node crashes before they are compacted they are lost
(until the node comes back online... maybe), or if a local file is corrupted
before compaction the data inside is also lost. We can only consider flushed
data persisted after a completed compaction, only after the compaction result
is fully replicated in HDFS. We somehow have to track all of the data in local
flush files and ensure it has all been compacted before deleting the WALs that
contain those edits. We somehow need to detect when local flush files found
after node recovery are stale. Etc etc. Will the savings be worth the added
complexity and additional failure modes? Maybe, but I believe Facebook
published a paper on this that was inconclusive.
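For a sense of the extra bookkeeping this design would force on us, here is a hypothetical sketch; none of this exists in HBase today and every name is made up. The idea is that a WAL can only be deleted once every local flush file holding its edits has been compacted into a fully replicated HFile:
{code:java}
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/**
 * Hypothetical bookkeeping sketch, not existing HBase code: tracks which
 * local (unreplicated) flush files still hold edits from each WAL, so a
 * WAL is only eligible for deletion once all of its edits have reached a
 * compacted, fully replicated HFile in HDFS.
 */
public class LocalFlushTracker {

  // WAL id -> local flush files that still contain edits from that WAL
  private final Map<String, Set<String>> pendingFlushesByWal = new HashMap<>();

  /** Called when a memstore flush goes to a local (node-only) file. */
  public synchronized void flushedLocally(String walId, String localFlushFile) {
    pendingFlushesByWal
        .computeIfAbsent(walId, k -> new HashSet<>())
        .add(localFlushFile);
  }

  /**
   * Called only after a compaction that consumed these local flush files
   * has completed AND its output HFile is fully replicated in HDFS.
   */
  public synchronized void compactedAndReplicated(Set<String> localFlushFiles) {
    for (Set<String> pending : pendingFlushesByWal.values()) {
      pending.removeAll(localFlushFiles);
    }
  }

  /** A WAL may be archived or deleted only when nothing still depends on it. */
  public synchronized boolean canDeleteWal(String walId) {
    Set<String> pending = pendingFlushesByWal.get(walId);
    return pending == null || pending.isEmpty();
  }
}
{code}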
> Change HBase to support local compactions
> -----------------------------------------
>
> Key: HBASE-10216
> URL: https://issues.apache.org/jira/browse/HBASE-10216
> Project: HBase
> Issue Type: New Feature
> Components: Compaction
> Environment: All
> Reporter: David Witten
>
> As I understand it compactions will read data from DFS and write to DFS.
> This means that even when the reading occurs on the local host (because
> region server has a local copy) all the writing must go over the network to
> the other replicas. This proposal suggests that HBase would perform much
> better if all the reading and writing occurred locally and did not go over
> the network.
> I propose that the DFS interface be extended to provide a method that would
> merge files so that the merging and deleting can be performed on local data
> nodes with no file contents moving over the network. The method would take a
> list of paths to be merged and deleted, the merged file path, and an
> indication of a file-format-aware class that would be run on each data node
> to perform the merge. The merge method provided by this merging class would
> be passed files open for reading for all the files to be merged and one file
> open for writing. The merge method provided by the custom class would read all the
> input files and append to the output file using some standard API that would
> work across all DFS implementations. The DFS would ensure that the merge had
> happened properly on all replicas before returning to the caller. It could
> be that greater resiliency could be achieved by implementing the deletion as
> a separate phase that is only done after enough of the replicas had completed
> the merge.
> HBase would be changed to use the new merge method for compactions, and would
> provide an implementation of the merging class that works with HFiles.
> This proposal would require custom code that understands the file format to
> be runnable by the data nodes to manage the merge. So there would need to be
> a facility to load classes into DFS if there isn't such a facility already.
> Or, less generally, HDFS could build in support for HFile merging.
> The merge method might be optional. If the DFS implementation did not
> provide it, a generic version that performed the merge on top of the regular
> DFS interfaces would be used.
> It may be that this method needs to be tweaked or ignored when the region
> server does not have a local copy of the data so that, as happens currently, one
> copy of the data moves to the region server.
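For reference, the quoted proposal implies a DFS-side interface roughly like the sketch below. Nothing like this exists in HDFS today; every name is hypothetical and only meant to make the shape of the proposal concrete:
{code:java}
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.List;

import org.apache.hadoop.fs.Path;

/**
 * Hypothetical shape of the DFS-side merge facility described in the issue;
 * no such API exists in HDFS. All names are illustrative only.
 */
public interface MergeCapableFileSystem {

  /**
   * File-format-aware merge logic supplied by the client (HBase would
   * provide an HFile implementation) and run on each datanode.
   */
  interface FileMerger {
    void merge(List<InputStream> inputs, OutputStream output) throws IOException;
  }

  /**
   * Merges the input files into mergedFile locally on every replica, then
   * deletes the inputs, with no file contents moving over the network.
   * Returns only after all replicas have completed the merge.
   */
  void mergeAndDelete(List<Path> inputs, Path mergedFile,
      Class<? extends FileMerger> mergerClass) throws IOException;
}
{code}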