[performance] Distributed splitting of regionserver commit logs
---------------------------------------------------------------
Key: HBASE-1364
URL: https://issues.apache.org/jira/browse/HBASE-1364
Project: Hadoop HBase
Issue Type: Improvement
Reporter: stack
Priority: Critical
HBASE-1008 has some improvements to our log splitting on regionserver crash,
but the splitting needs to run even faster.
(Below is from HBASE-1008)
In the Bigtable paper, the split is distributed. If we're going to have 1000
logs, we need to distribute or at least multithread the splitting.
1. As is, a region starting up expects to find only one reconstruction log. We
need to make it so a region can pick up a bunch of edit logs, and it should be
fine if those logs live elsewhere in HDFS, in an output directory written by
all split participants, whether the split is multithreaded or a mapreduce-like
distributed process. (Let's write our distributed sort first as an MR job so we
learn what's involved; the distributed sort should use MR framework pieces as
much as possible.) On startup, regions go to this directory and pick up the
files written by the split participants, deleting the files and clearing the
directory once all have been read in. Accepting multiple logs as input can also
make the split process more robust than the current tenuous process, which
loses all edits if it doesn't make it to the end without error.
2. Each column family rereads the reconstruction log to find its edits. We need
to fix that. The split can sort the edits by column family so that each store
reads only its own edits.
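
To make the two points above concrete, here is a minimal in-memory sketch of
the multithreaded variant. It is not HBase code: LogSplitSketch, Edit,
splitOne, splitAll and editsForStore are hypothetical names, each returned map
stands in for one output file written by a split participant, and ordering
replay by a sequence id is an assumption made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustrative only: LogSplitSketch and Edit are hypothetical, not HBase classes.
final class LogSplitSketch {

    // One logged edit, reduced to the fields the split cares about.
    static final class Edit {
        final String region;
        final String family;
        final long seqId; // assumed ordering key for replay

        Edit(String region, String family, long seqId) {
            this.region = region;
            this.family = family;
            this.seqId = seqId;
        }
    }

    // Work of one split participant: bucket a single log's edits by region,
    // then by column family. The returned map stands in for one output file.
    static Map<String, Map<String, List<Edit>>> splitOne(List<Edit> log) {
        Map<String, Map<String, List<Edit>>> out = new TreeMap<>();
        for (Edit e : log) {
            out.computeIfAbsent(e.region, r -> new TreeMap<>())
               .computeIfAbsent(e.family, f -> new ArrayList<>())
               .add(e);
        }
        return out;
    }

    // Splits every log on a fixed-size thread pool; returns one output per log.
    static List<Map<String, Map<String, List<Edit>>>> splitAll(
            List<List<Edit>> logs, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Map<String, Map<String, List<Edit>>>>> pending = new ArrayList<>();
            for (List<Edit> log : logs) {
                pending.add(pool.submit(() -> splitOne(log)));
            }
            List<Map<String, Map<String, List<Edit>>>> outputs = new ArrayList<>();
            for (Future<Map<String, Map<String, List<Edit>>>> f : pending) {
                outputs.add(f.get());
            }
            return outputs;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("split interrupted", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("split failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    // What a store would do on region startup: read only its own family's
    // edits from every participant output, then order them by sequence id.
    static List<Edit> editsForStore(List<Map<String, Map<String, List<Edit>>>> outputs,
                                    String region, String family) {
        List<Edit> edits = new ArrayList<>();
        for (Map<String, Map<String, List<Edit>>> output : outputs) {
            Map<String, List<Edit>> families =
                output.getOrDefault(region, Collections.emptyMap());
            edits.addAll(families.getOrDefault(family, Collections.emptyList()));
        }
        edits.sort(Comparator.comparingLong((Edit e) -> e.seqId));
        return edits;
    }

    public static void main(String[] args) {
        List<List<Edit>> logs = Arrays.asList(
            Arrays.asList(new Edit("r1", "info", 5L), new Edit("r2", "info", 1L)),
            Arrays.asList(new Edit("r1", "info", 2L), new Edit("r1", "data", 3L)));
        List<Map<String, Map<String, List<Edit>>>> outputs = splitAll(logs, 2);
        for (Edit e : editsForStore(outputs, "r1", "info")) {
            System.out.println(e.region + "/" + e.family + " seq=" + e.seqId);
        }
    }
}
```

Keeping one output per input log, rather than merging everything into a single
reconstruction log, means a failure while splitting one log need not discard
the edits already split from the others. A real implementation would write the
outputs to files in HDFS and handle partial failures explicitly.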