langdamao created HBASE-22887:
---------------------------------

             Summary: HFileOutputFormat2 split a lot of HFile by roll once per 
rowkey
                 Key: HBASE-22887
                 URL: https://issues.apache.org/jira/browse/HBASE-22887
             Project: HBase
          Issue Type: Bug
          Components: mapreduce
    Affects Versions: 2.0.0
         Environment: HBase 2.0.0
            Reporter: langdamao


When I use HFileOutputFormat2 in an MR job to build HFiles, the reducer creates 
a large number of files.

Here is the log:
{code:java}
2019-08-16 14:42:51,988 INFO [main] 
org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2: 
Writer=hdfs://hfile/_temporary/1/_temporary/attempt_1558444096078_519332_r_000016_0/F1/06f3b0e9f0644ee782b7cf4469f44a70,
 wrote=893827310 
Writer=hdfs://hfile/_temporary/1/_temporary/attempt_1558444096078_519332_r_000016_0/F1/1454ea148f1547499209a266ad25387f,
 wrote=61 
Writer=hdfs://hfile/_temporary/1/_temporary/attempt_1558444096078_519332_r_000016_0/F1/9d35446634154b4ca4be56f361b57c8b,
 wrote=55 
...  {code}
It keeps opening a new file each time a new rowkey arrives.
I added more logging and found the problem in this code 
([GitHub|https://github.com/apache/hbase/blob/master/hbase-mapreduce/src/main/java/org/apache/hadoop/hbase/mapreduce/HFileOutputFormat2.java#L289]):
{code:java}
if (wl != null && wl.written + length >= maxsize) {
  this.rollRequested = true;
}

// This can only happen once a row is finished though
if (rollRequested && Bytes.compareTo(this.previousRow, rowKey) != 0) {
  rollWriters(wl);
}{code}
In my case, I have two families, F1 and F2. The writer of F2 reaches maxsize, 
so rollRequested becomes true, but the rowkey is the same as previousRow, so 
the writer is not rolled. When the next rowkey arrives with family F1, both 
rollRequested and Bytes.compareTo(this.previousRow, rowKey) != 0 are true, so 
the writer of F1 is rolled and a new HFile is created. Then the same rowkey 
arrives with family F2 and sets rollRequested to true again, and when the next 
rowkey arrives with family F1, the writer of F1 is rolled once more. 
So a new HFile is created for every rowkey of family F1, and the writer of F2 
is never rolled until the job ends.
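
The sequence described above can be reproduced with a small standalone model. This is only a sketch of the quoted roll logic, not the HBase code: the class name SharedFlagRollSimulation, the 100-byte MAX_SIZE and the cell sizes are hypothetical, and each row is assumed to carry exactly one F1 cell followed by one F2 cell.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Simplified model of the roll logic quoted above; NOT the HBase implementation.
class SharedFlagRollSimulation {
    static final long MAX_SIZE = 100; // hypothetical stand-in for maxsize

    /** Returns how many files each family opened after writing the given rows. */
    static Map<String, Integer> simulate(int rows, int f1CellSize, int f2CellSize) {
        Map<String, Long> written = new HashMap<>();        // bytes in the open file, per family
        Map<String, Integer> filesOpened = new HashMap<>(); // files opened so far, per family
        boolean rollRequested = false;                      // one flag shared by all families
        byte[] previousRow = new byte[0];                   // one previous row shared by all families
        String[] families = {"F1", "F2"};
        int[] sizes = {f1CellSize, f2CellSize};

        for (int r = 0; r < rows; r++) {
            byte[] rowKey = ("row" + r).getBytes(StandardCharsets.UTF_8);
            for (int i = 0; i < families.length; i++) {
                String family = families[i];
                int length = sizes[i];
                Long current = written.get(family);

                if (current != null && current + length >= MAX_SIZE) {
                    rollRequested = true;
                }
                // The roll is applied to whichever family's cell arrives first on a new row.
                if (rollRequested && !Arrays.equals(previousRow, rowKey)) {
                    written.remove(family); // close only this family's writer
                    rollRequested = false;
                }
                if (!written.containsKey(family)) {
                    written.put(family, 0L); // open a new file for this family
                    filesOpened.merge(family, 1, Integer::sum);
                }
                written.merge(family, (long) length, Long::sum);
                previousRow = rowKey;
            }
        }
        return filesOpened;
    }

    public static void main(String[] args) {
        Map<String, Integer> files = simulate(10, 1, 30);
        System.out.println("F1 files: " + files.get("F1"));
        System.out.println("F2 files: " + files.get("F2"));
    }
}
```

With 10 rows, 1-byte F1 cells and 30-byte F2 cells, the model opens 7 files for F1 and only 1 for F2: once F2 crosses the limit at the fourth row, every later row rolls F1 instead.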
 
Here are my questions and possible solutions:
Q1. Does HBase 2.0.0 support different rowkey split points for different 
families of the same HBase table? That is, rowkeyA may be written to the first 
HFile of F1 but to the second HFile of F2. HBase 1.x.x does not support this, 
so it rolls all the writers and does not hit this problem. I guess the answer 
is "yes, supported", which leads to Q2.
Q2. Can the same rowkey with the same family arrive at 
HFileOutputFormat2.write more than once?
If not, can we fix it this way, since the rowKey will then never be the same 
as previousRow:
{code:java}
if (wl != null && wl.written + length >= maxsize) {
  rollWriters(wl);
}{code}
If yes, do we need a Map to record previousRow per family:
{code:java}
private final Map<byte[], byte[]> previousRows =
        new TreeMap<>(Bytes.BYTES_COMPARATOR);

if (wl != null && wl.written + length >= maxsize
    && Bytes.compareTo(this.previousRows.get(family), rowKey) != 0) {
  rollWriters(wl);
}{code}
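
To illustrate how the per-family map would behave, here is a minimal standalone sketch of that idea. It is a model under assumptions, not a patch: the class name PerFamilyRollSimulation, the 100-byte MAX_SIZE and the cell sizes are hypothetical, and each row is assumed to carry one F1 cell followed by one F2 cell.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Simplified model of the per-family previousRows proposal; NOT the HBase implementation.
class PerFamilyRollSimulation {
    static final long MAX_SIZE = 100; // hypothetical stand-in for maxsize

    /** Returns how many files each family opened after writing the given rows. */
    static Map<String, Integer> simulate(int rows, int f1CellSize, int f2CellSize) {
        Map<String, Long> written = new HashMap<>();        // bytes in the open file, per family
        Map<String, Integer> filesOpened = new HashMap<>(); // files opened so far, per family
        Map<String, byte[]> previousRows = new HashMap<>(); // last row written, per family
        String[] families = {"F1", "F2"};
        int[] sizes = {f1CellSize, f2CellSize};

        for (int r = 0; r < rows; r++) {
            byte[] rowKey = ("row" + r).getBytes(StandardCharsets.UTF_8);
            for (int i = 0; i < families.length; i++) {
                String family = families[i];
                int length = sizes[i];
                Long current = written.get(family);
                byte[] previousRow = previousRows.get(family);

                // Roll only the family that is over the limit, and only on a row boundary
                // of that same family. Arrays.equals tolerates a null previousRow.
                if (current != null && current + length >= MAX_SIZE
                        && !Arrays.equals(previousRow, rowKey)) {
                    written.remove(family);
                }
                if (!written.containsKey(family)) {
                    written.put(family, 0L); // open a new file for this family
                    filesOpened.merge(family, 1, Integer::sum);
                }
                written.merge(family, (long) length, Long::sum);
                previousRows.put(family, rowKey);
            }
        }
        return filesOpened;
    }

    public static void main(String[] args) {
        Map<String, Integer> files = simulate(10, 1, 30);
        System.out.println("F1 files: " + files.get("F1"));
        System.out.println("F2 files: " + files.get("F2"));
    }
}
```

With 10 rows, 1-byte F1 cells and 30-byte F2 cells, this model keeps F1 in a single file and rolls F2 every third row, for 4 F2 files in total.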



--
This message was sent by Atlassian Jira
(v8.3.2#803003)
