[jira] [Commented] (HBASE-26238) OOME in VerifyReplication for the table contains rows with 10M+ cells

Hudson (Jira) Tue, 28 Sep 2021 19:30:45 -0700


    [ 
https://issues.apache.org/jira/browse/HBASE-26238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17421896#comment-17421896
 ]


Hudson commented on HBASE-26238:
--------------------------------

Results for branch branch-2.4
        [build #205 on builds.a.o|https://ci-hadoop.apache.org/job/HBase/job/HBase%20Nightly/job/branch-2.4/205/]: (/) *{color:green}+1 overall{color}*
----
details (if available):

(/) {color:green}+1 general checks{color}
-- For more information [see general report|https://ci-hadoop.apache.org/job/HBase/job/HBase%20Nightly/job/branch-2.4/205/General_20Nightly_20Build_20Report/]




(/) {color:green}+1 jdk8 hadoop2 checks{color}
-- For more information [see jdk8 (hadoop2) report|https://ci-hadoop.apache.org/job/HBase/job/HBase%20Nightly/job/branch-2.4/205/JDK8_20Nightly_20Build_20Report_20_28Hadoop2_29/]


(/) {color:green}+1 jdk8 hadoop3 checks{color}
-- For more information [see jdk8 (hadoop3) report|https://ci-hadoop.apache.org/job/HBase/job/HBase%20Nightly/job/branch-2.4/205/JDK8_20Nightly_20Build_20Report_20_28Hadoop3_29/]


(/) {color:green}+1 jdk11 hadoop3 checks{color}
-- For more information [see jdk11 report|https://ci-hadoop.apache.org/job/HBase/job/HBase%20Nightly/job/branch-2.4/205/JDK11_20Nightly_20Build_20Report_20_28Hadoop3_29/]


(/) {color:green}+1 source release artifact{color}
-- See build output for details.


(/) {color:green}+1 client integration test{color}


> OOME in VerifyReplication for the table contains rows with 10M+ cells
> ---------------------------------------------------------------------
>
>                 Key: HBASE-26238
>                 URL: https://issues.apache.org/jira/browse/HBASE-26238
>             Project: HBase
>          Issue Type: Bug
>          Components: Client, Replication
>            Reporter: Shinya Yoshida
>            Assignee: Shinya Yoshida
>            Priority: Major
>             Fix For: 2.5.0, 3.0.0-alpha-2, 2.3.7, 2.4.7
>
>
> We have a table that contains rows with 10M+ cells.
> When we run VerifyReplication against that table, it fails with an OOME.
> VerifyReplication cannot complete without an OOME even when we give each 
> mapper a 31GB heap, although the RS can handle such a get request.
> {code:java}
> org.apache.hadoop.mapred.YarnChild: Error running child : java.lang.OutOfMemoryError
>         at java.lang.AbstractStringBuilder.hugeCapacity(AbstractStringBuilder.java:161)
>         at java.lang.AbstractStringBuilder.newCapacity(AbstractStringBuilder.java:155)
>         at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:125)
>         at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:448)
>         at java.lang.StringBuilder.append(StringBuilder.java:136)
>         at org.apache.hadoop.hbase.client.Result.compareResults(Result.java:844)
>         at org.apache.hadoop.hbase.mapreduce.replication.VerifyReplication$Verifier.map(VerifyReplication.java:184)
>         at org.apache.hadoop.hbase.mapreduce.replication.VerifyReplication$Verifier.map(VerifyReplication.java:95)
>         at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
>         at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
>         at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
>         at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
> {code}
> Interestingly, it always fails at 
> AbstractStringBuilder.hugeCapacity(AbstractStringBuilder.java:161), reached 
> from Result.compareResults.
> This is an application-side OOME caused by the maximum size of a Java array, 
> so we cannot avoid the error no matter how large the heap is.
> {code:java}
>     private int hugeCapacity(int minCapacity) {
>         if (Integer.MAX_VALUE - minCapacity < 0) { // overflow
>             throw new OutOfMemoryError();
>         }
>         return (minCapacity > MAX_ARRAY_SIZE)
>             ? minCapacity : MAX_ARRAY_SIZE;
>     }
> {code}
> Looking at Result.compareResults, it builds a string representation of 
> all cells in both results for the exception message.
> This string can be very large, even though the cells in a result may be 
> backed by multiple byte arrays and are stored compactly on the heap.
> (The row key is repeated for every cell, a byte that cannot be represented 
> as an ASCII char takes 4 chars, a long timestamp takes 8 bytes but currently 
> 13 chars as a string, and so on.)
> {code:java}
>   public static void compareResults(Result res1, Result res2)
>       throws Exception {
>     if (res2 == null) {
>       throw new Exception("There wasn't enough rows, we stopped at "
>           + Bytes.toStringBinary(res1.getRow()));
>     }
>     if (res1.size() != res2.size()) {
>       throw new Exception("This row doesn't have the same number of KVs: "
>           + res1.toString() + " compared to " + res2.toString());
>     }
>     Cell[] ourKVs = res1.rawCells();
>     Cell[] replicatedKVs = res2.rawCells();
>     for (int i = 0; i < res1.size(); i++) {
>       if (!ourKVs[i].equals(replicatedKVs[i]) ||
>           !CellUtil.matchingValue(ourKVs[i], replicatedKVs[i]) ||
>           !CellUtil.matchingTags(ourKVs[i], replicatedKVs[i])) {
>         throw new Exception("This result was different: "
>             + res1.toString() + " compared to " + res2.toString());
>       }
>     }
>   }
> {code}
> In VerifyReplication, the thrown exception is never used.
> So this message is useless, a white elephant for both VerifyReplication and us.
> Can we provide a version that produces a lightweight message (or one that 
> returns a boolean instead of throwing, although I think having one method 
> that returns a boolean next to a similar one that throws would be confusing)?
> (Such a hot row is mutated often and can be considered inconsistent due 
> to timing issues, so we cannot count on luck to avoid this OOME.)
> {code:java}
>           try {
>             Result.compareResults(value, currentCompareRowInPeerTable);
>             context.getCounter(Counters.GOODROWS).increment(1);
>             if (verbose) {
>               LOG.info("Good row key: " + delimiter
>                   + Bytes.toStringBinary(value.getRow()) + delimiter);
>             }
>           } catch (Exception e) {
>             logFailRowAndIncreaseCounter(context, Counters.CONTENT_DIFFERENT_ROWS, value);
>           }
>         try {
>           Result sourceResult = sourceTable.get(new Get(row.getRow()));
>           Result replicatedResult = replicatedTable.get(new Get(row.getRow()));
>           Result.compareResults(sourceResult, replicatedResult);
>           if (!sourceResult.isEmpty()) {
>             context.getCounter(Counters.GOODROWS).increment(1);
>             if (verbose) {
>               LOG.info("Good row key (with recompare): " + delimiter
>                   + Bytes.toStringBinary(row.getRow()) + delimiter);
>             }
>           }
>           return;
>         } catch (Exception e) {
>           LOG.error("recompare fail after sleep, rowkey=" + delimiter +
>               Bytes.toStringBinary(row.getRow()) + delimiter);
>         }
> {code}
> With a patch that changes the message to include only the row key, 
> VerifyReplication can process our table without an OOME, even with a smaller heap.

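The `hugeCapacity` check quoted in the description can be exercised without allocating any large array. The sketch below reproduces only the integer arithmetic, assuming the requested capacity is computed as `count + len` in `int` arithmetic, as JDK 8's `AbstractStringBuilder.append` does. The class and method names, and the `count` and `len` values, are hypothetical.

```java
// Reproduces only the integer arithmetic behind the failure;
// no large arrays are allocated.
class HugeCapacityOverflowDemo {

    /** Mirrors the overflow test quoted from AbstractStringBuilder.hugeCapacity. */
    static boolean wouldThrowOome(int minCapacity) {
        return Integer.MAX_VALUE - minCapacity < 0;
    }

    /**
     * Capacity requested when appending len chars to a builder that already
     * holds count chars. Plain int addition wraps around on overflow.
     */
    static int requestedCapacity(int count, int len) {
        return count + len;
    }

    public static void main(String[] args) {
        int count = Integer.MAX_VALUE - 10; // hypothetical: builder is nearly full
        int len = 100;                      // hypothetical: next chunk to append
        int minCapacity = requestedCapacity(count, len);
        System.out.println("minCapacity = " + minCapacity);
        System.out.println("throws OOME = " + wouldThrowOome(minCapacity));
    }
}
```

Once the total length passes `Integer.MAX_VALUE`, the requested capacity wraps to a negative number and the check fires regardless of how much heap is available, which matches the observation that a larger heap does not help.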


--
This message was sent by Atlassian Jira
(v8.3.4#803005)
