[
https://issues.apache.org/jira/browse/HBASE-26238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17421896#comment-17421896
]
Hudson commented on HBASE-26238:
--------------------------------
Results for branch branch-2.4
[build #205 on
builds.a.o|https://ci-hadoop.apache.org/job/HBase/job/HBase%20Nightly/job/branch-2.4/205/]:
(/) *{color:green}+1 overall{color}*
----
details (if available):
(/) {color:green}+1 general checks{color}
-- For more information [see general
report|https://ci-hadoop.apache.org/job/HBase/job/HBase%20Nightly/job/branch-2.4/205/General_20Nightly_20Build_20Report/]
(/) {color:green}+1 jdk8 hadoop2 checks{color}
-- For more information [see jdk8 (hadoop2)
report|https://ci-hadoop.apache.org/job/HBase/job/HBase%20Nightly/job/branch-2.4/205/JDK8_20Nightly_20Build_20Report_20_28Hadoop2_29/]
(/) {color:green}+1 jdk8 hadoop3 checks{color}
-- For more information [see jdk8 (hadoop3)
report|https://ci-hadoop.apache.org/job/HBase/job/HBase%20Nightly/job/branch-2.4/205/JDK8_20Nightly_20Build_20Report_20_28Hadoop3_29/]
(/) {color:green}+1 jdk11 hadoop3 checks{color}
-- For more information [see jdk11
report|https://ci-hadoop.apache.org/job/HBase/job/HBase%20Nightly/job/branch-2.4/205/JDK11_20Nightly_20Build_20Report_20_28Hadoop3_29/]
(/) {color:green}+1 source release artifact{color}
-- See build output for details.
(/) {color:green}+1 client integration test{color}
> OOME in VerifyReplication for the table contains rows with 10M+ cells
> ---------------------------------------------------------------------
>
> Key: HBASE-26238
> URL: https://issues.apache.org/jira/browse/HBASE-26238
> Project: HBase
> Issue Type: Bug
> Components: Client, Replication
> Reporter: Shinya Yoshida
> Assignee: Shinya Yoshida
> Priority: Major
> Fix For: 2.5.0, 3.0.0-alpha-2, 2.3.7, 2.4.7
>
>
> We have a table that contains rows with 10M+ cells.
> When we run VerifyReplication against that table, we get an OOME.
> VerifyReplication cannot complete without an OOME even when we provide a 31GB
> heap for each mapper, although the RS can handle such a get request.
> {code:java}
> org.apache.hadoop.mapred.YarnChild: Error running child : java.lang.OutOfMemoryError
>     at java.lang.AbstractStringBuilder.hugeCapacity(AbstractStringBuilder.java:161)
>     at java.lang.AbstractStringBuilder.newCapacity(AbstractStringBuilder.java:155)
>     at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:125)
>     at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:448)
>     at java.lang.StringBuilder.append(StringBuilder.java:136)
>     at org.apache.hadoop.hbase.client.Result.compareResults(Result.java:844)
>     at org.apache.hadoop.hbase.mapreduce.replication.VerifyReplication$Verifier.map(VerifyReplication.java:184)
>     at org.apache.hadoop.hbase.mapreduce.replication.VerifyReplication$Verifier.map(VerifyReplication.java:95)
>     at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
>     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
>     at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:422)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
>     at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
> {code}
> Interestingly, it always fails at
> AbstractStringBuilder.hugeCapacity(AbstractStringBuilder.java:161) in
> Result.compareResults.
> This is an application-side OOME caused by the maximum size of a Java array,
> so we cannot avoid this error whatever heap size we use.
> {code:java}
> private int hugeCapacity(int minCapacity) {
>     if (Integer.MAX_VALUE - minCapacity < 0) { // overflow
>         throw new OutOfMemoryError();
>     }
>     return (minCapacity > MAX_ARRAY_SIZE)
>         ? minCapacity : MAX_ARRAY_SIZE;
> }
> {code}
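To make the overflow condition above concrete, here is a minimal sketch that reproduces only the `Integer.MAX_VALUE - minCapacity < 0` test. The class and method names are hypothetical and not part of the JDK or HBase; the point is that the check depends only on int arithmetic, not on how much heap is available.

```java
final class CapacityOverflowSketch {
    // Mirrors only the overflow test in hugeCapacity: a requested capacity
    // that wrapped past Integer.MAX_VALUE shows up as a negative int.
    static boolean wouldThrowOome(int currentLength, int appendLength) {
        int minCapacity = currentLength + appendLength; // may wrap around
        return Integer.MAX_VALUE - minCapacity < 0;
    }

    public static void main(String[] args) {
        System.out.println(wouldThrowOome(1_000, 1_000));          // prints false
        System.out.println(wouldThrowOome(Integer.MAX_VALUE, 1));  // prints true
    }
}
```

Once the builder already holds close to 2^31 chars, any further append trips this check regardless of the mapper's heap size.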
> Looking at Result.compareResults, it generates a string representation of
> all cells in both results for the exception message.
> This can be a very large string, even though the cells in the result may
> consist of multiple byte arrays and are stored compactly in the heap.
> (Rowkeys are repeated in the string, a byte takes 4 chars if it cannot be
> represented as an ASCII char, a timestamp is a long (8 bytes) in the cell but
> a 13-char string so far, and so on.)
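The expansion described above can be illustrated with a rough sketch. The `escape` method below is a simplified stand-in written for this example, not the actual `Bytes.toStringBinary` implementation, and the per-cell character counts used with `exceedsStringLimit` are made-up illustrative numbers, not measurements from the table in this report.

```java
final class MessageSizeSketch {
    // Simplified stand-in for binary-to-string escaping: printable ASCII
    // bytes stay 1 char, every other byte becomes the 4-char escape \xNN.
    static String escape(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            int ch = b & 0xFF;
            if (ch >= ' ' && ch <= '~' && ch != '\\') {
                sb.append((char) ch);
            } else {
                sb.append(String.format("\\x%02X", ch));
            }
        }
        return sb.toString();
    }

    // Length of a millisecond timestamp once it is printed as decimal text.
    static int timestampChars(long timestampMillis) {
        return String.valueOf(timestampMillis).length();
    }

    // True if cells * charsPerCell cannot fit in a single Java String,
    // whose length is bounded by Integer.MAX_VALUE.
    static boolean exceedsStringLimit(long cells, long charsPerCell) {
        return cells * charsPerCell > Integer.MAX_VALUE;
    }
}
```

Because the exception message concatenates both results, two rows of 10M cells each contribute about 20M cell strings. Under the assumed figure of 108 chars per cell, `exceedsStringLimit(20_000_000L, 108L)` is true, so the message cannot be built at any heap size.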
> {code:java}
> public static void compareResults(Result res1, Result res2)
>     throws Exception {
>   if (res2 == null) {
>     throw new Exception("There wasn't enough rows, we stopped at "
>         + Bytes.toStringBinary(res1.getRow()));
>   }
>   if (res1.size() != res2.size()) {
>     throw new Exception("This row doesn't have the same number of KVs: "
>         + res1.toString() + " compared to " + res2.toString());
>   }
>   Cell[] ourKVs = res1.rawCells();
>   Cell[] replicatedKVs = res2.rawCells();
>   for (int i = 0; i < res1.size(); i++) {
>     if (!ourKVs[i].equals(replicatedKVs[i]) ||
>         !CellUtil.matchingValue(ourKVs[i], replicatedKVs[i]) ||
>         !CellUtil.matchingTags(ourKVs[i], replicatedKVs[i])) {
>       throw new Exception("This result was different: "
>           + res1.toString() + " compared to " + res2.toString());
>     }
>   }
> }
> {code}
> In VerifyReplication, the thrown exception is never used.
> So this message is useless, a white elephant for VerifyReplication and for us.
> Can we provide a version that produces a lightweight message (or one that
> returns a boolean instead of throwing, although I think it would be confusing
> to have one method that returns a boolean next to a similar one that throws)?
> (Such a hot row can be mutated often and be considered inconsistent due
> to timing issues, so it is difficult to avoid this OOME just by luck.)
> {code:java}
> try {
>   Result.compareResults(value, currentCompareRowInPeerTable);
>   context.getCounter(Counters.GOODROWS).increment(1);
>   if (verbose) {
>     LOG.info("Good row key: " + delimiter
>         + Bytes.toStringBinary(value.getRow()) + delimiter);
>   }
> } catch (Exception e) {
>   logFailRowAndIncreaseCounter(context,
>       Counters.CONTENT_DIFFERENT_ROWS, value);
> }
>
> try {
>   Result sourceResult = sourceTable.get(new Get(row.getRow()));
>   Result replicatedResult = replicatedTable.get(new Get(row.getRow()));
>   Result.compareResults(sourceResult, replicatedResult);
>   if (!sourceResult.isEmpty()) {
>     context.getCounter(Counters.GOODROWS).increment(1);
>     if (verbose) {
>       LOG.info("Good row key (with recompare): " + delimiter +
>           Bytes.toStringBinary(row.getRow())
>           + delimiter);
>     }
>   }
>   return;
> } catch (Exception e) {
>   LOG.error("recompare fail after sleep, rowkey=" + delimiter +
>       Bytes.toStringBinary(row.getRow()) + delimiter);
> }
> With a patch that changes the message to include only the rowkey,
> VerifyReplication can handle our table without an OOME, even with a smaller
> heap.
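As a rough illustration of the rowkey-only idea, the sketch below compares two rows and reports a difference without ever stringifying the cells. It is not the actual patch: `Row`, `describeDifference`, and `compareRows` are hypothetical stand-ins that use plain byte arrays instead of the HBase `Result` and `Cell` types.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class LightweightCompareSketch {
    /** Hypothetical stand-in for a Result: a row key plus serialized cells. */
    static final class Row {
        final byte[] key;
        final byte[][] cells;
        Row(byte[] key, byte[][] cells) { this.key = key; this.cells = cells; }
    }

    /** Returns null if the rows match, else a short message with the row key only. */
    static String describeDifference(Row ours, Row replicated) {
        String rowKey = new String(ours.key, StandardCharsets.UTF_8);
        if (replicated == null) {
            return "Not enough rows, stopped at row=" + rowKey;
        }
        if (ours.cells.length != replicated.cells.length) {
            return "Different number of cells, row=" + rowKey;
        }
        for (int i = 0; i < ours.cells.length; i++) {
            if (!Arrays.equals(ours.cells[i], replicated.cells[i])) {
                return "Cell " + i + " differs, row=" + rowKey;
            }
        }
        return null;
    }

    /** Exception-throwing variant, keeping the same contract as the original method. */
    static void compareRows(Row ours, Row replicated) throws Exception {
        String difference = describeDifference(ours, replicated);
        if (difference != null) {
            throw new Exception(difference);
        }
    }
}
```

The message size here is bounded by the row key length rather than by the number of cells, which is why a comparison of this shape would not hit the array-size limit.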
--
This message was sent by Atlassian Jira
(v8.3.4#803005)