[
https://issues.apache.org/jira/browse/TRAFODION-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15838748#comment-15838748
]
ASF GitHub Bot commented on TRAFODION-2455:
-------------------------------------------
GitHub user DaveBirdsall opened a pull request:
https://github.com/apache/incubator-trafodion/pull/929
[TRAFODION-2455] Add retry to row count estimation logic
This set of changes does the following:
1. Changes the stack from NATable::estimateHBaseRowCount on down to return
detailed error information about any failure.
2. Changes NATable::estimateHBaseRowCount to return a rowcount of 100
million instead of zero when an error occurs. It is safer to overestimate the
size of an object than to underestimate it.
3. Changes UPDATE STATISTICS to give an error 9252 "Unable to get row count
estimate: Error code $0~int0, detail $1~int1. Exception info (if any):
$0~string0" when an error occurs in or underneath
NATable::estimateHBaseRowCount instead of using a rowcount estimate of zero.
The information in this message gives details such as what error path was
taken, and any Java exception information that may be pertinent.
4. Adds a retry loop to HBaseClient.java method estimateRowCount so that we
retry if we encounter a FileNotFoundException. UPDATE STATISTICS will do up to
4 minutes' worth of accumulated retries; normal compilation will do up to 5
seconds' worth of accumulated retries. Wait times for retries start out at 2
seconds, doubling until topping out at 30 seconds.
5. Adds local-time timestamps to messages in UPDATE STATISTICS logging.
To get local time, I needed to fix a bug in the monitor (process.cxx) that was
incorrectly setting an environment variable TZ to the empty string when TZ was
not defined in the monitor itself.
Note: The fix in process.cxx might (or might not!) fix a similar bug in DTM
logging where local timestamps are desired but UTC timestamps are produced.
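The retry schedule in item 4 can be sketched as follows. This is only an
illustration of the timing described above, not the code in HBaseClient.java:
the class name RowCountRetrySketch, the Sleeper interface, and the methods
waitSchedule and estimateWithRetry are hypothetical, and the sketch assumes
that a wait is attempted only if it fits entirely within the remaining budget
of accumulated wait time.

```java
import java.io.FileNotFoundException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;

// Hypothetical sketch of the retry timing described in item 4.
public class RowCountRetrySketch {
    // From the description: waits start at 2 seconds and top out at 30 seconds.
    public static final long INITIAL_WAIT_MS = 2_000L;
    public static final long MAX_WAIT_MS = 30_000L;

    // Budgets from the description: 4 minutes for UPDATE STATISTICS,
    // 5 seconds for normal compilation.
    public static final long UPDATE_STATS_BUDGET_MS = 240_000L;
    public static final long COMPILE_BUDGET_MS = 5_000L;

    // Injectable so callers (and tests) can avoid really sleeping.
    public interface Sleeper {
        void sleep(long millis) throws InterruptedException;
    }

    // Returns the waits that fit within budgetMs of accumulated waiting.
    // Assumption: a wait that would push the total past the budget is skipped.
    public static List<Long> waitSchedule(long budgetMs) {
        List<Long> waits = new ArrayList<>();
        long wait = INITIAL_WAIT_MS;
        long total = 0L;
        while (total + wait <= budgetMs) {
            waits.add(wait);
            total += wait;
            wait = Math.min(wait * 2, MAX_WAIT_MS); // double, capped at 30 seconds
        }
        return waits;
    }

    // Calls the estimator, retrying only on FileNotFoundException until the
    // schedule is exhausted; any other exception propagates immediately.
    public static long estimateWithRetry(Callable<Long> estimator,
                                         long budgetMs,
                                         Sleeper sleeper) throws Exception {
        List<Long> waits = waitSchedule(budgetMs);
        int attempt = 0;
        while (true) {
            try {
                return estimator.call();
            } catch (FileNotFoundException e) {
                if (attempt >= waits.size()) {
                    throw e; // budget used up: report the failure to the caller
                }
                sleeper.sleep(waits.get(attempt));
                attempt++;
            }
        }
    }
}
```

Under these assumptions the 5 second budget allows a single 2 second wait,
while the 4 minute budget allows waits of 2, 4, 8, and 16 seconds followed by
seven waits of 30 seconds. A real caller would pass Thread::sleep as the
Sleeper.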
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/DaveBirdsall/incubator-trafodion Trafodion2455x
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/incubator-trafodion/pull/929.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #929
----
commit 932c219f0fd2a49e1c53a666a1dcbcb3578c4f91
Author: Dave Birdsall
Date: 2017-01-19T20:10:33Z
[TRAFODION-2440] Add retry to row count estimation logic
commit f1653636dd6d411fafe79697ee75c8763ff8edf2
Author: Dave Birdsall
Date: 2017-01-24T00:09:08Z
Comment change
commit dbc6c0876ca5af07b63d68f388bad07195d106a5
Author: Dave Birdsall
Date: 2017-01-25T22:40:56Z
[TRAFODION-2455] More refinements to row count estimation retry logic
----
> Initial Update Stats on 22B row 2.5TB OE table gets 0 rowcount from
> estimator, fails with timeouts by doing select count (*)
> ----------------------------------------------------------------------------------------------------------------------------
>
> Key: TRAFODION-2455
> URL: https://issues.apache.org/jira/browse/TRAFODION-2455
> Project: Apache Trafodion
> Issue Type: Bug
> Components: sql-cmp
> Affects Versions: 2.1-incubating
> Environment: A cluster large enough to host a 22 billion row table
> Reporter: David Wayne Birdsall
> Assignee: David Wayne Birdsall
>
> When loading a scale factor 73728 Order Entry database, if UPDATE STATISTICS
> is done soon after the load on one particular table (the largest table,
> having 22 billion rows), we get the following failure:
> SQLEXCEPTION on Statement, Error Code = -9200
> update statistics for table trafodion.javabench.oe_orderline_73728 on
> every column, (OL_W_ID, OL_I_ID), (OL_D_ID, OL_W_ID), (OL_D_ID, OL_I_ID)
> sample
> *** ERROR[9200] UPDATE STATISTICS for table
> TRAFODION.JAVABENCH.OE_ORDERLINE_73728 encountered an error (8448) from
> statement getRow(). [2017-01-09 02:07:22]
> *** ERROR[8448] Unable to access Hbase interface. Call to
> ExpHbaseInterface::coProcAggr returned error HBASE_ACCESS_ERROR(-706). Cause:
> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after
> attempts=3, exceptions:
> Mon Jan 09 01:47:21 PST 2017,
> RpcRetryingCaller{globalStartTime=1483954641419, pause=100, retries=3},
> java.io.IOException: Call to nap015.esgyn.local/10.1.10.20:60020 failed on
> local exception: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call
> id=73, waitTime=600001, operationTimeout=600000 expired.
> Mon Jan 09 01:57:21 PST 2017,
> RpcRetryingCaller{globalStartTime=1483954641419, pause=100, retries=3},
> java.io.IOException: Call to nap015.esgyn.local/10.1.10.20:60020 failed on
> local exception: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call
> id=185, waitTime=600001, operationTimeout=600000 expired.
> Mon Jan 09 02:07:22 PST 2017,
> RpcRetryingCaller{globalStartTime=1483954641419, pause=100, retries=3},
> java.io.IOException: Call to nap015.esgyn.local/10.1.10.20:60020 failed on
> local exception: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call
> id=310, waitTime=600001, operationTimeout=600000 expired.
> A subsequent update statistics command succeeds, but these failures take a
> half hour or more.
> Enabling logging for update stats shows that getrowcount returns 0, so update
> stats assumes the table is small enough to do a select count (*). The plan
> for this select count (*) (perhaps suffering from the same issue that causes
> getrowcount to return a non-estimate) chooses the HBase aggregate
> coprocessor. The table in question has 22 billion rows, so the
> coprocessor isn't a good choice, and the query times out. But the real issue
> is: why can't the table get a rowcount estimate?
> Rerunning UPDATE STATS on this table a few hours later succeeds.