GitHub user DaveBirdsall opened a pull request:
https://github.com/apache/trafodion/pull/1730
[TRAFODION-3223] Don't scale down for non-Puts when estimating row counts
The estimateRowCount code in HBaseClient.java tried to scale down row
counts by the proportion of non-Put cells in the file. That is, it tried to
estimate row count from cell count, in part by discounting the effect of
Delete tombstone cells. It did so based on a sample of 500 rows from a single
HFile.
We find, however, that with time-ordered data that is aged out, the Delete
cells are not uniformly distributed but instead tend to clump in one place. If
we are unlucky and get an HFile that begins with 500 Delete tombstones, we will
incorrectly assume most of the table consists of deleted rows and drastically
underestimate the number of rows.
Drastically underestimating the row count can be very bad; it is much better
to overestimate. So the code that attempted to scale down the row count based
on the number of non-Put cells has been deleted. Also, if the number of Puts in
our sample is very small (< 50), we now ignore the sample and use the total
number of entries instead.
The changes described above are in HBaseClient.java.
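
To illustrate the difference between the old and new behavior, here is a
minimal sketch. It is not the code from HBaseClient.java: the class name,
method names, parameters, and the exact formulas are assumptions made for
illustration. Only the 50-Put threshold and the decision to stop scaling by the
Put fraction come from the description above.

```java
// Hypothetical sketch; names and formulas are illustrative and are not
// taken from HBaseClient.java.
class RowCountEstimateSketch {
    // Threshold from the description: with fewer Puts than this in the
    // sample, the sample is ignored.
    static final int MIN_SAMPLE_PUTS = 50;

    // Old behavior (assumed form): discount the entry count by the sampled
    // Put fraction, then convert Put cells to rows using the sample.
    static long estimateWithScaling(long totalEntries, int putCells,
                                    int nonPutCells, int sampleRows) {
        int sampleCells = putCells + nonPutCells;
        if (sampleCells == 0 || putCells == 0) {
            return 0; // a sample with no Puts drives the estimate to zero
        }
        long putEntries = totalEntries * putCells / sampleCells;
        return putEntries * sampleRows / putCells;
    }

    // New behavior (assumed form): never discount for non-Put cells, and
    // fall back to the total entry count when the sample has too few Puts.
    static long estimateWithoutScaling(long totalEntries, int putCells,
                                       int sampleRows) {
        if (putCells < MIN_SAMPLE_PUTS) {
            return totalEntries; // deliberate overestimate
        }
        return totalEntries * sampleRows / putCells;
    }

    public static void main(String[] args) {
        long totalEntries = 1_000_000L;

        // Unlucky sample: 500 Delete tombstones and no Puts.
        System.out.println(estimateWithScaling(totalEntries, 0, 500, 0));
        System.out.println(estimateWithoutScaling(totalEntries, 0, 0));

        // Mixed sample: 400 Put cells and 100 non-Put cells over 100 rows.
        System.out.println(estimateWithScaling(totalEntries, 400, 100, 100));
        System.out.println(estimateWithoutScaling(totalEntries, 400, 100));
    }
}
```

In this sketch the all-tombstone sample yields an estimate of 0 under the old
scheme and 1000000 under the new one, which errs on the high side as intended.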
There are two other small, unrelated changes in this pull request as well:
1. The regression test filter for filtering out SYSKEYS has been changed.
A SYSKEY currently has at least 15 decimal digits, but the filter assumed at
least 16. This could cause regression failures if someone was very unlucky
and got just the wrong Linux thread ID for their process.
2. An uninitialized member of class ExRtFragTable is now initialized. This
is a long-standing bug; the changes for pull request
https://github.com/apache/trafodion/pull/1724 made it observable. For random
parallel queries, the Executor GUI might come up at run time if the
uninitialized value happened to be non-zero.
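
The effect of the SYSKEY filter change in item 1 can be sketched as follows.
This is only an illustration: the patterns, the placeholder text, and the class
name are assumptions, and the actual regression filter is not reproduced here.
The only detail taken from the description is the minimum digit count changing
from 16 to 15.

```java
import java.util.regex.Pattern;

// Hypothetical sketch; the patterns and placeholder are illustrative and
// are not the actual regression test filter.
class SyskeyFilterSketch {
    // Assumed old filter: only runs of 16 or more digits count as a SYSKEY.
    static final Pattern OLD_FILTER = Pattern.compile("\\b\\d{16,}\\b");
    // Assumed new filter: runs of 15 or more digits count as a SYSKEY.
    static final Pattern NEW_FILTER = Pattern.compile("\\b\\d{15,}\\b");

    // Replace every match with a fixed placeholder so that test output is
    // stable from run to run.
    static String mask(Pattern filter, String line) {
        return filter.matcher(line).replaceAll("<SYSKEY>");
    }

    public static void main(String[] args) {
        String line = "SYSKEY: 123456789012345 A: 1"; // 15-digit value
        System.out.println(mask(OLD_FILTER, line));   // value is left as-is
        System.out.println(mask(NEW_FILTER, line));   // value is masked
    }
}
```

A 15-digit value slips through the 16-digit pattern unmasked, which is the kind
of mismatch that would make a regression test's expected output differ.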
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/DaveBirdsall/trafodion Trafodion3223
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/trafodion/pull/1730.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #1730
----
commit 898812f84c510ab8798d5af6e3e63559f4078a07
Author: Dave Birdsall <dbirdsall@...>
Date: 2018-10-17T22:06:44Z
[TRAFODION-3223] Don't scale down for non-Puts when estimating row counts
----