GitHub user DaveBirdsall opened a pull request:

    https://github.com/apache/trafodion/pull/1730

    [TRAFODION-3223] Don't scale down for non-Puts when estimating row counts

    The estimateRowCount code in HBaseClient.java tried to scale down row 
counts by the proportion of non-Put cells in the file. That is, it was trying 
to estimate row count from cell count, in part by discounting the effect of 
Delete tombstone cells. It was doing this on the basis of a sample of 500 rows 
in one HFile.
    
    We find, however, that with time-ordered data that is aged out, the Delete 
cells are not uniformly distributed but instead tend to clump in one place. If 
we are unlucky and get an HFile that begins with 500 Delete tombstones, we will 
incorrectly assume most of the table consists of deleted rows and drastically 
underestimate the number of rows.
    
    Drastically underestimating can be very bad. It is much better to 
overestimate. So the code that attempted to scale down row count based on the 
number of non-Put cells has been deleted. Also, if we find that the number of 
Puts in our sample is very small (< 50), we will instead ignore the sample and 
use the total number of entries.
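
As a rough illustration of the revised logic (a minimal sketch, not the actual code in HBaseClient.java): the class name, method name, and parameters below are hypothetical, and the sketch assumes the estimate divides the total entry count by the average number of Put cells per sampled row. The cap at the total entry count is also an assumption of the sketch.

```java
public final class RowCountEstimateSketch {
    // Threshold from the description above: samples with fewer than
    // 50 Put cells are considered too unreliable to use.
    static final long MIN_PUTS_IN_SAMPLE = 50;

    /**
     * Hypothetical helper: estimates a row count from a cell (entry) count.
     *
     * @param totalEntries    total number of cells reported for the table
     * @param putCellsSampled number of Put cells seen in the sample
     * @param rowsSampled     number of rows seen in the sample (e.g. 500)
     */
    static long estimateRows(long totalEntries, long putCellsSampled, long rowsSampled) {
        // Too few Puts in the sample (for example, the sampled HFile began
        // with a clump of Delete tombstones): ignore the sample and fall
        // back to the total entry count, which overestimates.
        if (putCellsSampled < MIN_PUTS_IN_SAMPLE || rowsSampled <= 0) {
            return totalEntries;
        }
        // Average number of Put cells per sampled row.
        double putCellsPerRow = (double) putCellsSampled / rowsSampled;
        // No scale-down for non-Put cells: every entry is treated as
        // belonging to a live row.
        long estimate = Math.round(totalEntries / putCellsPerRow);
        // Assumption of this sketch: never report more rows than entries.
        return Math.min(totalEntries, estimate);
    }

    public static void main(String[] args) {
        // 2000 Put cells over 500 sampled rows is 4 cells per row.
        System.out.println(estimateRows(1_000_000L, 2_000L, 500L)); // 250000
        // Only 10 Puts in the sample: fall back to the total entry count.
        System.out.println(estimateRows(1_000_000L, 10L, 500L));    // 1000000
    }
}
```

Both branches err on the side of overestimating, which matches the stated preference for overestimates over drastic underestimates.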
    
    The changes described above are in HBaseClient.java.
    
    There are two other small, unrelated changes in this pull request as well:
    
    1. The regression test filter for filtering out SYSKEYS has been changed. 
The current minimum number of decimal digits in a SYSKEY is 15; the filter was 
assuming they were at least 16 digits. This would lead to regression failures 
if someone was very unlucky and got just the wrong Linux thread ID for their 
process.
    
    2. An uninitialized member of class ExRtFragTable is now initialized. This 
is a long-standing bug; the changes for pull request 
https://github.com/apache/trafodion/pull/1724 made it observable. For parallel 
queries, the Executor GUI might come up at run time, seemingly at random, if the 
uninitialized value happened to be non-zero.
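
The SYSKEY filter change in item 1 can be illustrated with a small sketch. The real regression filter is not shown in this message, so the patterns, the placeholder text, and the class below are all hypothetical; the sketch only shows why a filter requiring at least 16 digits misses a 15-digit SYSKEY.

```java
import java.util.regex.Pattern;

public final class SyskeyFilterSketch {
    // Hypothetical old filter: assumes a SYSKEY has at least 16 decimal digits.
    static final Pattern OLD_FILTER = Pattern.compile("\\b\\d{16,}\\b");
    // Hypothetical new filter: accepts the current minimum of 15 digits.
    static final Pattern NEW_FILTER = Pattern.compile("\\b\\d{15,}\\b");

    // Replaces every match with a fixed placeholder so that test output
    // is stable from run to run.
    static String mask(Pattern filter, String line) {
        return filter.matcher(line).replaceAll("@syskey@");
    }

    public static void main(String[] args) {
        String line = "SYSKEY 123456789012345 A 1"; // 15-digit value
        System.out.println(mask(OLD_FILTER, line)); // value left unmasked
        System.out.println(mask(NEW_FILTER, line)); // SYSKEY @syskey@ A 1
    }
}
```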

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/DaveBirdsall/trafodion Trafodion3223

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/trafodion/pull/1730.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1730
    
----
commit 898812f84c510ab8798d5af6e3e63559f4078a07
Author: Dave Birdsall <dbirdsall@...>
Date:   2018-10-17T22:06:44Z

    [TRAFODION-3223] Don't scale down for non-Puts when estimating row counts

----

