Csaba Ringhofer created IMPALA-11278:
----------------------------------------

             Summary: Cardinality of small HBase regions is overestimated since HBASE-26340
                 Key: IMPALA-11278
                 URL: https://issues.apache.org/jira/browse/IMPALA-11278
             Project: IMPALA
          Issue Type: Bug
          Components: Catalog, Frontend
    Affects Versions: Impala 4.1.0
            Reporter: Csaba Ringhofer


Impala uses the size of an HBase region to estimate the number of rows, and the 
API we use 
(https://hbase.apache.org/2.4/apidocs/org/apache/hadoop/hbase/RegionLoad.html#getStorefileSizeMB()
 ) returns the size at MB precision. Since HBASE-26340 it returns 1 instead of 0 
for very small but non-empty tables, which leads to massively overestimating 
their size (we handle 0 in a special way, so we did not estimate the row count 
as 0 before the change: 
https://github.com/apache/impala/blob/78609dca32d8ce996247c9552ba676a853c74686/fe/src/main/java/org/apache/impala/catalog/FeHBaseTable.java#L585
 )

In newer versions of HBase, getStorefileSizeMB() is deprecated and there are 
functions to get the size at byte granularity. Using them could solve the 
massive overestimation, but it may make our planner tests more sensitive to 
small size changes in HBase regions.
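
To make the size of the error concrete, here is a minimal, self-contained 
sketch of the arithmetic. It is not the code in FeHBaseTable.java and does not 
call HBase; the class name, both helper methods, the 2 KB region size and the 
100-byte average row size are all assumptions chosen for illustration.

```java
// Illustrative only: hypothetical helpers, not Impala's or HBase's actual API.
class RegionRowEstimate {
    static final long BYTES_PER_MB = 1024L * 1024L;

    // Row estimate when the region size is only known in whole megabytes.
    static long estimateRowsFromMb(long storeFileSizeMb, long avgRowSizeBytes) {
        return (storeFileSizeMb * BYTES_PER_MB) / avgRowSizeBytes;
    }

    // Row estimate when the region size is known in bytes.
    static long estimateRowsFromBytes(long storeFileSizeBytes, long avgRowSizeBytes) {
        return storeFileSizeBytes / avgRowSizeBytes;
    }

    public static void main(String[] args) {
        long actualBytes = 2048; // assumed: a tiny, non-empty region of 2 KB
        long avgRowSize = 100;   // assumed: average row size in bytes
        long reportedMb = 1;     // MB-precision value reported as 1 instead of 0

        System.out.println("MB-based estimate:   "
            + estimateRowsFromMb(reportedMb, avgRowSize));
        System.out.println("byte-based estimate: "
            + estimateRowsFromBytes(actualBytes, avgRowSize));
    }
}
```

Under these assumed numbers the MB-based estimate is roughly 500 times larger 
than the byte-based one. The byte-based estimate also changes whenever the 
region size changes by as little as one row, which is the test sensitivity 
concern mentioned above.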



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
