[ 
https://issues.apache.org/jira/browse/PHOENIX-4674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16453267#comment-16453267
 ] 

James Taylor commented on PHOENIX-4674:
---------------------------------------

Thanks for the test, [~abhishek.chouhan]. I tweaked it slightly, but the 
current behavior is working as designed. The reported statistics are meant to 
be an upper bound on the amount of data scanned. In this case, statistics have 
been collected, but we know we have less than one guidepost width of data. So 
we use the guidepost width as the bytes scanned and estimate the row count 
from our row width estimate. We could report 0 as the estimate of bytes/rows 
scanned, but the disadvantage is that if a very large guidepost width is 
configured, there may actually be a sizeable amount of data to scan (and the 
user would be given no indication of that).
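
To make the upper-bound arithmetic concrete, here is a minimal sketch. The 
class and the estimateRows helper are hypothetical names used only for 
illustration; this is not the actual code in BaseResultIterators. It assumes 
the estimate is simply the guidepost width divided by the estimated row size, 
using the 100 MB width and 100-byte row size from the issue description.

```java
public class GuidepostEstimate {

    // Hypothetical helper (not Phoenix's API): upper-bound row estimate for a
    // region that holds less than one guidepost width of data.
    static long estimateRows(long guidepostWidthBytes, long estimatedRowSizeBytes) {
        if (estimatedRowSizeBytes <= 0) {
            throw new IllegalArgumentException("estimated row size must be positive");
        }
        // Bytes scanned are assumed to equal the full guidepost width, so the
        // row estimate is that width divided by the per-row size estimate.
        return guidepostWidthBytes / estimatedRowSizeBytes;
    }

    public static void main(String[] args) {
        long guidepostWidthBytes = 100L * 1024 * 1024; // 100 MB guidepost width
        long estimatedRowSizeBytes = 100L;             // 100-byte row estimate

        // Prints 1048576: roughly a million rows, even if the table actually
        // holds fewer than 100.
        System.out.println(estimateRows(guidepostWidthBytes, estimatedRowSizeBytes));
    }
}
```

The result is an upper bound rather than a measurement: it says at most about 
a million rows could be scanned, not that the table contains that many.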

> Incorrect stats if data size is less than guidepost width
> ---------------------------------------------------------
>
>                 Key: PHOENIX-4674
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-4674
>             Project: Phoenix
>          Issue Type: Bug
>            Reporter: Abhishek Singh Chouhan
>            Assignee: Abhishek Singh Chouhan
>            Priority: Major
>         Attachments: PHOENIX-4674.patch
>
>
> For a small table, let's say one with a single region smaller than the 
> guidepost width, the stats after running UPDATE STATISTICS can be way off. 
> This is because we get an empty guidepost for the region, and in 
> BaseResultIterators we end up estimating the number of rows as 
> guidepost width / estimated row size of the table. For a table with <100 
> rows and a guidepost width of 100 MB, if the estimated row size is 100 bytes 
> we end up estimating a million rows.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
