[ 
https://issues.apache.org/jira/browse/PHOENIX-4008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16640407#comment-16640407
 ] 

Bin Shi commented on PHOENIX-4008:
----------------------------------

[~aertoria], thanks for the comments. See my answers and some additional notes 
below.
 # Yes, we reached agreement that stats should be collected over all versions 
of data. While we think it's OK to commit this change and tolerate the 
suboptimal performance of table sampling for now, I have opened PHOENIX-4912, 
in which I propose an enhanced table sampling algorithm to address the issue 
"Make Table Sampling algorithm to accommodate to the imbalance row 
distribution across guide posts".
 # This Jira only addresses "collecting multiple versions of cells" and I 
opened another Jira (PHOENIX-4913) to track the issue of having stats to 
collect the deleted rows.
 # With stats collecting all versions of cells (excluding the deleted rows), 
before and after major compaction, both the current table sampling algorithm 
and the enhanced algorithm proposed in PHOENIX-4912 are *consistent*.
 # With stats collecting the deleted rows, before and after major compaction, 
both the current table sampling algorithm and the enhanced algorithm proposed 
in PHOENIX-4912 are *inconsistent*.
 # [~aertoria], could you give me more details on "This is useful for example 
when a user want to efficiently compare if table matches. People starting 
asking about it since 4.13."? For now, I'm not convinced that table sampling 
can be applied to table matching.
 # If a table splits, in my opinion, the guide posts need NOT be rewritten, but 
when we generate parallel scans, we need to be careful about the case where the 
region boundaries and the guide post boundaries aren't aligned.
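
To illustrate the last point, here is a minimal sketch of one way to handle 
unaligned boundaries: merge the region split keys and the guide post keys into 
one sorted set of cut points, so that no scan range crosses a region boundary. 
This is not Phoenix's actual implementation. The class name, the method name, 
and the use of String keys (real row keys are byte arrays) are assumptions 
made for illustration only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

public class ScanRangeSketch {

    /**
     * Hypothetical helper: builds half-open scan ranges from region split keys
     * and guide post keys. An empty string stands for an unbounded start or end.
     */
    static List<String> splitRanges(List<String> regionSplits, List<String> guidePosts) {
        // Merging both key sets means every region boundary is also a cut point,
        // even when it falls between two guide posts.
        TreeSet<String> cuts = new TreeSet<>(regionSplits);
        cuts.addAll(guidePosts);

        List<String> ranges = new ArrayList<>();
        String start = "";
        for (String cut : cuts) {
            if (cut.isEmpty()) {
                continue; // the empty key is the table start, not a cut point
            }
            ranges.add("[" + start + "," + cut + ")");
            start = cut;
        }
        ranges.add("[" + start + ",)"); // last range runs to the end of the table
        return ranges;
    }

    public static void main(String[] args) {
        // One region split at "k", guide posts at "f" and "p": the split key
        // falls between the two guide posts, so it becomes an extra cut point.
        System.out.println(splitRanges(Arrays.asList("k"), Arrays.asList("f", "p")));
    }
}
```

With this approach the guide posts themselves stay untouched after a split. 
Only the scan ranges derived from them change.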

> UPDATE STATISTIC should collect all versions of cells
> -----------------------------------------------------
>
>                 Key: PHOENIX-4008
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-4008
>             Project: Phoenix
>          Issue Type: Bug
>            Reporter: Samarth Jain
>            Assignee: Bin Shi
>            Priority: Major
>             Fix For: 4.15.0, 5.1.0
>
>         Attachments: PHOENIX-4008_0918.patch, PHOENIX-4008_0920.patch, 
> PHONEIX-4008.4.X-HBase-1.2.001.patch, PHONEIX-4008.4.X-HBase-1.3.001.patch, 
> PHONEIX-4008.4.X-HBase-1.4.001.patch
>
>
> In order to truly measure the size of data when calculating guide posts, 
> UPDATE STATISTIC should take into account all versions of cells. We should 
> also be setting the max versions on the scan.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
