[
https://issues.apache.org/jira/browse/PHOENIX-4008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16640407#comment-16640407
]
Bin Shi commented on PHOENIX-4008:
----------------------------------
[~aertoria], thanks for the comments. See my answers and some additional
remarks below.
# Yes, we agreed that stats should collect all versions of the data. While we
think it's OK to commit this change and tolerate the suboptimal performance of
table sampling for now, I have opened JIRA PHOENIX-4912, in which I propose an
enhanced table sampling algorithm to address the "Make Table Sampling
algorithm to accommodate to the imbalance row distribution across guide posts"
issue.
# This Jira only addresses "collecting multiple versions of cells" and I
opened another Jira (PHOENIX-4913) to track the issue of having stats to
collect the deleted rows.
# With stats collecting all versions of cells (excluding the deleted rows),
before and after major compaction, both the current table sampling algorithm
and the enhanced algorithm proposed in PHOENIX-4912 are *consistent*.
# With stats collecting the deleted rows, before and after major compaction,
both the current table sampling algorithm and the enhanced algorithm proposed
in PHOENIX-4912 are *inconsistent*.
# [~aertoria], could you give me more details about "This is useful for example
when a user want to efficiently compare if table matches. People starting
asking about it since 4.13."? For now, I'm not convinced that table sampling
can be applied to table matching.
# If a table splits, in my opinion, guide posts do NOT need to be rewritten, but
when we generate parallel scans, we need to handle the case where the region
boundaries and the guide post boundaries aren't aligned.
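To illustrate the last point, here is a minimal sketch of how scan ranges could be cut when region boundaries and guide posts do not line up. It is not Phoenix's actual code: the class `ParallelScanSketch`, the method `scanRanges`, and the use of `String` keys (with `null` meaning "unbounded") are all assumptions made for illustration. Real row keys are byte arrays compared byte-wise. The idea is simply to take the union of both sets of boundaries, so that no scan range crosses a region boundary even though the guide posts were written before the split.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Hypothetical sketch, not Phoenix code: cuts the key space into scan ranges.
final class ParallelScanSketch {

    /**
     * Returns half-open ranges [start, end) covering the whole key space.
     * A null start or end means "unbounded" on that side.
     * Keys are Strings here for readability (an assumption of this sketch).
     */
    static List<String[]> scanRanges(List<String> regionSplits,
                                     List<String> guidePosts) {
        // The union of both boundary sets, sorted and de-duplicated, so a
        // guide post that coincides with a region boundary is used only once.
        TreeSet<String> cuts = new TreeSet<>();
        cuts.addAll(regionSplits);
        cuts.addAll(guidePosts);

        List<String[]> ranges = new ArrayList<>();
        String start = null; // begin at the unbounded start of the table
        for (String cut : cuts) {
            ranges.add(new String[] {start, cut});
            start = cut;
        }
        ranges.add(new String[] {start, null}); // tail up to the table end
        return ranges;
    }

    public static void main(String[] args) {
        // One region split at "m"; guide posts collected before the split.
        for (String[] r : scanRanges(List.of("m"), List.of("c", "h", "p"))) {
            System.out.println("[" + r[0] + ", " + r[1] + ")");
        }
    }
}
```

In this example the guide post interval from "h" to "p" straddles the new region boundary "m", so it is divided into two scans, one per region, and the guide posts themselves stay untouched.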
> UPDATE STATISTIC should collect all versions of cells
> -----------------------------------------------------
>
> Key: PHOENIX-4008
> URL: https://issues.apache.org/jira/browse/PHOENIX-4008
> Project: Phoenix
> Issue Type: Bug
> Reporter: Samarth Jain
> Assignee: Bin Shi
> Priority: Major
> Fix For: 4.15.0, 5.1.0
>
> Attachments: PHOENIX-4008_0918.patch, PHOENIX-4008_0920.patch,
> PHONEIX-4008.4.X-HBase-1.2.001.patch, PHONEIX-4008.4.X-HBase-1.3.001.patch,
> PHONEIX-4008.4.X-HBase-1.4.001.patch
>
>
> In order to truly measure the size of data when calculating guide posts,
> UPDATE STATISTIC should take into account all versions of cells. We should
> also be setting the max versions on the scan.
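The effect described in the issue summary can be sketched with a small, self-contained model. This is not the Phoenix stats collector: `GuidePostSketch`, `CellVersion`, and `guidePosts` are hypothetical names, and the rule "emit a guide post whenever the accumulated byte count reaches a fixed width" is a simplifying assumption. In the HBase client API, requesting every version corresponds to `Scan.setMaxVersions()` in 1.x or `Scan.readAllVersions()` in 2.x; here that choice is modelled by the `allVersions` flag.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model, not Phoenix code: places guide posts by byte count.
final class GuidePostSketch {

    /** One stored version of a cell: its row key, size, and recency. */
    static final class CellVersion {
        final String rowKey;
        final int sizeBytes;
        final boolean latest;

        CellVersion(String rowKey, int sizeBytes, boolean latest) {
            this.rowKey = rowKey;
            this.sizeBytes = sizeBytes;
            this.latest = latest;
        }
    }

    /**
     * Walks cells in row-key order and records a guide post each time the
     * accumulated size reaches widthBytes. When allVersions is false, older
     * versions are skipped, as a scan limited to one version would do.
     */
    static List<String> guidePosts(List<CellVersion> cellsInKeyOrder,
                                   long widthBytes, boolean allVersions) {
        List<String> posts = new ArrayList<>();
        long accumulated = 0;
        for (CellVersion c : cellsInKeyOrder) {
            if (!allVersions && !c.latest) {
                continue; // older version is invisible to this scan
            }
            accumulated += c.sizeBytes;
            if (accumulated >= widthBytes) {
                // Avoid recording the same row key twice in a row.
                if (posts.isEmpty()
                        || !posts.get(posts.size() - 1).equals(c.rowKey)) {
                    posts.add(c.rowKey);
                }
                accumulated = 0;
            }
        }
        return posts;
    }
}
```

With three 40-byte versions of row "a", one of row "b", and two of row "c", a 100-byte width yields a single guide post when only the latest versions are counted, but two when all versions are counted. The second result reflects the bytes a scan actually has to read before major compaction.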
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)