[
https://issues.apache.org/jira/browse/PHOENIX-4008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16639447#comment-16639447
]
Ethan Wang commented on PHOENIX-4008:
-------------------------------------
Hey [~Bin Shi], [~tdsilva] and [~karanmehta93], after reading through the
thread, here are some additional thoughts.
TL;DR, I'm with [~tdsilva]'s earlier comment that, in principle, whether stats
should or should not collect deleted data should be consistent with its use
case. But I also think tracking that on a table-by-table basis is overkill and
will not be worth maintaining. So a system-wide decision should probably be
made and kept consistent.
It seems you have already agreed to have stats collect all versions of data and
to tolerate the suboptimal performance of table sampling. I think this is a
good choice, especially since performance is restored to its maximum after each
major compaction, when stats get refreshed.
One thing to watch out for: another feature provided by table sampling is
consistent sampling, which means you can sample the exact same table many times
and it always gives you the same sample set. This is useful, for example, when
a user wants to efficiently compare whether tables match. People have been
asking about it since 4.13.
Now, if the stats table collects all versions of rows, the data table changes
after major compaction. I'm not sure whether this will cause non-deterministic
inconsistency in sampling.
On that note, if a table splits, the guideposts will be rewritten, correct?
Then maybe this won't make it worse.
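To make the consistency concern concrete, here is a minimal sketch. It assumes
(my assumption for illustration, not the actual Phoenix implementation) that
sampling keeps a guidepost when a stable hash of its key falls under the
sampling rate. The function name sample_guideposts and the guidepost keys are
hypothetical.

```python
import hashlib


def sample_guideposts(guideposts, rate_percent):
    """Keep a guidepost when a stable hash of its key falls under the rate.

    Hypothetical model of consistent sampling: the decision depends only on
    the guidepost key, so the same guideposts always give the same sample.
    """
    kept = []
    for key in guideposts:
        digest = hashlib.md5(key.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % 100  # stable 0..99
        if bucket < rate_percent:
            kept.append(key)
    return kept


# Made-up guidepost keys before and after stats are refreshed.
before_compaction = ["row100", "row200", "row300", "row400"]
after_compaction = ["row150", "row300", "row450"]

# Same guideposts, same rate: the sample is identical on every call.
first = sample_guideposts(before_compaction, 50)
second = sample_guideposts(before_compaction, 50)

# Different guideposts: nothing guarantees the sample matches the old one.
refreshed = sample_guideposts(after_compaction, 50)
```

Under this model the sample is only as stable as the guideposts themselves, so
anything that rewrites them (a stats refresh after major compaction, or a
split) can change the sample set.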
BTW [~Bin Shi] your meeting summary is very helpful.
> UPDATE STATISTIC should collect all versions of cells
> -----------------------------------------------------
>
> Key: PHOENIX-4008
> URL: https://issues.apache.org/jira/browse/PHOENIX-4008
> Project: Phoenix
> Issue Type: Bug
> Reporter: Samarth Jain
> Assignee: Bin Shi
> Priority: Major
> Fix For: 4.15.0, 5.1.0
>
> Attachments: PHOENIX-4008_0918.patch, PHOENIX-4008_0920.patch,
> PHONEIX-4008.4.X-HBase-1.2.001.patch, PHONEIX-4008.4.X-HBase-1.3.001.patch,
> PHONEIX-4008.4.X-HBase-1.4.001.patch
>
>
> In order to truly measure the size of data when calculating guide posts,
> UPDATE STATISTIC should take into account all versions of cells. We should
> also be setting the max versions on the scan.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)