[
https://issues.apache.org/jira/browse/PHOENIX-4008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16622855#comment-16622855
]
Bin Shi commented on PHOENIX-4008:
----------------------------------
Below is the summary of yesterday's discussion with [~tdsilva] and
[~karanmehta93].
# We now agree that Scan.setRaw(true) returns the deleted rows but does not
return all versions of cells. To return all versions of cells, the only API we
need is Scan.readAllVersions(), which is the one used by this change.
# In order to truly measure the size of data when calculating guide posts, we
need to collect multiple versions of cells (which is what this change is for)
and the deleted rows. Whether a SELECT query returns only the latest version of
cells or multiple versions is orthogonal to Stats, which measures the cost of
scans and is used for query complexity estimation and query optimization.
# The current implementation of table sampling is based on the assumption
"every two consecutive guide posts contain an equal number of rows", which
isn't accurate in practice, and once we collect multiple versions of cells and
the deleted rows, the inaccuracy will get worse. The conclusion is that it's OK
to commit the change for collecting multiple versions of cells, and a separate
Jira (PHOENIX-4912) has been opened to track the table sampling issue.
# We agree to open another Jira (PHOENIX-4913) to track the issue "UPDATE
STATISTICS should run raw scan to collect the deleted rows" and to prioritize
this work item alongside the other work items for Stats, including freshness
SLA, accuracy, resiliency and performance.
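
To illustrate point 3, here is a minimal, self-contained sketch of why the
"equal number of rows between guide posts" assumption breaks down once extra
cell versions are counted. It is not Phoenix code: the class and method names
are hypothetical, the byte sizes are made up, and it assumes a guide post is
emitted each time roughly a fixed number of bytes has been traversed.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical toy model, not part of Phoenix or HBase.
public class GuidePostSketch {

    /**
     * Walks the rows in key order and closes a guide-post interval whenever
     * the accumulated bytes reach the (assumed) guide-post width. Returns the
     * number of rows that landed in each interval.
     */
    static List<Integer> rowsPerGuidePost(long[] rowSizesInBytes, long guidePostWidthBytes) {
        List<Integer> counts = new ArrayList<>();
        long bytesInInterval = 0;
        int rowsInInterval = 0;
        for (long size : rowSizesInBytes) {
            bytesInInterval += size;
            rowsInInterval++;
            if (bytesInInterval >= guidePostWidthBytes) {
                counts.add(rowsInInterval);
                bytesInInterval = 0;
                rowsInInterval = 0;
            }
        }
        if (rowsInInterval > 0) {
            counts.add(rowsInInterval); // trailing partial interval
        }
        return counts;
    }

    public static void main(String[] args) {
        // Only the latest version is measured: every row is 100 bytes.
        long[] latestOnly = {100, 100, 100, 100, 100, 100, 100, 100};
        // All versions are measured: the last four rows carry four versions each.
        long[] allVersions = {100, 100, 100, 100, 400, 400, 400, 400};

        System.out.println(rowsPerGuidePost(latestOnly, 400));  // [4, 4]
        System.out.println(rowsPerGuidePost(allVersions, 400)); // [4, 1, 1, 1, 1]
    }
}
```

In this toy data, sampling a fixed fraction of guide posts picks up a fair
share of rows in the first case but a skewed share in the second, which is the
concern tracked by PHOENIX-4912.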
> UPDATE STATISTIC should collect all versions of cells
> -----------------------------------------------------
>
> Key: PHOENIX-4008
> URL: https://issues.apache.org/jira/browse/PHOENIX-4008
> Project: Phoenix
> Issue Type: Bug
> Reporter: Samarth Jain
> Assignee: Bin Shi
> Priority: Major
> Attachments: PHOENIX-4008_0918.patch, PHOENIX-4008_0920.patch
>
>
> In order to truly measure the size of data when calculating guide posts,
> UPDATE STATISTIC should take into account all versions of cells. We should
> also be setting the max versions on the scan.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)