Github user BinShi-SecularBird commented on a diff in the pull request:
https://github.com/apache/phoenix/pull/351#discussion_r218897162
--- Diff:
phoenix-core/src/main/java/org/apache/phoenix/schema/MetaDataClient.java ---
@@ -1279,6 +1279,7 @@ private long updateStatisticsInternal(PName
physicalName, PTable logicalTable, M
MutationPlan plan =
compiler.compile(Collections.singletonList(tableRef), null, cfs, null,
clientTimeStamp);
Scan scan = plan.getContext().getScan();
scan.setCacheBlocks(false);
+ scan.readAllVersions();
--- End diff --
@twdsilva, @karanmehta93, could you please confirm that "setRaw() will
provide all versions of cells"? I used setRaw(true) without calling the
readAllVersions() API and stepped through the internals in a debugger: the
scan returns only deleted rows and the latest version of each cell, which
matches the comment on Scan.setRaw() in the code.
Now let's discuss how table sampling relates to the two cases: counting
multiple versions in the stats, and including deleted rows in the stats.
1. Counting multiple versions in the stats.
Yes, with the current implementation of table sampling this will make the
sampling result inaccurate. The current implementation (see
BaseResultIterators.getParallelScan(), which calls sampleScans(...) at the end
of the function) iterates over all generated parallel scans; for each scan, if
the hash code of the scan's start row key < tableSamplingRate (see
TableSamplerPredicate.java), it picks the scan, otherwise it discards it. The
algorithm assumes that the range between any two consecutive guide posts holds
an equal number of rows. Even without counting multiple versions in the stats,
we know this assumption makes the sampling result inaccurate. The right
algorithm for table sampling should be based on the row count of each guide
post. With the right algorithm, counting multiple versions in the stats does
not conflict with table sampling.
2. Counting the deleted rows in the stats.
We discussed this in https://issues.apache.org/jira/browse/PHOENIX-4008. What
I propose now is that we call setRaw(true) to collect all deleted rows, which
contribute only to the estimated size of the guide post and NOT to its
estimated row count. This solution does not conflict with table sampling, and
query complexity estimation based on estimated size remains accurate.
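To make point 1 concrete, here is a rough sketch contrasting the two selection
strategies. It is not Phoenix code: ScanRange, sampleByKeyHash and
sampleByRowCount are hypothetical names, Arrays.hashCode stands in for whatever
hash the real predicate uses, and the row-count-aware variant is just one
possible way to weight by rows, not a proposal for the final algorithm.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only; none of these names are Phoenix classes.
class GuidePostSamplingSketch {

    /** Hypothetical stand-in for one parallel scan between two guide posts. */
    static final class ScanRange {
        final byte[] startKey;
        final long estimatedRows;

        ScanRange(byte[] startKey, long estimatedRows) {
            this.startKey = startKey;
            this.estimatedRows = estimatedRows;
        }
    }

    /** Key-hash selection: a range is kept based on its start key alone. */
    static List<ScanRange> sampleByKeyHash(List<ScanRange> ranges, int ratePercent) {
        List<ScanRange> picked = new ArrayList<>();
        for (ScanRange r : ranges) {
            // Assumption: Arrays.hashCode replaces the real hash function.
            int bucket = Math.floorMod(Arrays.hashCode(r.startKey), 100);
            if (bucket < ratePercent) {
                picked.add(r);
            }
        }
        return picked;
    }

    /** Row-count-aware selection: keeps sampled rows close to ratePercent of rows seen. */
    static List<ScanRange> sampleByRowCount(List<ScanRange> ranges, int ratePercent) {
        List<ScanRange> picked = new ArrayList<>();
        long owed = 0; // (rows we should have sampled - rows sampled), scaled by 100
        for (ScanRange r : ranges) {
            owed += r.estimatedRows * ratePercent;
            // Pick the range when that leaves a smaller error than skipping it.
            if (r.estimatedRows > 0 && 2 * owed >= r.estimatedRows * 100) {
                picked.add(r);
                owed -= r.estimatedRows * 100;
            }
        }
        return picked;
    }

    static long totalRows(List<ScanRange> ranges) {
        long total = 0;
        for (ScanRange r : ranges) {
            total += r.estimatedRows;
        }
        return total;
    }

    public static void main(String[] args) {
        List<ScanRange> ranges = Arrays.asList(
                new ScanRange(new byte[] {1}, 100),
                new ScanRange(new byte[] {2}, 100),
                new ScanRange(new byte[] {3}, 100),
                new ScanRange(new byte[] {4}, 100));
        System.out.println("key hash rows:  " + totalRows(sampleByKeyHash(ranges, 50)));
        System.out.println("row count rows: " + totalRows(sampleByRowCount(ranges, 50)));
    }
}
```

Because whole ranges are still picked or discarded, the row-count-aware
variant stays coarse when a single range dominates the table; it only removes
the dependence on the equal-rows assumption.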
Regarding "the select query will return the latest version ... then it
would mean that we probably don't need this PR at all", I don't think so. It
doesn't matter whether the select query returns the latest version or
multiple versions; the important thing is that we need to include multiple
versions of cells to make the estimated size of each guide post more accurate
for query complexity estimation and for the cost of a scan.
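The accounting I have in mind can be sketched as follows. This is only an
illustration under assumptions: GuidePostEstimateSketch and RawRow are
hypothetical names, not Phoenix or HBase classes, and a real collector would
work on cells from the raw scan rather than on pre-summed rows.

```java
// Illustrative sketch only; these are not Phoenix or HBase classes.
class GuidePostEstimateSketch {

    /** Hypothetical summary of one row as seen by a raw, all-versions scan. */
    static final class RawRow {
        final long bytesAllVersions; // every cell version plus any delete markers
        final boolean deleted;

        RawRow(long bytesAllVersions, boolean deleted) {
            this.bytesAllVersions = bytesAllVersions;
            this.deleted = deleted;
        }
    }

    private long estimatedBytes;
    private long estimatedRows;

    /** Every row adds to the size estimate; only live rows add to the row count. */
    void add(RawRow row) {
        estimatedBytes += row.bytesAllVersions;
        if (!row.deleted) {
            estimatedRows++;
        }
    }

    long estimatedBytes() {
        return estimatedBytes;
    }

    long estimatedRows() {
        return estimatedRows;
    }

    public static void main(String[] args) {
        GuidePostEstimateSketch guidePost = new GuidePostEstimateSketch();
        guidePost.add(new RawRow(300, false)); // live row with several versions
        guidePost.add(new RawRow(120, true));  // deleted row still occupies space
        guidePost.add(new RawRow(80, false));
        System.out.println(guidePost.estimatedBytes() + " bytes, "
                + guidePost.estimatedRows() + " rows");
    }
}
```

The size estimate then reflects what a scan actually has to read, while the
row count stays aligned with what a select query can return.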
---