[
https://issues.apache.org/jira/browse/PHOENIX-258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15241760#comment-15241760
]
James Taylor commented on PHOENIX-258:
--------------------------------------
[~lhofhansl] - are you OK if we assign this to you (since you've been dabbling
with it recently)?
FYI, to get the perf gain of the skip scan, I believe you'll need to pass a new
boolean to the SkipScanFilter that indicates it's being used for DISTINCT.
Otherwise the filter will include the duplicate rows rather than skipping them.
So you'd need a tweak like this in SkipScanFilter.navigate():
{code}
// First check to see if we're in-range until we reach our end key
if (endKeyLength > 0) {
    if (!this.isDistinct && Bytes.compareTo(currentKey, offset, length,
            endKey, 0, endKeyLength) < 0) {
        return getIncludeReturnCode();
    }
{code}
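To make the intent of the isDistinct flag concrete, here is a minimal,
self-contained sketch of the "seek past the duplicates" idea. It is not Phoenix
code: the class DistinctSkipScanSketch, the method distinctLeading, and the
string row-key encoding (leading column, a \u0000 separator, then the rest of
the key) are all assumptions made for illustration, and a TreeSet stands in for
the sorted rows of a region.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

final class DistinctSkipScanSketch {
    // Hypothetical separator between the leading column and the rest of the
    // row key. Assumes column values never contain \u0000 or \u0001.
    static final char SEP = '\u0000';

    // Returns the distinct leading-column values, doing one seek per distinct
    // value instead of visiting every row.
    static List<String> distinctLeading(NavigableSet<String> rowKeys) {
        List<String> result = new ArrayList<>();
        String key = rowKeys.isEmpty() ? null : rowKeys.first();
        while (key != null) {
            // Assumes every key contains SEP.
            String leading = key.substring(0, key.indexOf(SEP));
            result.add(leading);
            // Every key sharing this leading value sorts below
            // leading + '\u0001', so this seek skips all of its duplicates.
            key = rowKeys.ceiling(leading + (char) (SEP + 1));
        }
        return result;
    }

    public static void main(String[] args) {
        NavigableSet<String> rows = new TreeSet<>();
        rows.add("x" + SEP + "2016-01-01");
        rows.add("x" + SEP + "2016-01-02");
        rows.add("x" + SEP + "2016-01-03");
        rows.add("y" + SEP + "2016-01-01");
        System.out.println(distinctLeading(rows));
    }
}
```

Without the seek (i.e. returning "include" for every in-range row, as the
filter does today), the loop would visit all four rows here instead of two.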
Also, if there's a filter from the WHERE clause (i.e. any filtering left over
after computing the start/stop row of the scan), I suspect you won't be able to
perform this optimization. In that case you still need to traverse all the rows
before aggregating, since you can't know which of the duplicate rows match the
filter without looking at each of them.
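The caveat about a leftover WHERE filter can be shown with a small toy example.
This is a sketch under stated assumptions, not Phoenix code: the class
ResidualFilterSketch, both method names, and the {leading column, date} row
layout are hypothetical. It compares looking at only the first row of each
group (what skipping would amount to) with traversing every row.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

final class ResidualFilterSketch {
    // Rows are {leading column, date}, already sorted by row key.

    // Looks only at the first row of each group, as a skip would.
    static List<String> distinctFirstRowOnly(List<String[]> rows,
                                             Predicate<String[]> filter) {
        List<String> out = new ArrayList<>();
        String prev = null;
        for (String[] row : rows) {
            if (row[0].equals(prev)) {
                continue; // stands in for skipping the duplicates
            }
            prev = row[0];
            if (filter.test(row)) {
                out.add(row[0]);
            }
        }
        return out;
    }

    // Applies the filter to every row, then de-duplicates.
    static List<String> distinctFullTraversal(List<String[]> rows,
                                              Predicate<String[]> filter) {
        Set<String> out = new LinkedHashSet<>();
        for (String[] row : rows) {
            if (filter.test(row)) {
                out.add(row[0]);
            }
        }
        return new ArrayList<>(out);
    }

    public static void main(String[] args) {
        List<String[]> rows = Arrays.asList(
                new String[] {"x", "2016-01-01"},
                new String[] {"x", "2016-03-01"},
                new String[] {"y", "2016-03-05"});
        Predicate<String[]> inMarch = r -> r[1].startsWith("2016-03");
        System.out.println(distinctFirstRowOnly(rows, inMarch));
        System.out.println(distinctFullTraversal(rows, inMarch));
    }
}
```

The first-row-only version misses "x", because the first "x" row fails the
filter while a later one passes.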
> Use skip scan when SELECT DISTINCT on leading row key column(s)
> ---------------------------------------------------------------
>
> Key: PHOENIX-258
> URL: https://issues.apache.org/jira/browse/PHOENIX-258
> Project: Phoenix
> Issue Type: Task
> Reporter: ryang-sfdc
> Labels: gsoc2016
> Fix For: 4.8.0
>
>
> create table foo(a varchar(32) not null, date date not null constraint pk
> primary key(a,date))
> [["PLAN"],["CLIENT PARALLEL 94-WAY FULL SCAN OVER foo"],[" SERVER
> AGGREGATE INTO ORDERED DISTINCT ROWS BY [a]"],["CLIENT MERGE SORT"]]
>
> We should skip scan.