[jira] [Commented] (PHOENIX-258) Use skip scan when SELECT DISTINCT on leading row key column(s)

James Taylor (JIRA) Thu, 14 Apr 2016 12:24:06 -0700

    [ 
https://issues.apache.org/jira/browse/PHOENIX-258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15241760#comment-15241760
 ]


James Taylor commented on PHOENIX-258:
--------------------------------------

[~lhofhansl] - you ok if we assign this to you (as you've been dabbling with it 
recently)? 

FYI, to get the perf gain of the skip scan, I believe you'll need to pass a new 
boolean to the SkipScanFilter that indicates it's being used for DISTINCT. 
Otherwise, rather than skipping all duplicate rows, it's going to include them. 
So a tweak like this in SkipScanFilter.navigate():
{code}
        // First check to see if we're in-range until we reach our end key
        if (endKeyLength > 0) {
            if (!this.isDistinct && Bytes.compareTo(currentKey, offset, length, endKey, 0, endKeyLength) < 0) {
                return getIncludeReturnCode();
            }
{code}
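
To make the intended behavior concrete outside of Phoenix, here is a minimal, self-contained sketch of what "skipping all duplicate rows" means for a DISTINCT on the leading row key column. It is not the SkipScanFilter code: the class, the method names, and the NUL separator between key columns are all assumptions made for this illustration, and a sorted in-memory set stands in for the table.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

// Hypothetical illustration only; none of these names exist in Phoenix.
final class DistinctSkipScanSketch {
    // Assumed separator between the leading column and the rest of the row key.
    // The sketch assumes column values never contain this character.
    static final char SEP = '\u0000';

    // Builds a composite row key (a, date) as a single sortable string.
    static String rowKey(String a, String date) {
        return a + SEP + date;
    }

    // Returns the distinct leading-column values by seeking past each group
    // of duplicates instead of visiting every row in the group.
    static List<String> distinctLeading(NavigableSet<String> sortedRowKeys) {
        List<String> result = new ArrayList<>();
        String current = sortedRowKeys.isEmpty() ? null : sortedRowKeys.first();
        while (current != null) {
            String prefix = current.substring(0, current.indexOf(SEP));
            result.add(prefix);
            // Seek hint: sorts after every key of the form prefix + SEP + ...,
            // and no later than any key with a larger leading value.
            String seekHint = prefix + (char) (SEP + 1);
            current = sortedRowKeys.ceiling(seekHint);
        }
        return result;
    }
}
```

In this sketch the number of seeks equals the number of distinct leading values, rather than the number of rows, which is where the perf gain would come from.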

Also, if there's a filter from the WHERE clause (i.e. any remaining filtering 
that was left over after computing the start/stop row of the scan), I suspect 
you won't be able to perform this optimization. In that case, you'd still need 
to traverse all the rows before aggregating, since you wouldn't know which of 
the duplicate rows match the filter without looking at each of them.
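
A small sketch of why a leftover WHERE filter gets in the way, under stated assumptions: the Row class, its non-key "value" column, and both method names are hypothetical and only model the reasoning, not Phoenix's implementation. Looking at just the first row of each group of duplicates can miss a group whose matching row comes later.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical illustration only; none of these names exist in Phoenix.
final class ResidualFilterSketch {
    // A row with primary key (a, date) plus an assumed non-key column "value".
    static final class Row {
        final String a;
        final String date;
        final int value;

        Row(String a, String date, int value) {
            this.a = a;
            this.date = date;
            this.value = value;
        }
    }

    // Applies the residual filter to every row, then de-duplicates on "a".
    static List<String> distinctFullTraversal(List<Row> sortedRows, Predicate<Row> residual) {
        LinkedHashSet<String> out = new LinkedHashSet<>();
        for (Row r : sortedRows) {
            if (residual.test(r)) {
                out.add(r.a);
            }
        }
        return new ArrayList<>(out);
    }

    // Evaluates the filter only on the first row of each group, which is what
    // skipping the duplicates would amount to. This can drop valid groups.
    static List<String> distinctFirstRowOnly(List<Row> sortedRows, Predicate<Row> residual) {
        List<String> out = new ArrayList<>();
        String last = null;
        for (Row r : sortedRows) {
            if (r.a.equals(last)) {
                continue; // stands in for skipping the rest of the group
            }
            last = r.a;
            if (residual.test(r)) {
                out.add(r.a);
            }
        }
        return out;
    }
}
```

With rows (a, 2016-04-01, 1), (a, 2016-04-02, 5), (b, 2016-04-01, 5) and a filter of value > 3, the full traversal keeps both a and b, while the first-row-only variant loses a because its first row fails the filter.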

> Use skip scan when SELECT DISTINCT on leading row key column(s)
> ---------------------------------------------------------------
>
>                 Key: PHOENIX-258
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-258
>             Project: Phoenix
>          Issue Type: Task
>            Reporter: ryang-sfdc
>              Labels: gsoc2016
>             Fix For: 4.8.0
>
>
> create table(a varchar(32) not null, date date not null constraint pk primary 
> key(a,date))
> [["PLAN"],["CLIENT PARALLEL 94-WAY FULL SCAN OVER foo"],["    SERVER AGGREGATE INTO ORDERED DISTINCT ROWS BY [a]"],["CLIENT MERGE SORT"]]
>
> We should skip scan.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
