[jira] [Commented] (PHOENIX-258) Use skip scan when SELECT DISTINCT on leading row key column(s)

James Taylor (JIRA) Sat, 28 May 2016 15:08:34 -0700

    [ 
https://issues.apache.org/jira/browse/PHOENIX-258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15305654#comment-15305654
 ]


James Taylor commented on PHOENIX-258:
--------------------------------------

This is awesome, [~lhofhansl]! Thanks so much for working through this. Here's 
some feedback:
- In BaseResultIterators when you decide whether or not you can use the 
distinct filter, you can just check if context.getAggregateManager().isEmpty() 
as this means there are no aggregate functions in use. No need to loop through 
all the expressions. We need to go through every row if we're calculating some 
aggregate across the rows, but otherwise don't (and as you've discovered, a 
DISTINCT ends up as a GROUP BY.
- Also, I think it'd be better if OrderPreservingTracker remembered how many 
columns where traversed over rather than the check you have for {{cols < 
plan.getTableRef().getTable().getRowKeySchema().getFieldCount()}}. You can do 
that easily by adding a new orderPreservingColumnCount member variable to 
OrderPreservingTracker and set it at the bottom of the isOrderPreserving() to 
prevPos + prevSlotSpan. Then you can carry it through to the GroupBy object in 
GroupBy.compile().
- Add an IT or other test for a group by of a RVC, as that'll demonstrate the 
need for the above. For example:
{code}
SELECT (pk1,pk2) FROM t GROUP BY (pk1,pk2);
{code}
That would have a single expression, but still be optimizable.
- Minor nit, by why DistinctFixedPrefixFilter as the filter name instead of 
DistinctPrefixFilter? Why the word "Fixed" there?
- For the DistinctFixedPrefixFilter.getNextCellHint() method, you only need to 
call ByteUtil.nextKey() in the fixed length case, as adding the zero byte is 
essentially getting the next key.
- For the DistinctFixedPrefixFilter.getNextCellHint() method, I believe you'll 
need to tweak the logic if the scan is a reverse scan. I think instead of 
calling nextKey, you'd want to call previousKey, and instead of adding a zero 
byte, you need to BaseScannerRegionObserver.getReversedRow() - move it to 
ByteUtil instead. I've noticed HBase has the same hacky code in 
ClientScanner.createClosestRowBefore() - I wish HBase would've handled this 
different for reverse scans as it's really not 100% reliable. meh. :-). Anyway, 
to detect on the client-side so you can push through the filter if it's a 
reverse scan, use the ScanUtil.isReversed() method, as we don't set it on the 
Scan, but do it ourselves in our coprocessor for a variety of reasons. An IT or 
lower level test would be good for this.
- So when there's a WHERE clause, everything works out - HBase is pretty smart 
about that. What about the case where say the first row wouldn't match our 
boolean filter for the where clause, but the second one would? Would the 
distinct filter jump past it? For example:
{code}
SELECT DISTINCT p1, p2 FROM t WHERE p3 = 'foo'
{code}
With rows of p3='bar' followed by p3='foo' and both p1 and p2 being the same 
for those rows (and all three being pk columns). Would be good to have a test 
like this - how do you know when you can jump and when you can't as seems like 
it'd rely on what the other filters do.


> Use skip scan when SELECT DISTINCT on leading row key column(s)
> ---------------------------------------------------------------
>
>                 Key: PHOENIX-258
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-258
>             Project: Phoenix
>          Issue Type: Task
>            Reporter: ryang-sfdc
>            Assignee: Lars Hofhansl
>             Fix For: 4.8.0
>
>         Attachments: 258-WIP.txt, 258-v1.txt, 258-v2.txt, 258-v3.txt, 
> 258-v4.txt, 258-v5.txt, 258-v6.txt, 258.txt, DistinctFixedPrefixFilter.java, 
> in-clause.png
>
>
> create table(a varchar(32) not null, date date not null constraint pk primary 
> key(a,date))
> [["PLAN"],["CLIENT PARALLEL 94-WAY FULL SCAN OVER foo"],["    SERVER 
> AGGREGATE INTO ORDERED DISTINCT ROWS BY [a]"],["CLIENT MERGE SORT"]]          
>    
> We should skip scan.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (PHOENIX-258) Use skip scan when SELECT DISTINCT on leading row key column(s)

Reply via email to