[
https://issues.apache.org/jira/browse/PHOENIX-258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15305654#comment-15305654
]
James Taylor commented on PHOENIX-258:
--------------------------------------
This is awesome, [~lhofhansl]! Thanks so much for working through this. Here's
some feedback:
- In BaseResultIterators when you decide whether or not you can use the
distinct filter, you can just check if context.getAggregateManager().isEmpty()
as this means there are no aggregate functions in use. No need to loop through
all the expressions. We need to go through every row if we're calculating some
aggregate across the rows, but otherwise don't (and as you've discovered, a
DISTINCT ends up as a GROUP BY.
- Also, I think it'd be better if OrderPreservingTracker remembered how many
columns where traversed over rather than the check you have for {{cols <
plan.getTableRef().getTable().getRowKeySchema().getFieldCount()}}. You can do
that easily by adding a new orderPreservingColumnCount member variable to
OrderPreservingTracker and set it at the bottom of the isOrderPreserving() to
prevPos + prevSlotSpan. Then you can carry it through to the GroupBy object in
GroupBy.compile().
- Add an IT or other test for a group by of a RVC, as that'll demonstrate the
need for the above. For example:
{code}
SELECT (pk1,pk2) FROM t GROUP BY (pk1,pk2);
{code}
That would have a single expression, but still be optimizable.
- Minor nit, by why DistinctFixedPrefixFilter as the filter name instead of
DistinctPrefixFilter? Why the word "Fixed" there?
- For the DistinctFixedPrefixFilter.getNextCellHint() method, you only need to
call ByteUtil.nextKey() in the fixed length case, as adding the zero byte is
essentially getting the next key.
- For the DistinctFixedPrefixFilter.getNextCellHint() method, I believe you'll
need to tweak the logic if the scan is a reverse scan. I think instead of
calling nextKey, you'd want to call previousKey, and instead of adding a zero
byte, you need to BaseScannerRegionObserver.getReversedRow() - move it to
ByteUtil instead. I've noticed HBase has the same hacky code in
ClientScanner.createClosestRowBefore() - I wish HBase would've handled this
different for reverse scans as it's really not 100% reliable. meh. :-). Anyway,
to detect on the client-side so you can push through the filter if it's a
reverse scan, use the ScanUtil.isReversed() method, as we don't set it on the
Scan, but do it ourselves in our coprocessor for a variety of reasons. An IT or
lower level test would be good for this.
- So when there's a WHERE clause, everything works out - HBase is pretty smart
about that. What about the case where say the first row wouldn't match our
boolean filter for the where clause, but the second one would? Would the
distinct filter jump past it? For example:
{code}
SELECT DISTINCT p1, p2 FROM t WHERE p3 = 'foo'
{code}
With rows of p3='bar' followed by p3='foo' and both p1 and p2 being the same
for those rows (and all three being pk columns). Would be good to have a test
like this - how do you know when you can jump and when you can't as seems like
it'd rely on what the other filters do.
> Use skip scan when SELECT DISTINCT on leading row key column(s)
> ---------------------------------------------------------------
>
> Key: PHOENIX-258
> URL: https://issues.apache.org/jira/browse/PHOENIX-258
> Project: Phoenix
> Issue Type: Task
> Reporter: ryang-sfdc
> Assignee: Lars Hofhansl
> Fix For: 4.8.0
>
> Attachments: 258-WIP.txt, 258-v1.txt, 258-v2.txt, 258-v3.txt,
> 258-v4.txt, 258-v5.txt, 258-v6.txt, 258.txt, DistinctFixedPrefixFilter.java,
> in-clause.png
>
>
> create table(a varchar(32) not null, date date not null constraint pk primary
> key(a,date))
> [["PLAN"],["CLIENT PARALLEL 94-WAY FULL SCAN OVER foo"],[" SERVER
> AGGREGATE INTO ORDERED DISTINCT ROWS BY [a]"],["CLIENT MERGE SORT"]]
>
> We should skip scan.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)