[ 
https://issues.apache.org/jira/browse/PHOENIX-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16359048#comment-16359048
 ] 

Maryann Xue commented on PHOENIX-1556:
--------------------------------------

{quote}Should UNION_DISTINCT_FACTOR be 1.0 since we only support UNION ALL 
currently?
{quote}
Since we only support UNION ALL, this block never takes effect, which means 
the UNION ALL row count is just the sum of its children's row counts.
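
A minimal sketch of the row-count rule described above. The helper name {{unionRowCount}} and the value given to {{UNION_DISTINCT_FACTOR}} are assumptions made for illustration; neither is taken from the patch.

```java
import java.util.Arrays;
import java.util.List;

public class UnionRowCountSketch {
    // Illustrative value only; the constant in the actual patch may differ.
    static final double UNION_DISTINCT_FACTOR = 0.8;

    // Hypothetical helper: estimates the output row count of a UNION
    // from the estimated row counts of its children.
    static double unionRowCount(List<Double> childRowCounts, boolean all) {
        double sum = 0.0;
        for (double count : childRowCounts) {
            sum += count;
        }
        // The distinct branch is never taken while only UNION ALL is supported.
        return all ? sum : sum * UNION_DISTINCT_FACTOR;
    }

    public static void main(String[] args) {
        // UNION ALL over children estimated at 100 and 250 rows.
        System.out.println(unionRowCount(Arrays.asList(100.0, 250.0), true));
    }
}
```

With {{all}} always true, the factor is dead code, so its value has no influence on plan selection today.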
{quote}What's the reasoning behind stripSkipScanFilter? Is that removed because 
its effect is already incorporated into the bytes scanned estimate?
{quote}
Yes. {{stripSkipScanFilter()}} also removes filters such as PageFilter, 
keeping only the boolean expression filters that cannot be pushed into the PK.
{quote}Should RowCountVisitor have a method for distinct? In particular, 
there's an optimization we have when doing a distinct on the leading PK columns 
which impacts cost. This optimization is not identified until runtime, so we 
might need to tweak the code so we know about it at compile time. This could be 
done in a separate patch.
{quote}
Thank you for pointing this out! I'll open another JIRA and dig into that.
{quote}Somewhat orthogonal to your pull (but maybe building on top of it), do 
you think it'd be possible to prevent a query from running that's "too 
expensive" (assuming "too expensive" would be identified by a config property)?
{quote}
Maybe. But users should be well aware that the costs are not accurate and do 
not correspond to any particular amount of time. The absolute value of a cost 
is less meaningful than the difference between the costs of alternative plans 
generated from the same query. Besides, a QueryPlan consists of a mix of 
operators, each weighted differently in cost evaluation, so it would be hard 
for users to figure out a proper configuration. A probably more realistic 
approach would be to set a configurable "limit" for specific operators. For 
example, we know that some queries time out during sorting if the dataset is 
too large, so when calculating the cost for order-by (or sometimes client-side 
order-by), we would know when that limit is exceeded. Another example is how 
we handle hash-joins right now: when the join is over the limit, we just say 
it's too expensive (represented by the "highest" cost).
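
The per-operator limit idea can be sketched roughly as follows. Everything here is hypothetical: the class, the method names, the cost formulas, and the use of {{Double.MAX_VALUE}} as the "highest" cost are illustrative assumptions, not Phoenix's actual cost model.

```java
public class OperatorLimitSketch {
    // Hypothetical stand-in for the "highest" cost.
    static final double HIGHEST_COST = Double.MAX_VALUE;

    // Hypothetical hash-join cost: proportional to the RHS bytes to cache,
    // or the highest cost when the estimate exceeds the configured limit.
    static double hashJoinCost(long rhsBytes, long maxCacheBytes) {
        return rhsBytes > maxCacheBytes ? HIGHEST_COST : (double) rhsBytes;
    }

    // Hypothetical sort-merge cost: both sides are read and sorted.
    // The factor of 2 is an arbitrary illustrative weight.
    static double sortMergeJoinCost(long lhsBytes, long rhsBytes) {
        return 2.0 * (lhsBytes + rhsBytes);
    }

    // Only the ordering between the alternatives matters, not the
    // absolute cost values.
    static String choosePlan(long lhsBytes, long rhsBytes, long maxCacheBytes) {
        double hash = hashJoinCost(rhsBytes, maxCacheBytes);
        double merge = sortMergeJoinCost(lhsBytes, rhsBytes);
        return hash <= merge ? "HASH_JOIN" : "SORT_MERGE_JOIN";
    }

    public static void main(String[] args) {
        System.out.println(choosePlan(1000, 500, 1000));   // RHS within limit
        System.out.println(choosePlan(1000, 5000, 1000));  // RHS over limit
    }
}
```

In this sketch the limit rules out an operator by making its cost lose every comparison, so users configure a limit per operator rather than a global cost threshold.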

> Base hash versus sort merge join decision on cost
> -------------------------------------------------
>
>                 Key: PHOENIX-1556
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-1556
>             Project: Phoenix
>          Issue Type: Sub-task
>            Reporter: James Taylor
>            Assignee: Maryann Xue
>            Priority: Major
>              Labels: CostBasedOptimization
>         Attachments: PHOENIX-1556.patch
>
>
> At compile time, we know how many guideposts (i.e. how many bytes) will be 
> scanned for the RHS table. We should, by default, base the decision of using 
> the hash-join versus many-to-many join on this information.
> Another criterion (as we've seen in PHOENIX-4508) is whether or not the tables 
> being joined are already ordered by the join key. In that case, it's better 
> to always use the sort merge join.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
