[ 
https://issues.apache.org/jira/browse/IMPALA-10317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17230396#comment-17230396
 ] 

Aman Sinha commented on IMPALA-10317:
-------------------------------------

I think MySQL's MAX_JOIN_SIZE is a reasonable way because it gets applied based 
on planner estimates. Here's the relevant excerpt:
{noformat}
>From https://dev.mysql.com/doc/refman/8.0/en/mysql-tips.html#safe-updates: 
Setting max_join_size to 1,000,000 causes multiple-table SELECT statements to 
produce an error if the server estimates it must examine more than 1,000,000 
row combinations.
{noformat}
Note that it says 'estimates'.   Did you consider using the planner estimate ? 
Understandably, it may underestimate substantially if the stats are not current 
but if users accidentally submit (for example) cross-joins of large tables it 
should is most cases prevent those cases.  I think that if a join is in fact 
hugely expanding, wouldn't we want to detect that sooner (in planning phase) 
and avoid executing altogether instead of letting it run until it hits the 
NUM_JOIN_ROWS_PRODUCED_LIMIT threshold ?  How would a user pick a value for 
this threshold considering that it depends on the size of the cluster, amount 
of memory available etc. 
These are more food for thought :) We can get some other opinions as well.



> Add query option that limits join #rows at runtime
> --------------------------------------------------
>
>                 Key: IMPALA-10317
>                 URL: https://issues.apache.org/jira/browse/IMPALA-10317
>             Project: IMPALA
>          Issue Type: New Feature
>          Components: Backend
>            Reporter: Fucun Chu
>            Assignee: Fucun Chu
>            Priority: Major
>         Attachments: query82_summary.png
>
>
> Reject queries that rows produced too bigger by join operator when executing 
> the query.
> This is a mechanism to protect the cluster from potentially harmful queries.
> When the cardinality of the table is very large and the join conditions are 
> very bad, the number of rows produced by the join will be very large, 
> sometimes tens of billions, which affects the cluster status and other 
> running queries.
> In our environment, the NUM_JOIN_ROWS_PRODUCED_LIMIT query option is added to 
> limit the number of rows produced by a single join operator.
> Implementation refers to 
> [IMPALA-6034|https://issues.apache.org/jira/browse/IMPALA-6034] and summary 
> (see the figure below), check the join operator #rows size



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to