[
https://issues.apache.org/jira/browse/IMPALA-10317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17230396#comment-17230396
]
Aman Sinha commented on IMPALA-10317:
-------------------------------------
I think MySQL's MAX_JOIN_SIZE is a reasonable way because it gets applied based
on planner estimates. Here's the relevant excerpt:
{noformat}
>From https://dev.mysql.com/doc/refman/8.0/en/mysql-tips.html#safe-updates:
Setting max_join_size to 1,000,000 causes multiple-table SELECT statements to
produce an error if the server estimates it must examine more than 1,000,000
row combinations.
{noformat}
Note that it says 'estimates'. Did you consider using the planner estimate ?
Understandably, it may underestimate substantially if the stats are not current
but if users accidentally submit (for example) cross-joins of large tables it
should is most cases prevent those cases. I think that if a join is in fact
hugely expanding, wouldn't we want to detect that sooner (in planning phase)
and avoid executing altogether instead of letting it run until it hits the
NUM_JOIN_ROWS_PRODUCED_LIMIT threshold ? How would a user pick a value for
this threshold considering that it depends on the size of the cluster, amount
of memory available etc.
These are more food for thought :) We can get some other opinions as well.
> Add query option that limits join #rows at runtime
> --------------------------------------------------
>
> Key: IMPALA-10317
> URL: https://issues.apache.org/jira/browse/IMPALA-10317
> Project: IMPALA
> Issue Type: New Feature
> Components: Backend
> Reporter: Fucun Chu
> Assignee: Fucun Chu
> Priority: Major
> Attachments: query82_summary.png
>
>
> Reject queries that rows produced too bigger by join operator when executing
> the query.
> This is a mechanism to protect the cluster from potentially harmful queries.
> When the cardinality of the table is very large and the join conditions are
> very bad, the number of rows produced by the join will be very large,
> sometimes tens of billions, which affects the cluster status and other
> running queries.
> In our environment, the NUM_JOIN_ROWS_PRODUCED_LIMIT query option is added to
> limit the number of rows produced by a single join operator.
> Implementation refers to
> [IMPALA-6034|https://issues.apache.org/jira/browse/IMPALA-6034] and summary
> (see the figure below), check the join operator #rows size
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]