[
https://issues.apache.org/jira/browse/IGNITE-9141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Vladimir Ozerov updated IGNITE-9141:
------------------------------------
Description:
One of the mandatory steps of SQL query execution is topology mapping: we need to
select the nodes where the required caches are located and make sure that their
partition distribution is valid for the given SQL query. Once the nodes are
detected, we try to reserve the partitions of interest on the mapper nodes to make
sure that they will not be evicted during query execution.
However, the mapping step may fail for many reasons, most often due to rebalancing
or concurrent node failures. In this case we simply retry the whole query
execution from scratch. In IGNITE-9114 we ensured that the retry cycle is not
infinite and that the root cause of a remap is logged. However, the original root
cause of the remap is not propagated to the client node, which makes the problem
hard to debug for end users. We also do not have enough tests for remap events.
Let's fix this. The current behaviour is outlined in the sketch below.
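A minimal, self-contained sketch of the reduce-side mapping/retry loop described above. All names in it ({{MappingRetrySketch}}, {{MapperReply}}) are illustrative stand-ins rather than the real Ignite internals; it only shows why, without a propagated cause, the client ends up with a generic error once the retry timeout expires.
{code:java}
public class MappingRetrySketch {
    /** Illustrative reply from a mapper node: today it carries only a retry flag. */
    static class MapperReply {
        final boolean retry;

        MapperReply(boolean retry) {
            this.retry = retry;
        }
    }

    /** Runs the mapping step, retrying from scratch while mappers ask for a remap. */
    static void runQuery(long retryTimeoutMs) throws Exception {
        long deadline = System.currentTimeMillis() + retryTimeoutMs;

        while (true) {
            MapperReply reply = mapAndReservePartitions();

            if (!reply.retry)
                return; // Partitions reserved, query execution can proceed.

            if (System.currentTimeMillis() > deadline) {
                // Without a propagated cause the user sees only this generic message.
                throw new Exception("Failed to map SQL query to topology (retry timeout).");
            }

            // Otherwise release any reservations and repeat the whole mapping step.
        }
    }

    /** Placeholder for the real "select nodes and reserve partitions" step. */
    static MapperReply mapAndReservePartitions() {
        return new MapperReply(true); // Pretend a rebalance forced a retry.
    }
}
{code}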
Proposed implementation flow:
1) Add a {{retryCause: String}} field to {{GridQueryNextPageResponse}} which
should be populated along with the {{retry}} field on the mapper node. See the
{{GridMapQueryExecutor#sendRetry}} method to understand what may cause retries
(failure to reserve partitions or to execute a non-collocated join). Make sure
that these error messages are as verbose as possible and carry all necessary
details (root cause, cache names, affected partitions, etc.). See the mapper-side
sketch after this list.
2) Make sure that the root cause is set in {{ReduceQueryRun#state}} and then
propagated to the user exception in case of a retry timeout (see the reducer-side
sketch after this list).
3) Evaluate all places inside
{{org.apache.ignite.internal.processors.query.h2.twostep.GridReduceQueryExecutor#query}}
which may lead to a retry and make sure that the root cause is verbose and
propagated to the user exception in case of a retry timeout.
4) Add tests covering all retry branches and ensure that the query fails after the
timeout and that the error message is correct.
*NB*: Once propagation of the error message to the reducer is implemented, we may
remove the additional logging altogether.
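A hedged sketch of step 1, assuming simplified stand-ins for {{GridQueryNextPageResponse}} and {{GridMapQueryExecutor#sendRetry}}; the field name and the message format below are assumptions for illustration, not the actual Ignite code.
{code:java}
import java.util.Arrays;

public class MapperRetrySketch {
    /** Minimal stand-in for the next-page response message. */
    static class NextPageResponse {
        boolean retry;     // Existing flag: "please remap and retry the query".
        String retryCause; // Proposed field: human-readable root cause of the retry.
    }

    /**
     * Mimics GridMapQueryExecutor#sendRetry: besides the retry flag, attach a
     * verbose cause (cache name, affected partitions, underlying error).
     */
    static NextPageResponse sendRetry(String cacheName, int[] parts, Throwable err) {
        NextPageResponse res = new NextPageResponse();

        res.retry = true;
        res.retryCause = "Failed to reserve partitions for query [cache=" + cacheName +
            ", partitions=" + Arrays.toString(parts) +
            ", err=" + (err == null ? "n/a" : err.getMessage()) + ']';

        return res;
    }
}
{code}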
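And a hedged sketch of steps 2 and 3, again with illustrative names (the real code would keep the cause in {{ReduceQueryRun}} and throw the query's own exception type): the reducer remembers the last reported cause and includes it in the exception raised once the retry timeout expires.
{code:java}
public class ReducerRetrySketch {
    /** Last retry cause reported by any mapper for this query run. */
    private volatile String lastRetryCause;

    /** Called when a mapper response with retry=true (and retryCause) arrives. */
    void onRetryRequested(String cause) {
        lastRetryCause = cause;
    }

    /** Called from the query loop once the retry deadline is exceeded. */
    void failOnRetryTimeout(long timeoutMs) {
        String cause = lastRetryCause == null ? "unknown" : lastRetryCause;

        // The user-facing exception now carries the mapper-side root cause,
        // which is what makes the extra mapper logging optional (see the NB above).
        throw new IllegalStateException("Failed to map SQL query to topology within timeout" +
            " [timeout=" + timeoutMs + "ms, lastRetryCause=" + cause + ']');
    }
}
{code}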
> SQL: Trace and test query mapping problems
> ------------------------------------------
>
> Key: IGNITE-9141
> URL: https://issues.apache.org/jira/browse/IGNITE-9141
> Project: Ignite
> Issue Type: Task
> Components: sql
> Affects Versions: 2.6
> Reporter: Vladimir Ozerov
> Priority: Major
> Fix For: 2.7
>
>
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)