[ 
https://issues.apache.org/jira/browse/IGNITE-9141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Ozerov updated IGNITE-9141:
------------------------------------
    Description: 
One of mandatory steps of SQL query execution is topology mapping - we need to 
select nodes where required caches are located, and make sure that their 
partition distribution is valid for the given SQL query. Once nodes are 
detected, we try to reserve partitions of interest on mapper nodes to make sure 
that they will not be evicted during query execution. 

However, mapping step may fail for many reasons. Most often this is rebalance 
or concurrent node failures. In this case we simply retry the whole query 
execution from scratch. In IGNITE-9114 we ensured that retry cycle is not 
infinite and that root cause of remap is logged. However, original root cause 
of remap is not propagated to client node making the problem hard to debug for 
end users. Also we do not have enough tests for remap events. Let's fix this.

Proposed implementation flow:
1) Add {{retryCause: String}} field to {{GridQueryNextPageResponse}} which 
should be populated along with {{retry}} field on mapper node. See 
{{GridMapQueryExecutor#sendRetry}} method to understand what may cause retries 
(failed to reserve partitions or failed to execute non-collocated join). Make 
sure that these error messages are as verbose as possible with all necessary 
details (root cause, cache names, affected partitions, etc).
2) Make sure that root cause is set in {{ReduceQueryRun#state}} and then 
propagated to user exception in case of retry timeout.
3) Evaluate all places inside 
{{org.apache.ignite.internal.processors.query.h2.twostep.GridReduceQueryExecutor#query}}
 which may lead to re-try and make sure that root cause is verbose and 
propagated to user exception in case of retry timeout. 
4) Add tests covering all re-try branches and ensure that query fails after 
timeout and that error message is correct.

  was:
One of mandatory steps of SQL query execution is topology mapping - we need to 
select nodes where required caches are located, and make sure that their 
partition distribution is valid for the given SQL query. Once nodes are 
detected, we try to reserve partitions of interest on mapper nodes to make sure 
that they will not be evicted during query execution. 

However, mapping step may fail for many reasons. Most often this is rebalance 
or concurrent node failures. In this case we simply retry the whole query 
execution from scratch. In IGNITE-9114 we ensured that retry cycle is not 
infinite and that root cause of remap is logged. However, original root cause 
of remap is not propagated to client node making the problem hard to debug for 
end users. Also we do not have enough tests for remap events. Let's fix this.


> SQL: Trace and test query mapping problems
> ------------------------------------------
>
>                 Key: IGNITE-9141
>                 URL: https://issues.apache.org/jira/browse/IGNITE-9141
>             Project: Ignite
>          Issue Type: Task
>          Components: sql
>    Affects Versions: 2.6
>            Reporter: Vladimir Ozerov
>            Priority: Major
>             Fix For: 2.7
>
>
> One of mandatory steps of SQL query execution is topology mapping - we need 
> to select nodes where required caches are located, and make sure that their 
> partition distribution is valid for the given SQL query. Once nodes are 
> detected, we try to reserve partitions of interest on mapper nodes to make 
> sure that they will not be evicted during query execution. 
> However, mapping step may fail for many reasons. Most often this is rebalance 
> or concurrent node failures. In this case we simply retry the whole query 
> execution from scratch. In IGNITE-9114 we ensured that retry cycle is not 
> infinite and that root cause of remap is logged. However, original root cause 
> of remap is not propagated to client node making the problem hard to debug 
> for end users. Also we do not have enough tests for remap events. Let's fix 
> this.
> Proposed implementation flow:
> 1) Add {{retryCause: String}} field to {{GridQueryNextPageResponse}} which 
> should be populated along with {{retry}} field on mapper node. See 
> {{GridMapQueryExecutor#sendRetry}} method to understand what may cause 
> retries (failed to reserve partitions or failed to execute non-collocated 
> join). Make sure that these error messages are as verbose as possible with 
> all necessary details (root cause, cache names, affected partitions, etc).
> 2) Make sure that root cause is set in {{ReduceQueryRun#state}} and then 
> propagated to user exception in case of retry timeout.
> 3) Evaluate all places inside 
> {{org.apache.ignite.internal.processors.query.h2.twostep.GridReduceQueryExecutor#query}}
>  which may lead to re-try and make sure that root cause is verbose and 
> propagated to user exception in case of retry timeout. 
> 4) Add tests covering all re-try branches and ensure that query fails after 
> timeout and that error message is correct.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to