[
https://issues.apache.org/jira/browse/IGNITE-9141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Vladimir Ozerov updated IGNITE-9141:
------------------------------------
Ignite Flags: (was: Docs Required)
> SQL: Trace and test query mapping problems
> ------------------------------------------
>
> Key: IGNITE-9141
> URL: https://issues.apache.org/jira/browse/IGNITE-9141
> Project: Ignite
> Issue Type: Task
> Components: sql
> Affects Versions: 2.6
> Reporter: Vladimir Ozerov
> Assignee: Sergey Grimstad
> Priority: Major
> Labels: sql-stability
> Fix For: 2.7
>
> Attachments: IGNITE-9141__Implemented_.patch
>
>
> One of the mandatory steps of SQL query execution is topology mapping: we need
> to select the nodes where the required caches are located and make sure that
> their partition distribution is valid for the given SQL query. Once the nodes
> are selected, we try to reserve the partitions of interest on the mapper nodes
> to make sure that they are not evicted during query execution.
> However, the mapping step may fail for many reasons, most often because of
> rebalance or concurrent node failures. In this case we simply retry the whole
> query execution from scratch. In IGNITE-9114 we ensured that the retry cycle is
> not infinite and that the root cause of the remap is logged. However, the
> original root cause of the remap is not propagated to the client node, which
> makes the problem hard to debug for end users. We also do not have enough tests
> for remap events. Let's fix this.
> Proposed implementation flow:
> 1) Add a {{retryCause: String}} field to {{GridQueryNextPageResponse}}, which
> should be populated along with the {{retry}} field on the mapper node. See the
> {{GridMapQueryExecutor#sendRetry}} method to understand what may cause
> retries (failure to reserve partitions or failure to execute a non-collocated
> join). Make sure that these error messages are as verbose as possible and
> include all necessary details (root cause, cache names, affected partitions, etc.).
> 2) Make sure that the root cause is set in {{ReduceQueryRun#state}} and then
> propagated to the user exception in case of a retry timeout.
> 3) Evaluate all places inside
> {{org.apache.ignite.internal.processors.query.h2.twostep.GridReduceQueryExecutor#query}}
> which may lead to a retry and make sure that the root cause is verbose and
> propagated to the user exception in case of a retry timeout.
> 4) Add tests covering all retry branches and ensure that the query fails after
> the timeout and that the error message is correct.
> *NB*: Once propagation of the error message to the reducer is implemented, we
> may remove the additional logging altogether.
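
The proposed flow above could be sketched roughly as follows. This is only an illustration under stated assumptions, not Ignite code: {{PageResponse}}, {{QueryRetryException}} and {{runWithRetries}} are hypothetical stand-ins for {{GridQueryNextPageResponse}}, the user-facing exception and the reducer's retry loop, and the timeout and backoff parameters are invented for the example. It shows a mapper response carrying a retry cause and the reducer attaching the last cause to the exception once the retry timeout expires.

```java
import java.util.function.Supplier;

public class RetryCauseSketch {
    /** Hypothetical stand-in for a mapper response; NOT the real GridQueryNextPageResponse. */
    public static final class PageResponse {
        public final boolean retry;
        public final String retryCause;

        private PageResponse(boolean retry, String retryCause) {
            this.retry = retry;
            this.retryCause = retryCause;
        }

        public static PageResponse ok() {
            return new PageResponse(false, null);
        }

        /** The cause is populated together with the retry flag, as in step 1. */
        public static PageResponse retry(String cause) {
            return new PageResponse(true, cause);
        }
    }

    /** Hypothetical exception surfaced to the user when retries time out. */
    public static final class QueryRetryException extends RuntimeException {
        public QueryRetryException(String msg) {
            super(msg);
        }
    }

    /**
     * Re-runs the (simulated) mapping step until it succeeds or the timeout expires.
     * Returns the number of attempts on success; on timeout throws an exception
     * whose message includes the last retry cause reported by the mapper.
     */
    public static int runWithRetries(Supplier<PageResponse> mapper, long timeoutMs, long backoffMs) {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        String lastCause = null;
        int attempts = 0;

        do {
            attempts++;
            PageResponse res = mapper.get();

            if (!res.retry)
                return attempts;

            // Remember the root cause so that it is not lost between attempts (step 2).
            lastCause = res.retryCause;

            try {
                Thread.sleep(backoffMs);
            }
            catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new QueryRetryException("Query retry was interrupted [lastRetryCause=" + lastCause + ']');
            }
        }
        while (System.nanoTime() - deadline < 0);

        throw new QueryRetryException("Failed to map SQL query to topology within " + timeoutMs +
            " ms [attempts=" + attempts + ", lastRetryCause=" + lastCause + ']');
    }

    public static void main(String[] args) {
        try {
            // A mapper that always asks for a retry, e.g. because a partition cannot be reserved.
            runWithRetries(
                () -> PageResponse.retry("Failed to reserve partitions [cache=person, part=12]"),
                50, 10);
        }
        catch (QueryRetryException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

A test for step 4 could follow the same shape: force each retry branch, wait for the timeout, and assert that the exception message contains the expected cause.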