[
https://issues.apache.org/jira/browse/LENS-743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15048596#comment-15048596
]
Puneet Gupta commented on LENS-743:
-----------------------------------
Good question. Adding to that:
I would want a quick retry in case the failure happens early on (what counts
as "early" is debatable). Say a person submitted a query, it was promoted to
run only after 10 hours, and then it failed within a few seconds or minutes
due to a transient failure.
But if a person submitted a query, it was promoted to run only after 10 hours,
and then it failed after 10 more hours, should we rerun it immediately? From
the user's perspective, yes. From the Lens perspective, I am not sure.
Another thing we need to consider is that if we do a quick retry, the cause of
the transient failure may still persist. Should we wait and then retry? How
long should we wait? Should all new queries also wait (because they will fail
anyway), say in case the failures are because the Hive server is clogged
(excessive GC, etc.)?
Another thought: maybe the first rerun should happen right away (after some
set wait time based on the type of error), and subsequent reruns can have
exponentially increasing wait times.
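
To make that last suggestion concrete, here is a rough sketch in Python (only
for illustration, not actual Lens code). The error type names, the initial
wait values, the multiplier, the retry limit and the cap are all made-up
assumptions; nothing here is an existing Lens setting or API.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical initial waits per error type, in seconds (assumed values).
INITIAL_WAIT_SECONDS = {
    "network": 5.0,
    "hive_server_overloaded": 60.0,
}
# Assumed fallback for error types not listed above.
DEFAULT_INITIAL_WAIT_SECONDS = 10.0


@dataclass(frozen=True)
class RetryPolicy:
    """First retry waits a fixed time chosen by error type; later retries back off."""

    max_retries: int = 4              # assumed limit on retries per query
    multiplier: float = 2.0           # assumed growth factor between retries
    max_wait_seconds: float = 3600.0  # assumed cap on any single wait

    def wait_before_retry(self, error_type: str, retry_number: int) -> Optional[float]:
        """Return the wait in seconds before retry `retry_number` (1-based).

        Returns None once the retry limit is exceeded, meaning the query
        should be marked as failed.
        """
        if retry_number < 1:
            raise ValueError("retry_number is 1-based")
        if retry_number > self.max_retries:
            return None
        initial = INITIAL_WAIT_SECONDS.get(error_type, DEFAULT_INITIAL_WAIT_SECONDS)
        # Retry 1 uses the initial wait; each later retry multiplies it again.
        wait = initial * self.multiplier ** (retry_number - 1)
        return min(wait, self.max_wait_seconds)
```

With these assumed numbers, a "network" failure would be retried after 5, 10,
20 and 40 seconds and then given up on. This does not answer the open
questions above (what counts as "early", or whether new queries should also
be held back); it only shows the wait schedule.
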
> Query failure retries for transient errors
> ------------------------------------------
>
> Key: LENS-743
> URL: https://issues.apache.org/jira/browse/LENS-743
> Project: Apache Lens
> Issue Type: Improvement
> Components: server
> Reporter: Amareshwari Sriramadasu
> Assignee: Rajat Khandelwal
>
> There have to be retries for query failures for transient errors like network
> errors (Hive server not reachable/ Metastore not reachable/ DB not
> reachable). Retries should be available for each phase - submission,
> execution, updating status, fetching results and formatting.
> Right now, any such failure results in marking query as failed.