[jira] [Updated] (OOZIE-3260) [sla] Remove stale item above max retries on JPA related errors from in-memory SLA map

2018-06-06 Thread Andras Piros (JIRA)


 [ 
https://issues.apache.org/jira/browse/OOZIE-3260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andras Piros updated OOZIE-3260:

Attachment: OOZIE-3260.002.patch

> [sla] Remove stale item above max retries on JPA related errors from 
> in-memory SLA map
> --
>
> Key: OOZIE-3260
> URL: https://issues.apache.org/jira/browse/OOZIE-3260
> Project: Oozie
>  Issue Type: Bug
>  Components: coordinator, core, workflow
>Affects Versions: 5.0.0
>Reporter: Andras Piros
>Assignee: Andras Piros
>Priority: Major
> Attachments: OOZIE-3260.001.patch, OOZIE-3260.002.patch
>
>
> Despite having implemented OOZIE-3134, there are still cases where 
> {{SLACalculatorMemory#slaMap}} and database contents still get out of sync. 
> Some possibilities including but not limited to:
>  * database contents of {{SLA_SUMMARY}} table have been purged manually from 
> DB
>  * no corresponding {{WF_JOBS}} or {{COORD_JOBS}} entries exist anymore in DB
>  * the {{WF_JOBS}} or {{COORD_JOBS}} instance that is being tracked by the 
> {{SLACalcStatus}} instances inside {{SLACalculatorMemory#slaMap}} is not yet 
> persisted to database when the SLA entry is already processed by 
> {{SLACalculatorMemory.HistoryPurgeWorker}}. Depending on e.g. how many 
> coordinator actions are being materialized, it can very well happen that 
> {{SLACalcStatus}} entries inserted to the in-memory map will be processed 
> before their corresponding {{CoordActionBean}} entries are yet to be 
> persisted to database
> In those rare cases, we see {{JPAExecutorException}} instances like:
> {noformat}
> 2017-10-09 17:00:18,185 DEBUG openjpa.jdbc.SQL: SERVER[HOST]  conn 1584126245> [0 ms] spent
> 2017-10-09 17:00:18,185 ERROR org.apache.oozie.sla.SLACalculatorMemory: 
> SERVER[tplhc01c001.iuser.iroot.adidom.com] USER[-] GROUP[-] TOKEN[-] APP[-] 
> JOB[438-170916014916144-oozie-oozi-C@556] ACTION[-] Exception in SLA 
> processing for job [438-170916014916144-oozie-oozi-C@556]
> org.apache.oozie.executor.jpa.JPAExecutorException: E0604: Job does not exist 
> [select w.eventProcessed from SLASummaryBean w where w.jobId = :id]
> at 
> org.apache.oozie.executor.jpa.SLASummaryQueryExecutor.getSingleValue(SLASummaryQueryExecutor.java:161)
> at 
> org.apache.oozie.sla.SLACalculatorMemory.updateJobSla(SLACalculatorMemory.java:480)
> at 
> org.apache.oozie.sla.SLACalculatorMemory.updateAllSlaStatus(SLACalculatorMemory.java:601)
> {noformat}
> Solution here is to track the number of times the {{SLACalcStatus}} entry has 
> not been processed successfully, and when a preconfigured 
> {{oozie.sla.service.SLAService.maximum.retry.count}} is reached, remove any 
> {{SLACalculatorMemory#slaMap}} entries that are causing those 
> {{JPAExecutorException}} instances, to not cause huge logfiles. The items to 
> be logged don't exist, anyways.
> It's still possible that multiple {{CoordActionBean}} instances being 
> inserted won't have {{SLACalcStatus}} entries inside 
> {{SLACalculatorMemory#slaMap}} by the time written to database, and thus, no 
> SLA will be tracked. In those rare cases, preconfigured maximum retry count 
> can be extended.
> Note that current implementation of 
> [*{{SLACalculatorMemory#updateJobSla()}}*|https://github.com/apache/oozie/blob/master/core/src/main/java/org/apache/oozie/sla/SLACalculatorMemory.java#L238-L242]
>  already removes the stale {{SLACalcStatus}} entry. The new functionality 
> here is to introduce {{SLACalcStatus#retryCount}}, and extend the 
> {{JPAExecutorException}} {{ErrorCode}}s of interest.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (OOZIE-3260) [sla] Remove stale item above max retries on JPA related errors from in-memory SLA map

2018-05-31 Thread Andras Piros (JIRA)


 [ 
https://issues.apache.org/jira/browse/OOZIE-3260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andras Piros updated OOZIE-3260:

Attachment: OOZIE-3260.001.patch

> [sla] Remove stale item above max retries on JPA related errors from 
> in-memory SLA map
> --
>
> Key: OOZIE-3260
> URL: https://issues.apache.org/jira/browse/OOZIE-3260
> Project: Oozie
>  Issue Type: Bug
>  Components: coordinator, core, workflow
>Affects Versions: 5.0.0
>Reporter: Andras Piros
>Assignee: Andras Piros
>Priority: Major
> Attachments: OOZIE-3260.001.patch
>
>
> Despite having implemented OOZIE-3134, there are still cases where 
> {{SLACalculatorMemory#slaMap}} and database contents still get out of sync. 
> Some possibilities including but not limited to:
>  * database contents of {{SLA_SUMMARY}} table have been purged manually from 
> DB
>  * no corresponding {{WF_JOBS}} or {{COORD_JOBS}} entries exist anymore in DB
>  * the {{WF_JOBS}} or {{COORD_JOBS}} instance that is being tracked by the 
> {{SLACalcStatus}} instances inside {{SLACalculatorMemory#slaMap}} is not yet 
> persisted to database when the SLA entry is already processed by 
> {{SLACalculatorMemory.HistoryPurgeWorker}}. Depending on e.g. how many 
> coordinator actions are being materialized, it can very well happen that 
> {{SLACalcStatus}} entries inserted to the in-memory map will be processed 
> before their corresponding {{CoordActionBean}} entries are yet to be 
> persisted to database
> In those rare cases, we see {{JPAExecutorException}} instances like:
> {noformat}
> 2017-10-09 17:00:18,185 DEBUG openjpa.jdbc.SQL: SERVER[HOST]  conn 1584126245> [0 ms] spent
> 2017-10-09 17:00:18,185 ERROR org.apache.oozie.sla.SLACalculatorMemory: 
> SERVER[tplhc01c001.iuser.iroot.adidom.com] USER[-] GROUP[-] TOKEN[-] APP[-] 
> JOB[438-170916014916144-oozie-oozi-C@556] ACTION[-] Exception in SLA 
> processing for job [438-170916014916144-oozie-oozi-C@556]
> org.apache.oozie.executor.jpa.JPAExecutorException: E0604: Job does not exist 
> [select w.eventProcessed from SLASummaryBean w where w.jobId = :id]
> at 
> org.apache.oozie.executor.jpa.SLASummaryQueryExecutor.getSingleValue(SLASummaryQueryExecutor.java:161)
> at 
> org.apache.oozie.sla.SLACalculatorMemory.updateJobSla(SLACalculatorMemory.java:480)
> at 
> org.apache.oozie.sla.SLACalculatorMemory.updateAllSlaStatus(SLACalculatorMemory.java:601)
> {noformat}
> Solution here is to track the number of times the {{SLACalcStatus}} entry has 
> not been processed successfully, and when a preconfigured 
> {{oozie.sla.service.SLAService.maximum.retry.count}} is reached, remove any 
> {{SLACalculatorMemory#slaMap}} entries that are causing those 
> {{JPAExecutorException}} instances, to not cause huge logfiles. The items to 
> be logged don't exist, anyways.
> It's still possible that multiple {{CoordActionBean}} instances being 
> inserted won't have {{SLACalcStatus}} entries inside 
> {{SLACalculatorMemory#slaMap}} by the time written to database, and thus, no 
> SLA will be tracked. In those rare cases, preconfigured maximum retry count 
> can be extended.
> Note that current implementation of 
> [*{{SLACalculatorMemory#updateJobSla()}}*|https://github.com/apache/oozie/blob/master/core/src/main/java/org/apache/oozie/sla/SLACalculatorMemory.java#L238-L242]
>  already removes the stale {{SLACalcStatus}} entry. The new functionality 
> here is to introduce {{SLACalcStatus#retryCount}}, and extend the 
> {{JPAExecutorException}} {{ErrorCode}}s of interest.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (OOZIE-3260) [sla] Remove stale item above max retries on JPA related errors from in-memory SLA map

2018-05-31 Thread Andras Piros (JIRA)


 [ 
https://issues.apache.org/jira/browse/OOZIE-3260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andras Piros updated OOZIE-3260:

Description: 
Despite having implemented OOZIE-3134, there are still cases where 
{{SLACalculatorMemory#slaMap}} and database contents still get out of sync. 
Some possibilities including but not limited to:
 * database contents of {{SLA_SUMMARY}} table have been purged manually from DB
 * no corresponding {{WF_JOBS}} or {{COORD_JOBS}} entries exist anymore in DB
 * the {{WF_JOBS}} or {{COORD_JOBS}} instance that is being tracked by the 
{{SLACalcStatus}} instances inside {{SLACalculatorMemory#slaMap}} is not yet 
persisted to database when the SLA entry is already processed by 
{{SLACalculatorMemory.HistoryPurgeWorker}}. Depending on e.g. how many 
coordinator actions are being materialized, it can very well happen that 
{{SLACalcStatus}} entries inserted to the in-memory map will be processed 
before their corresponding {{CoordActionBean}} entries are yet to be persisted 
to database

In those rare cases, we see {{JPAExecutorException}} instances like:
{noformat}
2017-10-09 17:00:18,185 DEBUG openjpa.jdbc.SQL: SERVER[HOST]  [0 ms] spent
2017-10-09 17:00:18,185 ERROR org.apache.oozie.sla.SLACalculatorMemory: 
SERVER[tplhc01c001.iuser.iroot.adidom.com] USER[-] GROUP[-] TOKEN[-] APP[-] 
JOB[438-170916014916144-oozie-oozi-C@556] ACTION[-] Exception in SLA 
processing for job [438-170916014916144-oozie-oozi-C@556]
org.apache.oozie.executor.jpa.JPAExecutorException: E0604: Job does not exist 
[select w.eventProcessed from SLASummaryBean w where w.jobId = :id]
at 
org.apache.oozie.executor.jpa.SLASummaryQueryExecutor.getSingleValue(SLASummaryQueryExecutor.java:161)
at 
org.apache.oozie.sla.SLACalculatorMemory.updateJobSla(SLACalculatorMemory.java:480)
at 
org.apache.oozie.sla.SLACalculatorMemory.updateAllSlaStatus(SLACalculatorMemory.java:601)
{noformat}
Solution here is to track the number of times the {{SLACalcStatus}} entry has 
not been processed successfully, and when a preconfigured 
{{oozie.sla.service.SLAService.maximum.retry.count}} is reached, remove any 
{{SLACalculatorMemory#slaMap}} entries that are causing those 
{{JPAExecutorException}} instances, to not cause huge logfiles. The items to be 
logged don't exist, anyways.

It's still possible that multiple {{CoordActionBean}} instances being inserted 
won't have {{SLACalcStatus}} entries inside {{SLACalculatorMemory#slaMap}} by 
the time written to database, and thus, no SLA will be tracked. In those rare 
cases, preconfigured maximum retry count can be extended.

Note that current implementation of 
[*{{SLACalculatorMemory#updateJobSla()}}*|https://github.com/apache/oozie/blob/master/core/src/main/java/org/apache/oozie/sla/SLACalculatorMemory.java#L238-L242]
 already removes the stale {{SLACalcStatus}} entry. The new functionality here 
is to introduce {{SLACalcStatus#retryCount}}, and extend the 
{{JPAExecutorException}} {{ErrorCode}}s of interest.

  was:
Despite having implemented OOZIE-3134, there are still cases where 
{{SLACalculatorMemory#slaMap}} and database contents still get out of sync. 
Some possibilities including but not limited to:
 * database contents of {{SLA_SUMMARY}} table have been purged manually from DB
 * no corresponding {{WF_JOBS}} or {{COORD_JOBS}} entries exist anymore in DB
 * the {{WF_JOBS}} or {{COORD_JOBS}} instance that is being tracked by the 
{{SLACalcStatus}} instances inside {{SLACalculatorMemory#slaMap}} is not yet 
persisted to database when the SLA entry is already processed by 
{{SLACalculatorMemory.HistoryPurgeWorker}}. Depending on e.g. how many 
coordinator actions are being materialized, it can very well happen that 
{{SLACalcStatus}} entries inserted to the in-memory map will be processed 
before their corresponding {{CoordActionBean}} entries are yet to be persisted 
to database

In those rare cases, we see {{JPAExecutorException}} instances like:
{noformat}
2017-10-09 17:00:18,185 DEBUG openjpa.jdbc.SQL: SERVER[HOST]  [0 ms] spent
2017-10-09 17:00:18,185 ERROR org.apache.oozie.sla.SLACalculatorMemory: 
SERVER[tplhc01c001.iuser.iroot.adidom.com] USER[-] GROUP[-] TOKEN[-] APP[-] 
JOB[438-170916014916144-oozie-oozi-C@556] ACTION[-] Exception in SLA 
processing for job [438-170916014916144-oozie-oozi-C@556]
org.apache.oozie.executor.jpa.JPAExecutorException: E0604: Job does not exist 
[select w.eventProcessed from SLASummaryBean w where w.jobId = :id]
at 
org.apache.oozie.executor.jpa.SLASummaryQueryExecutor.getSingleValue(SLASummaryQueryExecutor.java:161)
at 
org.apache.oozie.sla.SLACalculatorMemory.updateJobSla(SLACalculatorMemory.java:480)
at 
org.apache.oozie.sla.SLACalculatorMemory.updateAllSlaStatus(SLACalculatorMemory.java:601)
{noformat}
Solution here is to track the number of times the {{SLACalcStatus}} entry has 
not been processed 

[jira] [Updated] (OOZIE-3260) [sla] Remove stale item above max retries on JPA related errors from in-memory SLA map

2018-05-31 Thread Andras Piros (JIRA)


 [ 
https://issues.apache.org/jira/browse/OOZIE-3260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andras Piros updated OOZIE-3260:

Summary: [sla] Remove stale item above max retries on JPA related errors 
from in-memory SLA map  (was: [sla] Remove stale item on JPA related errors 
from in-memory SLA map)

> [sla] Remove stale item above max retries on JPA related errors from 
> in-memory SLA map
> --
>
> Key: OOZIE-3260
> URL: https://issues.apache.org/jira/browse/OOZIE-3260
> Project: Oozie
>  Issue Type: Bug
>  Components: coordinator, core, workflow
>Affects Versions: 5.0.0
>Reporter: Andras Piros
>Assignee: Andras Piros
>Priority: Major
>
> Despite having implemented OOZIE-3134, there are still cases where 
> {{SLACalculatorMemory#slaMap}} and database contents still get out of sync. 
> Some possibilities including but not limited to:
>  * database contents of {{SLA_SUMMARY}} table have been purged manually from 
> DB
>  * no corresponding {{WF_JOBS}} or {{COORD_JOBS}} entries exist anymore in DB
>  * the {{WF_JOBS}} or {{COORD_JOBS}} instance that is being tracked by the 
> {{SLACalcStatus}} instances inside {{SLACalculatorMemory#slaMap}} is not yet 
> persisted to database when the SLA entry is already processed by 
> {{SLACalculatorMemory.HistoryPurgeWorker}}. Depending on e.g. how many 
> coordinator actions are being materialized, it can very well happen that 
> {{SLACalcStatus}} entries inserted to the in-memory map will be processed 
> before their corresponding {{CoordActionBean}} entries are yet to be 
> persisted to database
> In those rare cases, we see {{JPAExecutorException}} instances like:
> {noformat}
> 2017-10-09 17:00:18,185 DEBUG openjpa.jdbc.SQL: SERVER[HOST]  conn 1584126245> [0 ms] spent
> 2017-10-09 17:00:18,185 ERROR org.apache.oozie.sla.SLACalculatorMemory: 
> SERVER[tplhc01c001.iuser.iroot.adidom.com] USER[-] GROUP[-] TOKEN[-] APP[-] 
> JOB[438-170916014916144-oozie-oozi-C@556] ACTION[-] Exception in SLA 
> processing for job [438-170916014916144-oozie-oozi-C@556]
> org.apache.oozie.executor.jpa.JPAExecutorException: E0604: Job does not exist 
> [select w.eventProcessed from SLASummaryBean w where w.jobId = :id]
> at 
> org.apache.oozie.executor.jpa.SLASummaryQueryExecutor.getSingleValue(SLASummaryQueryExecutor.java:161)
> at 
> org.apache.oozie.sla.SLACalculatorMemory.updateJobSla(SLACalculatorMemory.java:480)
> at 
> org.apache.oozie.sla.SLACalculatorMemory.updateAllSlaStatus(SLACalculatorMemory.java:601)
> {noformat}
> Solution here is to track the number of times the {{SLACalcStatus}} entry has 
> not been processed successfully, and when a preconfigured 
> {{oozie.sla.service.SLAService.maximum.retry.count}} is reached, remove any 
> {{SLACalculatorMemory#slaMap}} entries that are causing those 
> {{JPAExecutorException}} instances, to not cause huge logfiles. The items to 
> be logged don't exist, anyways.
> It's still possible that multiple {{CoordActionBean}} instances being 
> inserted won't have {{SLACalcStatus}} entries inside 
> {{SLACalculatorMemory#slaMap}} by the time written to database, and thus, no 
> SLA will be tracked. In those rare cases, preconfigured maximum retry count 
> can be extended.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)