[jira] Subscription: Oozie Patch Available
Issue Subscription
Filter: Oozie Patch Available (100 issues)
Subscriber: ooziedaily

Key         Summary
OOZIE-3269  Flaky tests in TestCoordMaterializeTriggerService class
            https://issues.apache.org/jira/browse/OOZIE-3269
OOZIE-3265  properties RERUN_FAIL_NODES and RERUN_SKIP_NODES should be able to appear together
            https://issues.apache.org/jira/browse/OOZIE-3265
OOZIE-3256  refactor OozieCLI class
            https://issues.apache.org/jira/browse/OOZIE-3256
OOZIE-3249  [tools] Instrumentation log parser
            https://issues.apache.org/jira/browse/OOZIE-3249
OOZIE-3218  Oozie Sqoop action with command splits the select clause into multiple parts due to delimiter being space
            https://issues.apache.org/jira/browse/OOZIE-3218
OOZIE-3199  Let system property restriction configurable
            https://issues.apache.org/jira/browse/OOZIE-3199
OOZIE-3196  Authorization: restrict world readability by user
            https://issues.apache.org/jira/browse/OOZIE-3196
OOZIE-3194  Oozie should set proper permissions to sharelib after upload
            https://issues.apache.org/jira/browse/OOZIE-3194
OOZIE-3193  Applications are not killed when submitted via subworkflow
            https://issues.apache.org/jira/browse/OOZIE-3193
OOZIE-3186  Oozie is unable to use configuration linked using jceks://file/...
            https://issues.apache.org/jira/browse/OOZIE-3186
OOZIE-3179  Adding a configurable config-default.xml location to a workflow
            https://issues.apache.org/jira/browse/OOZIE-3179
OOZIE-3170  Oozie Diagnostic Bundle tool fails with NPE due to missing service class
            https://issues.apache.org/jira/browse/OOZIE-3170
OOZIE-3160  PriorityDelayQueue put()/take() can cause significant CPU load due to busy waiting
            https://issues.apache.org/jira/browse/OOZIE-3160
OOZIE-3156  SSH action status turns OK wrongly when failed to connect to host
            https://issues.apache.org/jira/browse/OOZIE-3156
OOZIE-3135  Configure log4j2 in SqoopMain
            https://issues.apache.org/jira/browse/OOZIE-3135
OOZIE-3109  Escape log-streaming's HTML-specific characters
            https://issues.apache.org/jira/browse/OOZIE-3109
OOZIE-3091  Oozie Sqoop Avro Import fails with "java.lang.NoClassDefFoundError: org/apache/avro/mapred/AvroWrapper"
            https://issues.apache.org/jira/browse/OOZIE-3091
OOZIE-3071  Oozie 4.3 Spark sharelib ueses a different version of commons-lang3 than Spark 2.2.0
            https://issues.apache.org/jira/browse/OOZIE-3071
OOZIE-3063  Sanitizing variables that are part of openjpa.ConnectionProperties
            https://issues.apache.org/jira/browse/OOZIE-3063
OOZIE-3062  Set HADOOP_CONF_DIR for spark action
            https://issues.apache.org/jira/browse/OOZIE-3062
OOZIE-3061  Kill only those child jobs which are not already killed
            https://issues.apache.org/jira/browse/OOZIE-3061
OOZIE-2956  Fix Findbugs warnings related to reliance on default encoding in oozie-core
            https://issues.apache.org/jira/browse/OOZIE-2956
OOZIE-2955  Fix Findbugs warnings related to reliance on default encoding in oozie-client
            https://issues.apache.org/jira/browse/OOZIE-2955
OOZIE-2954  Fix Checkstyle issues in oozie-client
            https://issues.apache.org/jira/browse/OOZIE-2954
OOZIE-2953  Fix Checkstyle issues in oozie-tools
            https://issues.apache.org/jira/browse/OOZIE-2953
OOZIE-2952  Fix Findbugs warnings in oozie-sharelib-oozie
            https://issues.apache.org/jira/browse/OOZIE-2952
OOZIE-2949  Escape quotes whitespaces in Sqoop field
            https://issues.apache.org/jira/browse/OOZIE-2949
OOZIE-2942  [examples] Fix Findbugs warnings
            https://issues.apache.org/jira/browse/OOZIE-2942
OOZIE-2927  Append new line character for Hive2 query using query tag
            https://issues.apache.org/jira/browse/OOZIE-2927
OOZIE-2877  Oozie Git Action
            https://issues.apache.org/jira/browse/OOZIE-2877
OOZIE-2834  ParameterVerifier logging non-useful warning for workflow definition
            https://issues.apache.org/jira/browse/OOZIE-2834
OOZIE-2833  when using uber mode the regex pattern used in the extractHeapSizeMB method does not allow heap sizes specified in bytes.
            https://issues.apache.org/jira/browse/OOZIE-2833
OOZIE-2829  Improve sharelib upload to accept multiple source folders
            https://issues.apache.org/jira/browse/OOZIE-2829
OOZIE-2812  SparkConfigurationService should support loading configurations from multiple Spark versions
            https://issues.apache.org/jira/browse/OOZIE-2812
OOZIE-2795  Create lib directory or symlink for Oozie CLI during packaging
            https://issues.apache.org/jira/browse/OOZIE-2795
OOZIE-2791  ShareLib installation may fail on busy Hadoop clusters
            https://issues.apache.org/jira/browse/OOZIE-2791
OOZIE-2784  Include WEEK as a parameter in the Coordinator Expression
[jira] [Commented] (OOZIE-3270) Upgrade Derby to 10.14.1.0
[ https://issues.apache.org/jira/browse/OOZIE-3270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16496431#comment-16496431 ]

Peter Cseh commented on OOZIE-3270:
---

It looks like we'll have to copy some code over from HIVE-18586

> Upgrade Derby to 10.14.1.0
> ---
>
> Key: OOZIE-3270
> URL: https://issues.apache.org/jira/browse/OOZIE-3270
> Project: Oozie
> Issue Type: Improvement
> Reporter: Peter Cseh
> Assignee: Peter Cseh
> Priority: Major
>
> We should upgrade Derby to 10.14.1.0

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
Success: OOZIE-3260 PreCommit Build #594
Jira: https://issues.apache.org/jira/browse/OOZIE-3260 Build: https://builds.apache.org/job/PreCommit-OOZIE-Build/594/ ### ## LAST 100 LINES OF THE CONSOLE ### [...truncated 1.71 MB...] [DEBUG] There are no new bugs found in [sharelib/pig]. [TRACE] New XMLLib present, calling 'xmllint --xpath' to get bug instance counts [DEBUG] There are no new bugs found in [sharelib/streaming]. [TRACE] New XMLLib present, calling 'xmllint --xpath' to get bug instance counts [DEBUG] There are no new bugs found in [sharelib/hive]. [TRACE] New XMLLib present, calling 'xmllint --xpath' to get bug instance counts [DEBUG] There are no new bugs found in [sharelib/hcatalog]. [TRACE] New XMLLib present, calling 'xmllint --xpath' to get bug instance counts [DEBUG] There are no new bugs found in [sharelib/sqoop]. [TRACE] New XMLLib present, calling 'xmllint --xpath' to get bug instance counts [DEBUG] There are no new bugs found in [sharelib/oozie]. [TRACE] New XMLLib present, calling 'xmllint --xpath' to get bug instance counts [DEBUG] There are no new bugs found in [sharelib/distcp]. [TRACE] New XMLLib present, calling 'xmllint --xpath' to get bug instance counts [DEBUG] There are no new bugs found in [sharelib/spark]. [TRACE] New XMLLib present, calling 'xmllint --xpath' to get bug instance counts [DEBUG] There are no new bugs found in [client]. [INFO] There are no new bugs found totally]. 
[TRACE] FindBugs diffs checked and reports created [TRACE] Summary file size is 2365 bytes [TRACE] Full summary file size is 1314 bytes [TRACE] File [/home/jenkins/jenkins-slave/workspace/PreCommit-OOZIE-Build/test-patch/tmp/FINDBUGS_DIFF/diff/findbugs-diff-0.1.0-all.jar] removed [TRACE] File [/home/jenkins/jenkins-slave/workspace/PreCommit-OOZIE-Build/test-patch/tmp/FINDBUGS_DIFF/diff/findbugs-diff-0.1.0-all.jar.md5sum] removed Running test-patch task BACKWARDS_COMPATIBILITY Running test-patch task TESTS Running test-patch task DISTRO Testing JIRA OOZIE-3260 Cleaning local git workspace +1 PATCH_APPLIES +1 CLEAN +1 RAW_PATCH_ANALYSIS +1 the patch does not introduce any @author tags +1 the patch does not introduce any tabs +1 the patch does not introduce any trailing spaces +1 the patch does not introduce any line longer than 132 +1 the patch adds/modifies 2 testcase(s) +1 RAT +1 the patch does not seem to introduce new RAT warnings +1 JAVADOC +1 JAVADOC +1 the patch does not seem to introduce new Javadoc warning(s) +1 the patch does not seem to introduce new Javadoc error(s) ERROR: the current HEAD has 2 Javadoc error(s) +1 COMPILE +1 HEAD compiles +1 patch compiles +1 the patch does not seem to introduce new javac warnings +1 There are no new bugs found in total. +1 There are no new bugs found in [examples]. +1 There are no new bugs found in [webapp]. +1 There are no new bugs found in [core]. +1 There are no new bugs found in [tools]. +1 There are no new bugs found in [server]. +1 There are no new bugs found in [docs]. +1 There are no new bugs found in [sharelib/hive2]. +1 There are no new bugs found in [sharelib/pig]. +1 There are no new bugs found in [sharelib/streaming]. +1 There are no new bugs found in [sharelib/hive]. +1 There are no new bugs found in [sharelib/hcatalog]. +1 There are no new bugs found in [sharelib/sqoop]. +1 There are no new bugs found in [sharelib/oozie]. +1 There are no new bugs found in [sharelib/distcp]. 
+1 There are no new bugs found in [sharelib/spark]. +1 There are no new bugs found in [client]. +1 BACKWARDS_COMPATIBILITY +1 the patch does not change any JPA Entity/Colum/Basic/Lob/Transient annotations +1 the patch does not modify JPA files +1 TESTS Tests run: 2134 +1 DISTRO +1 distro tarball builds with the patch +1 Overall result, good!, no -1s The full output of the test-patch run is available at https://builds.apache.org/job/PreCommit-OOZIE-Build/594/ Adding comment to JIRA {"self":"https://issues.apache.org/jira/rest/api/2/issue/13162142/comment/16496423","id":"16496423","author":{"self":"https://issues.apache.org/jira/rest/api/2/user?username=hadoopqa","name":"hadoopqa","key":"hadoopqa","emailAddress":"blackhole at hadoop dot apache dot
[jira] [Commented] (OOZIE-3260) [sla] Remove stale item above max retries on JPA related errors from in-memory SLA map
[ https://issues.apache.org/jira/browse/OOZIE-3260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16496423#comment-16496423 ] Hadoop QA commented on OOZIE-3260: -- Testing JIRA OOZIE-3260 Cleaning local git workspace {color:green}+1 PATCH_APPLIES{color} {color:green}+1 CLEAN{color} {color:green}+1 RAW_PATCH_ANALYSIS{color} .{color:green}+1{color} the patch does not introduce any @author tags .{color:green}+1{color} the patch does not introduce any tabs .{color:green}+1{color} the patch does not introduce any trailing spaces .{color:green}+1{color} the patch does not introduce any line longer than 132 .{color:green}+1{color} the patch adds/modifies 2 testcase(s) {color:green}+1 RAT{color} .{color:green}+1{color} the patch does not seem to introduce new RAT warnings {color:green}+1 JAVADOC{color} {color:green}+1 JAVADOC{color} .{color:green}+1{color} the patch does not seem to introduce new Javadoc warning(s) .{color:green}+1{color} the patch does not seem to introduce new Javadoc error(s) .{color:red}ERROR{color}: the current HEAD has 2 Javadoc error(s) {color:green}+1 COMPILE{color} .{color:green}+1{color} HEAD compiles .{color:green}+1{color} patch compiles .{color:green}+1{color} the patch does not seem to introduce new javac warnings {color:green}+1{color} There are no new bugs found in total. . {color:green}+1{color} There are no new bugs found in [examples]. . {color:green}+1{color} There are no new bugs found in [webapp]. . {color:green}+1{color} There are no new bugs found in [core]. . {color:green}+1{color} There are no new bugs found in [tools]. . {color:green}+1{color} There are no new bugs found in [server]. . {color:green}+1{color} There are no new bugs found in [docs]. . {color:green}+1{color} There are no new bugs found in [sharelib/hive2]. . {color:green}+1{color} There are no new bugs found in [sharelib/pig]. . {color:green}+1{color} There are no new bugs found in [sharelib/streaming]. . 
{color:green}+1{color} There are no new bugs found in [sharelib/hive]. . {color:green}+1{color} There are no new bugs found in [sharelib/hcatalog]. . {color:green}+1{color} There are no new bugs found in [sharelib/sqoop]. . {color:green}+1{color} There are no new bugs found in [sharelib/oozie]. . {color:green}+1{color} There are no new bugs found in [sharelib/distcp]. . {color:green}+1{color} There are no new bugs found in [sharelib/spark]. . {color:green}+1{color} There are no new bugs found in [client]. {color:green}+1 BACKWARDS_COMPATIBILITY{color} .{color:green}+1{color} the patch does not change any JPA Entity/Colum/Basic/Lob/Transient annotations .{color:green}+1{color} the patch does not modify JPA files {color:green}+1 TESTS{color} .Tests run: 2134 {color:green}+1 DISTRO{color} .{color:green}+1{color} distro tarball builds with the patch {color:green}*+1 Overall result, good!, no -1s*{color} The full output of the test-patch run is available at . https://builds.apache.org/job/PreCommit-OOZIE-Build/594/ > [sla] Remove stale item above max retries on JPA related errors from > in-memory SLA map > -- > > Key: OOZIE-3260 > URL: https://issues.apache.org/jira/browse/OOZIE-3260 > Project: Oozie > Issue Type: Bug > Components: coordinator, core, workflow >Affects Versions: 5.0.0 >Reporter: Andras Piros >Assignee: Andras Piros >Priority: Major > Attachments: OOZIE-3260.001.patch > > > Despite having implemented OOZIE-3134, there are still cases where > {{SLACalculatorMemory#slaMap}} and database contents still get out of sync. 
> Some possibilities including but not limited to: > * database contents of {{SLA_SUMMARY}} table have been purged manually from > DB > * no corresponding {{WF_JOBS}} or {{COORD_JOBS}} entries exist anymore in DB > * the {{WF_JOBS}} or {{COORD_JOBS}} instance that is being tracked by the > {{SLACalcStatus}} instances inside {{SLACalculatorMemory#slaMap}} is not yet > persisted to database when the SLA entry is already processed by > {{SLACalculatorMemory.HistoryPurgeWorker}}. Depending on e.g. how many > coordinator actions are being materialized, it can very well happen that > {{SLACalcStatus}} entries inserted to the in-memory map will be processed > before their corresponding {{CoordActionBean}} entries are yet to be > persisted to database > In those rare cases, we see {{JPAExecutorException}} instances like: > {noformat} > 2017-10-09 17:00:18,185 DEBUG openjpa.jdbc.SQL: SERVER[HOST] conn 1584126245> [0 ms] spent > 2017-10-09 17:00:18,185 ERROR org.apache.oozie.sla.SLACalculatorMemory: > SERVER[tplhc01c001.iuser.iroot.adidom.com] USER[-] GROUP[-] TOKEN[-] APP[-] >
[jira] [Created] (OOZIE-3270) Upgrade Derby to 10.14.1.0
Peter Cseh created OOZIE-3270:
-

Summary: Upgrade Derby to 10.14.1.0
Key: OOZIE-3270
URL: https://issues.apache.org/jira/browse/OOZIE-3270
Project: Oozie
Issue Type: Improvement
Reporter: Peter Cseh
Assignee: Peter Cseh

We should upgrade Derby to 10.14.1.0
[jira] [Updated] (OOZIE-3260) [sla] Remove stale item above max retries on JPA related errors from in-memory SLA map
[ https://issues.apache.org/jira/browse/OOZIE-3260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andras Piros updated OOZIE-3260:

Summary: [sla] Remove stale item above max retries on JPA related errors from in-memory SLA map (was: [sla] Remove stale item on JPA related errors from in-memory SLA map)

> [sla] Remove stale item above max retries on JPA related errors from in-memory SLA map
> --
>
> Key: OOZIE-3260
> URL: https://issues.apache.org/jira/browse/OOZIE-3260
> Project: Oozie
> Issue Type: Bug
> Components: coordinator, core, workflow
> Affects Versions: 5.0.0
> Reporter: Andras Piros
> Assignee: Andras Piros
> Priority: Major
>
> Despite having implemented OOZIE-3134, there are still cases where {{SLACalculatorMemory#slaMap}} and database contents still get out of sync.
> Some possibilities including but not limited to:
> * database contents of {{SLA_SUMMARY}} table have been purged manually from DB
> * no corresponding {{WF_JOBS}} or {{COORD_JOBS}} entries exist anymore in DB
> * the {{WF_JOBS}} or {{COORD_JOBS}} instance that is being tracked by the {{SLACalcStatus}} instances inside {{SLACalculatorMemory#slaMap}} is not yet persisted to database when the SLA entry is already processed by {{SLACalculatorMemory.HistoryPurgeWorker}}. Depending on e.g. how many coordinator actions are being materialized, it can very well happen that {{SLACalcStatus}} entries inserted to the in-memory map will be processed before their corresponding {{CoordActionBean}} entries are yet to be persisted to database
>
> In those rare cases, we see {{JPAExecutorException}} instances like:
> {noformat}
> 2017-10-09 17:00:18,185 DEBUG openjpa.jdbc.SQL: SERVER[HOST] conn 1584126245> [0 ms] spent
> 2017-10-09 17:00:18,185 ERROR org.apache.oozie.sla.SLACalculatorMemory: SERVER[tplhc01c001.iuser.iroot.adidom.com] USER[-] GROUP[-] TOKEN[-] APP[-] JOB[438-170916014916144-oozie-oozi-C@556] ACTION[-] Exception in SLA processing for job [438-170916014916144-oozie-oozi-C@556]
> org.apache.oozie.executor.jpa.JPAExecutorException: E0604: Job does not exist [select w.eventProcessed from SLASummaryBean w where w.jobId = :id]
> at org.apache.oozie.executor.jpa.SLASummaryQueryExecutor.getSingleValue(SLASummaryQueryExecutor.java:161)
> at org.apache.oozie.sla.SLACalculatorMemory.updateJobSla(SLACalculatorMemory.java:480)
> at org.apache.oozie.sla.SLACalculatorMemory.updateAllSlaStatus(SLACalculatorMemory.java:601)
> {noformat}
> Solution here is to track the number of times the {{SLACalcStatus}} entry has not been processed successfully, and when a preconfigured {{oozie.sla.service.SLAService.maximum.retry.count}} is reached, remove any {{SLACalculatorMemory#slaMap}} entries that are causing those {{JPAExecutorException}} instances, to not cause huge logfiles. The items to be logged don't exist, anyways.
>
> It's still possible that multiple {{CoordActionBean}} instances being inserted won't have {{SLACalcStatus}} entries inside {{SLACalculatorMemory#slaMap}} by the time written to database, and thus, no SLA will be tracked. In those rare cases, preconfigured maximum retry count can be extended.
[jira] [Updated] (OOZIE-3260) [sla] Remove stale item on JPA related errors from in-memory SLA map
[ https://issues.apache.org/jira/browse/OOZIE-3260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andras Piros updated OOZIE-3260:

Summary: [sla] Remove stale item on JPA related errors from in-memory SLA map (was: [sla] Remove item on JPA related errors from in-memory SLA map)

> [sla] Remove stale item on JPA related errors from in-memory SLA map
>
> Key: OOZIE-3260
> URL: https://issues.apache.org/jira/browse/OOZIE-3260
> Project: Oozie
> Issue Type: Bug
> Components: coordinator, core, workflow
> Affects Versions: 5.0.0
> Reporter: Andras Piros
> Assignee: Andras Piros
> Priority: Major
[jira] [Updated] (OOZIE-3260) [sla] Remove stale item above max retries on JPA related errors from in-memory SLA map
[ https://issues.apache.org/jira/browse/OOZIE-3260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andras Piros updated OOZIE-3260:

Attachment: OOZIE-3260.001.patch

> [sla] Remove stale item above max retries on JPA related errors from in-memory SLA map
> --
>
> Key: OOZIE-3260
> URL: https://issues.apache.org/jira/browse/OOZIE-3260
> Project: Oozie
> Issue Type: Bug
> Components: coordinator, core, workflow
> Affects Versions: 5.0.0
> Reporter: Andras Piros
> Assignee: Andras Piros
> Priority: Major
> Attachments: OOZIE-3260.001.patch
>
> Despite having implemented OOZIE-3134, there are still cases where {{SLACalculatorMemory#slaMap}} and database contents still get out of sync.
> Some possibilities including but not limited to:
> * database contents of {{SLA_SUMMARY}} table have been purged manually from DB
> * no corresponding {{WF_JOBS}} or {{COORD_JOBS}} entries exist anymore in DB
> * the {{WF_JOBS}} or {{COORD_JOBS}} instance that is being tracked by the {{SLACalcStatus}} instances inside {{SLACalculatorMemory#slaMap}} is not yet persisted to database when the SLA entry is already processed by {{SLACalculatorMemory.HistoryPurgeWorker}}. Depending on e.g. how many coordinator actions are being materialized, it can very well happen that {{SLACalcStatus}} entries inserted to the in-memory map will be processed before their corresponding {{CoordActionBean}} entries are yet to be persisted to database
>
> In those rare cases, we see {{JPAExecutorException}} instances like:
> {noformat}
> 2017-10-09 17:00:18,185 DEBUG openjpa.jdbc.SQL: SERVER[HOST] conn 1584126245> [0 ms] spent
> 2017-10-09 17:00:18,185 ERROR org.apache.oozie.sla.SLACalculatorMemory: SERVER[tplhc01c001.iuser.iroot.adidom.com] USER[-] GROUP[-] TOKEN[-] APP[-] JOB[438-170916014916144-oozie-oozi-C@556] ACTION[-] Exception in SLA processing for job [438-170916014916144-oozie-oozi-C@556]
> org.apache.oozie.executor.jpa.JPAExecutorException: E0604: Job does not exist [select w.eventProcessed from SLASummaryBean w where w.jobId = :id]
> at org.apache.oozie.executor.jpa.SLASummaryQueryExecutor.getSingleValue(SLASummaryQueryExecutor.java:161)
> at org.apache.oozie.sla.SLACalculatorMemory.updateJobSla(SLACalculatorMemory.java:480)
> at org.apache.oozie.sla.SLACalculatorMemory.updateAllSlaStatus(SLACalculatorMemory.java:601)
> {noformat}
> Solution here is to track the number of times the {{SLACalcStatus}} entry has not been processed successfully, and when a preconfigured {{oozie.sla.service.SLAService.maximum.retry.count}} is reached, remove any {{SLACalculatorMemory#slaMap}} entries that are causing those {{JPAExecutorException}} instances, to not cause huge logfiles. The items to be logged don't exist, anyways.
>
> It's still possible that multiple {{CoordActionBean}} instances being inserted won't have {{SLACalcStatus}} entries inside {{SLACalculatorMemory#slaMap}} by the time written to database, and thus, no SLA will be tracked. In those rare cases, preconfigured maximum retry count can be extended.
>
> Note that current implementation of [*{{SLACalculatorMemory#updateJobSla()}}*|https://github.com/apache/oozie/blob/master/core/src/main/java/org/apache/oozie/sla/SLACalculatorMemory.java#L238-L242] already removes the stale {{SLACalcStatus}} entry. The new functionality here is to introduce {{SLACalcStatus#retryCount}}, and extend the {{JPAExecutorException}} {{ErrorCode}}s of interest.
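The retry-count idea in the OOZIE-3260 description can be illustrated with a small standalone sketch. This is not the code from OOZIE-3260.001.patch and it does not use Oozie classes: SlaRetrySketch, Entry and SummaryLookup are hypothetical names, Entry stands in for SLACalcStatus with only a retry counter, and the constructor argument plays the role of oozie.sla.service.SLAService.maximum.retry.count. It only assumes what the description states: count unsuccessful processing attempts per entry and drop the entry from the in-memory map once the configured maximum is reached.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical illustration only; none of these types exist in Oozie.
final class SlaRetrySketch {

    /** Stand-in for SLACalcStatus; only the retry counter is modeled. */
    static final class Entry {
        private int retryCount;

        int incrementAndGetRetryCount() {
            return ++retryCount;
        }
    }

    /** Stand-in for the database lookup; throws when the row is missing. */
    interface SummaryLookup {
        void load(String jobId) throws Exception;
    }

    private final Map<String, Entry> slaMap = new ConcurrentHashMap<>();
    private final int maxRetryCount;
    private final SummaryLookup lookup;

    SlaRetrySketch(int maxRetryCount, SummaryLookup lookup) {
        this.maxRetryCount = maxRetryCount;
        this.lookup = lookup;
    }

    void track(String jobId) {
        slaMap.put(jobId, new Entry());
    }

    boolean isTracked(String jobId) {
        return slaMap.containsKey(jobId);
    }

    /** Runs one update pass for a job; returns true while the entry stays in the map. */
    boolean updateJobSla(String jobId) {
        Entry entry = slaMap.get(jobId);
        if (entry == null) {
            return false;
        }
        try {
            lookup.load(jobId);
            return true;
        } catch (Exception e) {
            // Count the failed attempt; evict the stale entry once the limit is reached.
            if (entry.incrementAndGetRetryCount() >= maxRetryCount) {
                slaMap.remove(jobId);
                return false;
            }
            return true;
        }
    }

    public static void main(String[] args) {
        SlaRetrySketch sketch = new SlaRetrySketch(3, jobId -> {
            throw new Exception("E0604: Job does not exist");
        });
        sketch.track("job-1");
        for (int pass = 1; pass <= 3; pass++) {
            System.out.println("pass " + pass + " tracked=" + sketch.updateJobSla("job-1"));
        }
    }
}
```

With a limit of 3 the entry survives two failed passes and is evicted on the third, so the same exception is no longer logged on every later pass. Raising the limit gives slow database writes more time to land, which matches the remark that the maximum retry count can be extended.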
[jira] [Commented] (OOZIE-3156) SSH action status turns OK wrongly when failed to connect to host
[ https://issues.apache.org/jira/browse/OOZIE-3156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16496381#comment-16496381 ]

Andras Piros commented on OOZIE-3156:
-

Thanks for the new patch [~txsing]! Following is the next round of comments:
* {{SshActionExecutor#handleRetry()}}: {{sleepBeforeRetryMs /= 2;}} should rather be {{sleepBeforeRetryMs *= 2;}}
* the return value of {{SshActionExecutor#handleRetry()}} is not reused in the caller code, so it isn't really an exponential backoff: {{initWaitTime}} will always be reused
* in {{TestSshActionExecutor#testSshCheckWithHostConnectFailure()}} it's unclear to me whether {{echo "prop1=something"}} would always fail the first time. We need to inject a failure somehow to be on the safe side, or, if that is already present, extract methods of the test case with appropriate names to make clear what's going on
* extending {{DG_SshActionExtension.twiki}} goes in the right direction. Still, we need to introduce {{oozie-default.xml#oozie.action.ssh.check.retries.max}} with the default value {{3}}, and mention it in the docs as well

> SSH action status turns OK wrongly when failed to connect to host
> -
>
> Key: OOZIE-3156
> URL: https://issues.apache.org/jira/browse/OOZIE-3156
> Project: Oozie
> Issue Type: Bug
> Components: action
> Affects Versions: 5.0.0
> Reporter: TIAN XING
> Assignee: TIAN XING
> Priority: Major
> Attachments: OOZIE-3156-v1.patch, OOZIE-3156-v2.patch, OOZIE-3156-v3.patch, ssh-check-bug.patch
>
> When the {{check()}} method of {{SshActionExecutor}} gets invoked, Oozie connects to the host over ssh and checks whether the pid of the process that the ssh action started is still there (by checking the return value of the command "{{ssh ps -p }}") to determine whether the ssh action has completed or not.
> However, we found cases where Oozie fails to connect to the host during the action status check (e.g., the host is under heavy load, or the network is bad etc.).
> In such cases, the return value of the command "{{ssh ps -p }}" will be 255 (the ssh command exits with the exit status of the remote command, or with 255 if an error occurred).
> According to the current logic of the method {{getActionStatus()}} in {{SshActionExecutor}}, the action status will be determined as OK, which may not be correct.
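The two review points on retry handling (double the wait instead of halving it, and make the caller keep the updated wait) can be sketched together with the exit-code distinction from the issue description. This is a hypothetical standalone example, not the contents of any OOZIE-3156 patch: SshCheckRetrySketch, Status, classify, nextWaitMs and checkWithRetries are invented names, and the mapping of exit code 0 to "still running" and any other non-255 code to "finished" is an assumption about how the remote {{ps -p}} check reports its result. Only the meaning of 255 (ssh itself failed) comes from the description.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntSupplier;
import java.util.function.LongConsumer;

// Hypothetical illustration only; not SshActionExecutor code.
final class SshCheckRetrySketch {

    enum Status { RUNNING, DONE, CONNECT_FAILED }

    /** ssh exits with 255 when the connection itself failed. */
    static final int SSH_ERROR_EXIT_CODE = 255;

    /** Assumed mapping: 0 means the pid was found, 255 means ssh failed, anything else means the pid is gone. */
    static Status classify(int exitCode) {
        if (exitCode == 0) {
            return Status.RUNNING;
        }
        if (exitCode == SSH_ERROR_EXIT_CODE) {
            return Status.CONNECT_FAILED;
        }
        return Status.DONE;
    }

    /** Doubles the wait; the caller has to store the returned value. */
    static long nextWaitMs(long currentWaitMs) {
        return currentWaitMs * 2;
    }

    /** Retries only on connection failures, sleeping with exponential backoff in between. */
    static Status checkWithRetries(IntSupplier remoteCheck, int maxRetries,
                                   long initWaitMs, LongConsumer sleeper) {
        long waitMs = initWaitMs;
        Status status = classify(remoteCheck.getAsInt());
        for (int retry = 0; retry < maxRetries && status == Status.CONNECT_FAILED; retry++) {
            sleeper.accept(waitMs);
            waitMs = nextWaitMs(waitMs); // keep the doubled value for the next round
            status = classify(remoteCheck.getAsInt());
        }
        return status;
    }

    public static void main(String[] args) {
        int[] exitCodes = {255, 255, 0}; // two connection failures, then success
        int[] call = {0};
        List<Long> waits = new ArrayList<>();
        Status status = checkWithRetries(() -> exitCodes[call[0]++], 3, 100L, ms -> waits.add(ms));
        System.out.println(status + " after waits " + waits);
    }
}
```

The sleeper is passed in so that a test can record the waits instead of sleeping, which is one way to inject the connection failure deterministically, as the review asks. If all retries end in CONNECT_FAILED, the caller sees that status rather than OK.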
[jira] [Commented] (OOZIE-3260) [sla] Remove stale item above max retries on JPA related errors from in-memory SLA map
[ https://issues.apache.org/jira/browse/OOZIE-3260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16496361#comment-16496361 ]

Hadoop QA commented on OOZIE-3260:
--

PreCommit-OOZIE-Build started

> [sla] Remove stale item above max retries on JPA related errors from in-memory SLA map
> --
>
> Key: OOZIE-3260
> URL: https://issues.apache.org/jira/browse/OOZIE-3260
> Project: Oozie
> Issue Type: Bug
> Components: coordinator, core, workflow
> Affects Versions: 5.0.0
> Reporter: Andras Piros
> Assignee: Andras Piros
> Priority: Major
> Attachments: OOZIE-3260.001.patch
[jira] [Commented] (OOZIE-3269) Flaky tests in TestCoordMaterializeTriggerService class
[ https://issues.apache.org/jira/browse/OOZIE-3269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16496382#comment-16496382 ]

Andras Piros commented on OOZIE-3269:
-

Thanks for the contribution [~pbacsko]!

+1

> Flaky tests in TestCoordMaterializeTriggerService class
> ---
>
> Key: OOZIE-3269
> URL: https://issues.apache.org/jira/browse/OOZIE-3269
> Project: Oozie
> Issue Type: Sub-task
> Components: coordinator, core, tests
> Reporter: Peter Bacsko
> Assignee: Peter Bacsko
> Priority: Major
> Attachments: OOZIE-3269-001.patch
>
> The tests in TestCoordMaterializeTriggerService can fail with various problems.
> For example, {{testCoordMaterializeTriggerService3}} typically fails with:
> {noformat}
> junit.framework.AssertionFailedError: expected: but was:
> at junit.framework.Assert.fail(Assert.java:57)
> at junit.framework.Assert.failNotEquals(Assert.java:329)
> at junit.framework.Assert.assertEquals(Assert.java:78)
> at junit.framework.Assert.assertEquals(Assert.java:86)
> at junit.framework.TestCase.assertEquals(TestCase.java:253)
> at org.apache.oozie.service.TestCoordMaterializeTriggerService.testCoordMaterializeTriggerService3(TestCoordMaterializeTriggerService.java:151)
> {noformat}
> The reason is that {{CoordMaterializeTriggerService}} is running in the background which is what we're trying to test and it interferes with the test execution:
> {noformat}
> 05:59:17,474 [CallableQueue-2] DEBUG CoordMaterializeTransitionXCommand:526 - USER[test] GROUP[testg] TOKEN[] APP[COORD-TEST] JOB[002-180529055913420-oozie-root-C] ACTION[-] Coordinator job :002-180529055913420-oozie-root-C, maxActionToBeCreated :1, Mat_Throttle :1, numWaitingActions :0
> 05:59:17,475 [CallableQueue-2] DEBUG CoordMaterializeTransitionXCommand:526 - USER[test] GROUP[testg] TOKEN[] APP[COORD-TEST] JOB[002-180529055913420-oozie-root-C] ACTION[-] Materializing action for time=2018-05-29T12:59Z, lastactionnumber=1 timeout=0 minutes
> 05:59:17,475 [CallableQueue-2] WARN DateUtils:523 - USER[test] GROUP[testg] TOKEN[] APP[COORD-TEST] JOB[002-180529055913420-oozie-root-C] ACTION[-] GMT, UTC or Region/City Timezone formats are preferred instead of America/Los_Angeles
> 05:59:17,475 [CallableQueue-2] DEBUG CoordMaterializeTransitionXCommand:526 - USER[test] GROUP[testg] TOKEN[] APP[COORD-TEST] JOB[002-180529055913420-oozie-root-C] ACTION[-] In storeToDB() coord action id = 002-180529055913420-oozie-root-C@1, size of actionXml = 1129
> 05:59:17,476 [CallableQueue-2] DEBUG CoordMaterializeTransitionXCommand:526 - USER[test] GROUP[testg] TOKEN[] APP[COORD-TEST] JOB[002-180529055913420-oozie-root-C] ACTION[-] Not registering SLA for job [002-180529055913420-oozie-root-C@1]. Sla-Xml null OR SLAService not enabled
> 05:59:17,476 [CallableQueue-2] INFO CoordMaterializeTransitionXCommand:520 - USER[test] GROUP[testg] TOKEN[] APP[COORD-TEST] JOB[002-180529055913420-oozie-root-C] ACTION[-] [002-180529055913420-oozie-root-C]: Update status from PREP to RUNNING
> 05:59:17,476 [CallableQueue-2] WARN DateUtils:523 - USER[test] GROUP[testg] TOKEN[] APP[COORD-TEST] JOB[002-180529055913420-oozie-root-C] ACTION[-] GMT, UTC or Region/City Timezone formats are preferred instead of America/Los_Angeles
> 05:59:17,476 [CallableQueue-2] INFO CoordMaterializeTransitionXCommand:520 - USER[test] GROUP[testg] TOKEN[] APP[COORD-TEST] JOB[002-180529055913420-oozie-root-C] ACTION[-] [002-180529055913420-oozie-root-C]: all actions have been materialized, set pending to true
> 05:59:17,476 [CallableQueue-2] INFO CoordMaterializeTransitionXCommand:520 - USER[test] GROUP[testg] TOKEN[] APP[COORD-TEST] JOB[002-180529055913420-oozie-root-C] ACTION[-] Coord Job status updated to = RUNNING
> {noformat}
[jira] [Updated] (OOZIE-3260) [sla] Remove stale item above max retries on JPA related errors from in-memory SLA map
[ https://issues.apache.org/jira/browse/OOZIE-3260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andras Piros updated OOZIE-3260: Description: Despite having implemented OOZIE-3134, there are still cases where {{SLACalculatorMemory#slaMap}} and database contents still get out of sync. Some possibilities including but not limited to: * database contents of {{SLA_SUMMARY}} table have been purged manually from DB * no corresponding {{WF_JOBS}} or {{COORD_JOBS}} entries exist anymore in DB * the {{WF_JOBS}} or {{COORD_JOBS}} instance that is being tracked by the {{SLACalcStatus}} instances inside {{SLACalculatorMemory#slaMap}} is not yet persisted to database when the SLA entry is already processed by {{SLACalculatorMemory.HistoryPurgeWorker}}. Depending on e.g. how many coordinator actions are being materialized, it can very well happen that {{SLACalcStatus}} entries inserted to the in-memory map will be processed before their corresponding {{CoordActionBean}} entries are yet to be persisted to database In those rare cases, we see {{JPAExecutorException}} instances like: {noformat} 2017-10-09 17:00:18,185 DEBUG openjpa.jdbc.SQL: SERVER[HOST] [0 ms] spent 2017-10-09 17:00:18,185 ERROR org.apache.oozie.sla.SLACalculatorMemory: SERVER[tplhc01c001.iuser.iroot.adidom.com] USER[-] GROUP[-] TOKEN[-] APP[-] JOB[438-170916014916144-oozie-oozi-C@556] ACTION[-] Exception in SLA processing for job [438-170916014916144-oozie-oozi-C@556] org.apache.oozie.executor.jpa.JPAExecutorException: E0604: Job does not exist [select w.eventProcessed from SLASummaryBean w where w.jobId = :id] at org.apache.oozie.executor.jpa.SLASummaryQueryExecutor.getSingleValue(SLASummaryQueryExecutor.java:161) at org.apache.oozie.sla.SLACalculatorMemory.updateJobSla(SLACalculatorMemory.java:480) at org.apache.oozie.sla.SLACalculatorMemory.updateAllSlaStatus(SLACalculatorMemory.java:601) {noformat} Solution here is to track the number of times the {{SLACalcStatus}} entry has not been 
processed successfully, and when a preconfigured {{oozie.sla.service.SLAService.maximum.retry.count}} is reached, remove any {{SLACalculatorMemory#slaMap}} entries that are causing those {{JPAExecutorException}} instances, to not cause huge logfiles. The items to be logged don't exist, anyways. It's still possible that multiple {{CoordActionBean}} instances being inserted won't have {{SLACalcStatus}} entries inside {{SLACalculatorMemory#slaMap}} by the time written to database, and thus, no SLA will be tracked. In those rare cases, preconfigured maximum retry count can be extended. Note that current implementation of [*{{SLACalculatorMemory#updateJobSla()}}*|https://github.com/apache/oozie/blob/master/core/src/main/java/org/apache/oozie/sla/SLACalculatorMemory.java#L238-L242] already removes the stale {{SLACalcStatus}} entry. The new functionality here is to introduce {{SLACalcStatus#retryCount}}, and extend the {{JPAExecutorException}} {{ErrorCode}}s of interest. was: Despite having implemented OOZIE-3134, there are still cases where {{SLACalculatorMemory#slaMap}} and database contents still get out of sync. Some possibilities including but not limited to: * database contents of {{SLA_SUMMARY}} table have been purged manually from DB * no corresponding {{WF_JOBS}} or {{COORD_JOBS}} entries exist anymore in DB * the {{WF_JOBS}} or {{COORD_JOBS}} instance that is being tracked by the {{SLACalcStatus}} instances inside {{SLACalculatorMemory#slaMap}} is not yet persisted to database when the SLA entry is already processed by {{SLACalculatorMemory.HistoryPurgeWorker}}. Depending on e.g. 
how many coordinator actions are being materialized, it can very well happen that {{SLACalcStatus}} entries inserted to the in-memory map will be processed before their corresponding {{CoordActionBean}} entries are yet to be persisted to database In those rare cases, we see {{JPAExecutorException}} instances like: {noformat} 2017-10-09 17:00:18,185 DEBUG openjpa.jdbc.SQL: SERVER[HOST] [0 ms] spent 2017-10-09 17:00:18,185 ERROR org.apache.oozie.sla.SLACalculatorMemory: SERVER[tplhc01c001.iuser.iroot.adidom.com] USER[-] GROUP[-] TOKEN[-] APP[-] JOB[438-170916014916144-oozie-oozi-C@556] ACTION[-] Exception in SLA processing for job [438-170916014916144-oozie-oozi-C@556] org.apache.oozie.executor.jpa.JPAExecutorException: E0604: Job does not exist [select w.eventProcessed from SLASummaryBean w where w.jobId = :id] at org.apache.oozie.executor.jpa.SLASummaryQueryExecutor.getSingleValue(SLASummaryQueryExecutor.java:161) at org.apache.oozie.sla.SLACalculatorMemory.updateJobSla(SLACalculatorMemory.java:480) at org.apache.oozie.sla.SLACalculatorMemory.updateAllSlaStatus(SLACalculatorMemory.java:601) {noformat} Solution here is to track the number of times the {{SLACalcStatus}} entry has not been processed
[jira] [Commented] (OOZIE-3271) Flaky tests in TestCoordInputLogicPush class
[ https://issues.apache.org/jira/browse/OOZIE-3271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16497055#comment-16497055 ] Peter Bacsko commented on OOZIE-3271: - I was able to reproduce this locally. It was running on DEBUG level and the root cause seems to be an excessive amount of calls to log4j. There's a synchronized method somewhere (or probably more than one) which slows down everything if multiple threads are logging extensively. Although I remember seeing these tests fail on INFO level, too. > Flaky tests in TestCoordInputLogicPush class > > > Key: OOZIE-3271 > URL: https://issues.apache.org/jira/browse/OOZIE-3271 > Project: Oozie > Issue Type: Sub-task >Reporter: Peter Bacsko >Priority: Major > > Running locally, two tests in {{TestCoordInputLogicPush}} failed: > {noformat} > testLatestRange(org.apache.oozie.coord.input.logic.TestCoordInputLogicPush) > Time elapsed: 132.734 s <<< FAILURE! > junit.framework.AssertionFailedError: Action status should not be waiting > at junit.framework.Assert.fail(Assert.java:57) > at junit.framework.Assert.assertTrue(Assert.java:22) > at junit.framework.Assert.assertFalse(Assert.java:39) > at junit.framework.TestCase.assertFalse(TestCase.java:210) > at > org.apache.oozie.coord.input.logic.TestCoordInputLogicPush.startCoordAction(TestCoordInputLogicPush.java:570) > at > org.apache.oozie.coord.input.logic.TestCoordInputLogicPush.testLatestRange(TestCoordInputLogicPush.java:255) > {noformat} > {noformat} > testLatestRangeComplex(org.apache.oozie.coord.input.logic.TestCoordInputLogicPush) > Time elapsed: 159.055 s <<< FAILURE! 
> junit.framework.AssertionFailedError: Action status should not be waiting > at junit.framework.Assert.fail(Assert.java:57) > at junit.framework.Assert.assertTrue(Assert.java:22) > at junit.framework.Assert.assertFalse(Assert.java:39) > at junit.framework.TestCase.assertFalse(TestCase.java:210) > at > org.apache.oozie.coord.input.logic.TestCoordInputLogicPush.startCoordAction(TestCoordInputLogicPush.java:570) > at > org.apache.oozie.coord.input.logic.TestCoordInputLogicPush.testLatestRangeComplex(TestCoordInputLogicPush.java:330) > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (OOZIE-3270) Upgrade Derby to 10.14.1.0
[ https://issues.apache.org/jira/browse/OOZIE-3270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16496742#comment-16496742 ] Hadoop QA commented on OOZIE-3270: -- PreCommit-OOZIE-Build started > Upgrade Derby to 10.14.1.0 > --- > > Key: OOZIE-3270 > URL: https://issues.apache.org/jira/browse/OOZIE-3270 > Project: Oozie > Issue Type: Improvement >Reporter: Peter Cseh >Assignee: Peter Cseh >Priority: Major > Attachments: OOZIE-3270.01.patch > > > We should upgrade Derby to 10.14.1.0 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
REMINDER: Apache EU Roadshow 2018 in Berlin is less than 2 weeks away!
Hello Apache Supporters and Enthusiasts, This is a reminder that our Apache EU Roadshow in Berlin is less than two weeks away and we need your help to spread the word. Please let your work colleagues, friends and anyone interested in attending know about our Apache EU Roadshow event. We have a great schedule including tracks on Apache Tomcat, Apache HTTP Server, Microservices, Internet of Things (IoT) and Cloud Technologies. You can find more details at the link below: https://s.apache.org/0hnG Ticket prices will be going up on 8th June 2018, so please make sure that you register soon if you want to beat the price increase. https://foss-backstage.de/tickets Remember that registering for the Apache EU Roadshow also gives you access to FOSS Backstage so you can attend any talks and workshops from both conferences. And don’t forget that our Apache Lounge will be open throughout the whole conference as a place to meet up, hack and relax. We look forward to seeing you in Berlin! Thanks Sharan Foga, VP Apache Community Development http://apachecon.com/ @apachecon PLEASE NOTE: You are receiving this message because you are subscribed to a user@ or dev@ list of one or more Apache Software Foundation projects.
[jira] [Comment Edited] (OOZIE-3271) Flaky tests in TestCoordInputLogicPush class
[ https://issues.apache.org/jira/browse/OOZIE-3271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16496634#comment-16496634 ] Peter Bacsko edited comment on OOZIE-3271 at 5/31/18 2:32 PM: -- Not sure what the root cause here is exactly, but for some reason {{CoordActionInputCheckXCommand}} runs for a very long time. This time it was 2 minutes and it was holding the lock of the coordinator action. Therefore the rest of the {{XCommand}} calls did not achieve anything. {noformat} 07:14:51,788 [CallableQueue-1] DEBUG CoordActionInputCheckXCommand:526 - USER[-] GROUP[-] TOKEN[-] APP[-] JOB[000-180529071448012-oozie-root-C] ACTION[000-180529071448012-oozie-root-C@1] Execute command [coord_action_input] key [000-180529071448012-oozie-root-C] 07:14:51,788 [CallableQueue-1] DEBUG CoordActionInputCheckXCommand:526 - USER[-] GROUP[-] TOKEN[-] APP[-] JOB[000-180529071448012-oozie-root-C] ACTION[000-180529071448012-oozie-root-C@1] [000-180529071448012-oozie-root-C@1]::ActionInputCheck:: Action is in WAITING state. 
07:14:51,790 [CallableQueue-1] INFO CoordActionInputCheckXCommand:520 - USER[-] GROUP[-] TOKEN[-] APP[-] JOB[000-180529071448012-oozie-root-C] ACTION[000-180529071448012-oozie-root-C@1] [000-180529071448012-oozie-root-C@1]::CoordActionInputCheck:: Missing deps: ${coord:latestRange(-5,0)}#${coord:latestRange(-5,0)}#${coord:latestRange(-5,0)}#${coord:latestRange(-5,0)}#${coord:latestRange(-5,0)}#${coord:latestRange(-5,0)} 07:14:51,792 [CallableQueue-1] DEBUG CoordInputLogicEvaluatorPhaseOne:526 - USER[-] GROUP[-] TOKEN[-] APP[-] JOB[000-180529071448012-oozie-root-C] ACTION[000-180529071448012-oozie-root-C@1] Data set [A] is unresolved set, will get resolved in phase two 07:14:51,792 [CallableQueue-1] DEBUG CoordInputLogicEvaluatorPhaseOne:526 - USER[-] GROUP[-] TOKEN[-] APP[-] JOB[000-180529071448012-oozie-root-C] ACTION[000-180529071448012-oozie-root-C@1] Data set [B] is unresolved set, will get resolved in phase two 07:14:51,792 [CallableQueue-1] DEBUG CoordInputLogicEvaluatorUtil:526 - USER[-] GROUP[-] TOKEN[-] APP[-] JOB[000-180529071448012-oozie-root-C] ACTION[000-180529071448012-oozie-root-C@1] Input logic expression for [(dependencyBuilder.input("A").build() && dependencyBuilder.input("B").build())] and evaluate result is [PHASE_TWO_EVALUATION] 07:14:51,793 [CallableQueue-1] DEBUG CoordActionInputCheckXCommand:526 - USER[-] GROUP[-] TOKEN[-] APP[-] JOB[000-180529071448012-oozie-root-C] ACTION[000-180529071448012-oozie-root-C@1] [000-180529071448012-oozie-root-C@1]::ActionInputCheck:: Checking Latest/future ... 
(absolutely nothing is logged from this thread for 2 minutes) 07:16:57,191 [CallableQueue-1] DEBUG CoordInputLogicEvaluatorPhaseTwo:526 - USER[-] GROUP[-] TOKEN[-] APP[-] JOB[000-180529071448012-oozie-root-C] ACTION[000-180529071448012-oozie-root-C@1] Resolved status of Data set A with min -1 and wait -1 = false 07:16:57,191 [CallableQueue-1] DEBUG CoordInputLogicEvaluatorUtil:526 - USER[-] GROUP[-] TOKEN[-] APP[-] JOB[000-180529071448012-oozie-root-C] ACTION[000-180529071448012-oozie-root-C@1] Input logic expression for [(dependencyBuilder.input("A").build() && dependencyBuilder.input("B").build())] and evaluate result is [FALSE] ... 07:16:57,195 [CallableQueue-1] DEBUG CoordActionInputCheckXCommand:526 - USER[-] GROUP[-] TOKEN[-] APP[-] JOB[000-180529071448012-oozie-root-C] ACTION[000-180529071448012-oozie-root-C@1] Released lock for [000-180529071448012-oozie-root-C] in [coord_action_input] {noformat} was (Author: pbacsko): Not sure what the root cause here is exactly, but for some reason {{CoordActionInputCheckXCommand}} runs for a very long time. This time it was 2 minutes and it was holding the lock of the coordinator action. Therefore the rest of the {{XCommand}} calls did not achieve anything. {noformat} 07:14:51,788 [CallableQueue-1] DEBUG CoordActionInputCheckXCommand:526 - USER[-] GROUP[-] TOKEN[-] APP[-] JOB[000-180529071448012-oozie-root-C] ACTION[000-180529071448012-oozie-root-C@1] Execute command [coord_action_input] key [000-180529071448012-oozie-root-C] 07:14:51,788 [CallableQueue-1] DEBUG CoordActionInputCheckXCommand:526 - USER[-] GROUP[-] TOKEN[-] APP[-] JOB[000-180529071448012-oozie-root-C] ACTION[000-180529071448012-oozie-root-C@1] [000-180529071448012-oozie-root-C@1]::ActionInputCheck:: Action is in WAITING state. 
07:14:51,790 [CallableQueue-1] INFO CoordActionInputCheckXCommand:520 - USER[-] GROUP[-] TOKEN[-] APP[-] JOB[000-180529071448012-oozie-root-C] ACTION[000-180529071448012-oozie-root-C@1] [000-180529071448012-oozie-root-C@1]::CoordActionInputCheck:: Missing deps:
[jira] [Commented] (OOZIE-3271) Flaky tests in TestCoordInputLogicPush class
[ https://issues.apache.org/jira/browse/OOZIE-3271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16496634#comment-16496634 ] Peter Bacsko commented on OOZIE-3271: - Not sure what the root cause here is exactly, but for some reason {{CoordActionInputCheckXCommand}} runs for a very long time. This time it was 2 minutes and it was holding the lock of the coordinator action. Therefore the rest of the {{XCommand}} calls did not achieve anything. {noformat} 07:14:51,788 [CallableQueue-1] DEBUG CoordActionInputCheckXCommand:526 - USER[-] GROUP[-] TOKEN[-] APP[-] JOB[000-180529071448012-oozie-root-C] ACTION[000-180529071448012-oozie-root-C@1] Execute command [coord_action_input] key [000-180529071448012-oozie-root-C] 07:14:51,788 [CallableQueue-1] DEBUG CoordActionInputCheckXCommand:526 - USER[-] GROUP[-] TOKEN[-] APP[-] JOB[000-180529071448012-oozie-root-C] ACTION[000-180529071448012-oozie-root-C@1] [000-180529071448012-oozie-root-C@1]::ActionInputCheck:: Action is in WAITING state. 
07:14:51,790 [CallableQueue-1] INFO CoordActionInputCheckXCommand:520 - USER[-] GROUP[-] TOKEN[-] APP[-] JOB[000-180529071448012-oozie-root-C] ACTION[000-180529071448012-oozie-root-C@1] [000-180529071448012-oozie-root-C@1]::CoordActionInputCheck:: Missing deps: ${coord:latestRange(-5,0)}#${coord:latestRange(-5,0)}#${coord:latestRange(-5,0)}#${coord:latestRange(-5,0)}#${coord:latestRange(-5,0)}#${coord:latestRange(-5,0)} 07:14:51,792 [CallableQueue-1] DEBUG CoordInputLogicEvaluatorPhaseOne:526 - USER[-] GROUP[-] TOKEN[-] APP[-] JOB[000-180529071448012-oozie-root-C] ACTION[000-180529071448012-oozie-root-C@1] Data set [A] is unresolved set, will get resolved in phase two 07:14:51,792 [CallableQueue-1] DEBUG CoordInputLogicEvaluatorPhaseOne:526 - USER[-] GROUP[-] TOKEN[-] APP[-] JOB[000-180529071448012-oozie-root-C] ACTION[000-180529071448012-oozie-root-C@1] Data set [B] is unresolved set, will get resolved in phase two 07:14:51,792 [CallableQueue-1] DEBUG CoordInputLogicEvaluatorUtil:526 - USER[-] GROUP[-] TOKEN[-] APP[-] JOB[000-180529071448012-oozie-root-C] ACTION[000-180529071448012-oozie-root-C@1] Input logic expression for [(dependencyBuilder.input("A").build() && dependencyBuilder.input("B").build())] and evaluate result is [PHASE_TWO_EVALUATION] 07:14:51,793 [CallableQueue-1] DEBUG CoordActionInputCheckXCommand:526 - USER[-] GROUP[-] TOKEN[-] APP[-] JOB[000-180529071448012-oozie-root-C] ACTION[000-180529071448012-oozie-root-C@1] [000-180529071448012-oozie-root-C@1]::ActionInputCheck:: Checking Latest/future ... 
07:16:57,191 [CallableQueue-1] DEBUG CoordInputLogicEvaluatorPhaseTwo:526 - USER[-] GROUP[-] TOKEN[-] APP[-] JOB[000-180529071448012-oozie-root-C] ACTION[000-180529071448012-oozie-root-C@1] Resolved status of Data set A with min -1 and wait -1 = false 07:16:57,191 [CallableQueue-1] DEBUG CoordInputLogicEvaluatorUtil:526 - USER[-] GROUP[-] TOKEN[-] APP[-] JOB[000-180529071448012-oozie-root-C] ACTION[000-180529071448012-oozie-root-C@1] Input logic expression for [(dependencyBuilder.input("A").build() && dependencyBuilder.input("B").build())] and evaluate result is [FALSE] ... 07:16:57,195 [CallableQueue-1] DEBUG CoordActionInputCheckXCommand:526 - USER[-] GROUP[-] TOKEN[-] APP[-] JOB[000-180529071448012-oozie-root-C] ACTION[000-180529071448012-oozie-root-C@1] Released lock for [000-180529071448012-oozie-root-C] in [coord_action_input] {noformat} > Flaky tests in TestCoordInputLogicPush class > > > Key: OOZIE-3271 > URL: https://issues.apache.org/jira/browse/OOZIE-3271 > Project: Oozie > Issue Type: Sub-task >Reporter: Peter Bacsko >Priority: Major > > Running locally, two tests in {{TestCoordInputLogicPush}} failed: > {noformat} > testLatestRange(org.apache.oozie.coord.input.logic.TestCoordInputLogicPush) > Time elapsed: 132.734 s <<< FAILURE! > junit.framework.AssertionFailedError: Action status should not be waiting > at junit.framework.Assert.fail(Assert.java:57) > at junit.framework.Assert.assertTrue(Assert.java:22) > at junit.framework.Assert.assertFalse(Assert.java:39) > at junit.framework.TestCase.assertFalse(TestCase.java:210) > at > org.apache.oozie.coord.input.logic.TestCoordInputLogicPush.startCoordAction(TestCoordInputLogicPush.java:570) > at > org.apache.oozie.coord.input.logic.TestCoordInputLogicPush.testLatestRange(TestCoordInputLogicPush.java:255) > {noformat} > {noformat} > testLatestRangeComplex(org.apache.oozie.coord.input.logic.TestCoordInputLogicPush) > Time elapsed: 159.055 s <<< FAILURE! 
> junit.framework.AssertionFailedError: Action status should not be waiting > at junit.framework.Assert.fail(Assert.java:57)
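The lock-holding behaviour described in the comment above (one long-running command keeps the lock, so every other command on the same job gets nothing done) can be illustrated with a small standalone sketch. This is not Oozie code: the class {{LockContentionSketch}}, the method {{tryCommandWhileLockHeld}} and the use of a plain {{ReentrantLock}} with a timed {{tryLock}} are assumptions made purely for illustration, and the real {{XCommand}} locking may work differently.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical illustration only; not part of Oozie. */
final class LockContentionSketch {

    /**
     * Starts a "long-running command" that holds the job lock, then lets a
     * second command try to take the same lock for at most waitMillis.
     * Returns whether the second command managed to acquire the lock.
     */
    static boolean tryCommandWhileLockHeld(long waitMillis) {
        ReentrantLock jobLock = new ReentrantLock();
        CountDownLatch lockTaken = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);

        Thread longRunning = new Thread(() -> {
            jobLock.lock();
            try {
                lockTaken.countDown();   // signal that the lock is now held
                release.await();         // simulate a command that does not finish
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                jobLock.unlock();
            }
        });
        longRunning.setDaemon(true);
        longRunning.start();

        try {
            lockTaken.await();
            // The second command gives up after waitMillis without the lock.
            boolean acquired = jobLock.tryLock(waitMillis, TimeUnit.MILLISECONDS);
            if (acquired) {
                jobLock.unlock();
            }
            return acquired;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        } finally {
            release.countDown();         // let the long-running command finish
        }
    }

    public static void main(String[] args) {
        System.out.println("second command acquired lock: " + tryCommandWhileLockHeld(50));
    }
}
```

In this sketch the second command always times out, because the first thread is guaranteed to hold the lock before the timed attempt begins. A test that polls for a status change while such a lock is held would see no progress until the holder releases it.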
[jira] [Commented] (OOZIE-3156) SSH action status turns OK wrongly when failed to connect to host
[ https://issues.apache.org/jira/browse/OOZIE-3156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16496518#comment-16496518 ] TIAN XING commented on OOZIE-3156: -- Hey [~andras.piros], thanks for the review. In {{TestSshActionExecutor#testSshCheckWithHostConnectFailure()}}, I copied the code from {{TestSshActionExecutor#testJobStart}}, which gives us an example that ends with OK status. In order to create an "SSH connection failure" situation, I changed the action's {{TrackerUri}} from "{{@localhost}}" to "{{dummy@dummyHost}}" during the action status check. An exception is expected to be thrown, whereas before this patch the check method executed normally and ended with OK status. Do you have any better suggestions on how to design such a test case? Thanks! > SSH action status turns OK wrongly when failed to connect to host > - > > Key: OOZIE-3156 > URL: https://issues.apache.org/jira/browse/OOZIE-3156 > Project: Oozie > Issue Type: Bug > Components: action >Affects Versions: 5.0.0 >Reporter: TIAN XING >Assignee: TIAN XING >Priority: Major > Attachments: OOZIE-3156-v1.patch, OOZIE-3156-v2.patch, > OOZIE-3156-v3.patch, ssh-check-bug.patch > > > When the {{check()}} method of {{SshActionExecutor}} gets invoked, Oozie will ssh > connect to the host and check whether the pid of the process that the ssh action > started is still there (by checking the returned value of command "{{ssh > ps -p }}" ) to determine whether the ssh action has completed or not. > However, we found cases where Oozie fails to connect to the host during the action > status check (e.g., the host is under heavy load, or the network is bad etc.). > In such cases, the return value of command "{{ssh ps -p }}" > will be 255 (the ssh command exits with the exit status of the remote command or > with 255 if an error occurred). 
> According to the current logic of method {{getActionStatus()}} in > {{SshActionExecutor}}, the action status will be determined as OK, which may > not be correct. 
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
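The exit-status distinction described in the issue above might be sketched as follows. This is only an illustration under stated assumptions, not the OOZIE-3156 patch: the class {{SshCheckSketch}}, the {{Outcome}} enum and the {{classify}} method are hypothetical names, and the sketch assumes that exit status 0 means the remote {{ps -p}} still found the pid, 255 means ssh itself failed, and any other value means the pid is gone.

```java
/** Hypothetical illustration only; not the actual SshActionExecutor logic. */
final class SshCheckSketch {

    enum Outcome { STILL_RUNNING, FINISHED, CONNECTION_ERROR }

    /** ssh exits with 255 when ssh itself hits an error, e.g. the host is unreachable. */
    static final int SSH_ERROR_EXIT_CODE = 255;

    /** Maps the exit status of the remote pid check to an outcome. */
    static Outcome classify(int exitCode) {
        if (exitCode == 0) {
            // Assumption: the remote ps found the pid, so the action is still running.
            return Outcome.STILL_RUNNING;
        }
        if (exitCode == SSH_ERROR_EXIT_CODE) {
            // The check itself failed, so nothing can be concluded about the action.
            return Outcome.CONNECTION_ERROR;
        }
        // Assumption: any other non-zero status means the pid no longer exists.
        return Outcome.FINISHED;
    }

    public static void main(String[] args) {
        for (int code : new int[] {0, 1, 255}) {
            System.out.println(code + " -> " + classify(code));
        }
    }
}
```

Keeping the connection failure as its own outcome lets the caller raise an error or retry the check later, rather than falling through to the branch that marks the action as completed.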
[jira] [Commented] (OOZIE-3260) [sla] Remove stale item above max retries on JPA related errors from in-memory SLA map
[ https://issues.apache.org/jira/browse/OOZIE-3260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16496564#comment-16496564 ] Andras Piros commented on OOZIE-3260: - [~gezapeti] can you please review? Thanks! > [sla] Remove stale item above max retries on JPA related errors from > in-memory SLA map > -- > > Key: OOZIE-3260 > URL: https://issues.apache.org/jira/browse/OOZIE-3260 > Project: Oozie > Issue Type: Bug > Components: coordinator, core, workflow >Affects Versions: 5.0.0 >Reporter: Andras Piros >Assignee: Andras Piros >Priority: Major > Attachments: OOZIE-3260.001.patch > > > Despite having implemented OOZIE-3134, there are still cases where > {{SLACalculatorMemory#slaMap}} and database contents still get out of sync. > Some possibilities including but not limited to: > * database contents of {{SLA_SUMMARY}} table have been purged manually from > DB > * no corresponding {{WF_JOBS}} or {{COORD_JOBS}} entries exist anymore in DB > * the {{WF_JOBS}} or {{COORD_JOBS}} instance that is being tracked by the > {{SLACalcStatus}} instances inside {{SLACalculatorMemory#slaMap}} is not yet > persisted to database when the SLA entry is already processed by > {{SLACalculatorMemory.HistoryPurgeWorker}}. Depending on e.g. 
how many > coordinator actions are being materialized, it can very well happen that > {{SLACalcStatus}} entries inserted to the in-memory map will be processed > before their corresponding {{CoordActionBean}} entries are yet to be > persisted to database > In those rare cases, we see {{JPAExecutorException}} instances like: > {noformat} > 2017-10-09 17:00:18,185 DEBUG openjpa.jdbc.SQL: SERVER[HOST] conn 1584126245> [0 ms] spent > 2017-10-09 17:00:18,185 ERROR org.apache.oozie.sla.SLACalculatorMemory: > SERVER[tplhc01c001.iuser.iroot.adidom.com] USER[-] GROUP[-] TOKEN[-] APP[-] > JOB[438-170916014916144-oozie-oozi-C@556] ACTION[-] Exception in SLA > processing for job [438-170916014916144-oozie-oozi-C@556] > org.apache.oozie.executor.jpa.JPAExecutorException: E0604: Job does not exist > [select w.eventProcessed from SLASummaryBean w where w.jobId = :id] > at > org.apache.oozie.executor.jpa.SLASummaryQueryExecutor.getSingleValue(SLASummaryQueryExecutor.java:161) > at > org.apache.oozie.sla.SLACalculatorMemory.updateJobSla(SLACalculatorMemory.java:480) > at > org.apache.oozie.sla.SLACalculatorMemory.updateAllSlaStatus(SLACalculatorMemory.java:601) > {noformat} > Solution here is to track the number of times the {{SLACalcStatus}} entry has > not been processed successfully, and when a preconfigured > {{oozie.sla.service.SLAService.maximum.retry.count}} is reached, remove any > {{SLACalculatorMemory#slaMap}} entries that are causing those > {{JPAExecutorException}} instances, to not cause huge logfiles. The items to > be logged don't exist, anyways. > It's still possible that multiple {{CoordActionBean}} instances being > inserted won't have {{SLACalcStatus}} entries inside > {{SLACalculatorMemory#slaMap}} by the time written to database, and thus, no > SLA will be tracked. In those rare cases, preconfigured maximum retry count > can be extended. 
> Note that current implementation of > [*{{SLACalculatorMemory#updateJobSla()}}*|https://github.com/apache/oozie/blob/master/core/src/main/java/org/apache/oozie/sla/SLACalculatorMemory.java#L238-L242] > already removes the stale {{SLACalcStatus}} entry. The new functionality > here is to introduce {{SLACalcStatus#retryCount}}, and extend the > {{JPAExecutorException}} {{ErrorCode}}s of interest. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
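The retry-count approach described in the issue above could be sketched roughly like this. It is a minimal standalone illustration, not the OOZIE-3260 patch: {{StaleEntryTracker}} and its methods are hypothetical names, the plain {{ConcurrentHashMap}} only stands in for {{SLACalculatorMemory#slaMap}}, the per-entry counter stands in for {{SLACalcStatus#retryCount}}, and {{maxRetryCount}} stands in for the value of {{oozie.sla.service.SLAService.maximum.retry.count}}.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for the in-memory SLA map; not Oozie's actual class. */
final class StaleEntryTracker {

    // jobId -> number of processing attempts that failed so far
    private final Map<String, AtomicInteger> retryCounts = new ConcurrentHashMap<>();
    private final int maxRetryCount;

    StaleEntryTracker(int maxRetryCount) {
        this.maxRetryCount = maxRetryCount;
    }

    /** Starts tracking a job id with a retry count of zero. */
    void track(String jobId) {
        retryCounts.putIfAbsent(jobId, new AtomicInteger());
    }

    /**
     * Records one failed processing attempt for the job.
     * Returns true if the entry reached the maximum and was removed.
     */
    boolean recordFailure(String jobId) {
        AtomicInteger count = retryCounts.get(jobId);
        if (count == null) {
            return false; // not tracked, nothing to remove
        }
        if (count.incrementAndGet() >= maxRetryCount) {
            retryCounts.remove(jobId); // stale entry: stop retrying and logging
            return true;
        }
        return false;
    }

    boolean isTracked(String jobId) {
        return retryCounts.containsKey(jobId);
    }

    public static void main(String[] args) {
        StaleEntryTracker tracker = new StaleEntryTracker(3);
        tracker.track("job@1");
        for (int attempt = 1; attempt <= 3; attempt++) {
            System.out.println("attempt " + attempt + " removed: " + tracker.recordFailure("job@1"));
        }
    }
}
```

With a maximum of 3, the entry survives the first two failures and is dropped on the third, after which no further exceptions would be logged for it. Raising the maximum gives entries that are merely not yet persisted more time before they are treated as stale.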
[jira] [Created] (OOZIE-3271) Flaky tests in TestCoordInputLogicPush class
Peter Bacsko created OOZIE-3271: --- Summary: Flaky tests in TestCoordInputLogicPush class Key: OOZIE-3271 URL: https://issues.apache.org/jira/browse/OOZIE-3271 Project: Oozie Issue Type: Sub-task Reporter: Peter Bacsko Running locally, two tests in {{TestCoordInputLogicPush}} failed: {noformat} testLatestRange(org.apache.oozie.coord.input.logic.TestCoordInputLogicPush) Time elapsed: 132.734 s <<< FAILURE! junit.framework.AssertionFailedError: Action status should not be waiting at junit.framework.Assert.fail(Assert.java:57) at junit.framework.Assert.assertTrue(Assert.java:22) at junit.framework.Assert.assertFalse(Assert.java:39) at junit.framework.TestCase.assertFalse(TestCase.java:210) at org.apache.oozie.coord.input.logic.TestCoordInputLogicPush.startCoordAction(TestCoordInputLogicPush.java:570) at org.apache.oozie.coord.input.logic.TestCoordInputLogicPush.testLatestRange(TestCoordInputLogicPush.java:255) {noformat} {noformat} testLatestRangeComplex(org.apache.oozie.coord.input.logic.TestCoordInputLogicPush) Time elapsed: 159.055 s <<< FAILURE! junit.framework.AssertionFailedError: Action status should not be waiting at junit.framework.Assert.fail(Assert.java:57) at junit.framework.Assert.assertTrue(Assert.java:22) at junit.framework.Assert.assertFalse(Assert.java:39) at junit.framework.TestCase.assertFalse(TestCase.java:210) at org.apache.oozie.coord.input.logic.TestCoordInputLogicPush.startCoordAction(TestCoordInputLogicPush.java:570) at org.apache.oozie.coord.input.logic.TestCoordInputLogicPush.testLatestRangeComplex(TestCoordInputLogicPush.java:330) {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (OOZIE-3015) Documentation: public part (TWiki)
[ https://issues.apache.org/jira/browse/OOZIE-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andras Piros updated OOZIE-3015: Summary: Documentation: public part (TWiki) (was: Documentation: public part (TWiki and KnowledgeBase)) > Documentation: public part (TWiki) > -- > > Key: OOZIE-3015 > URL: https://issues.apache.org/jira/browse/OOZIE-3015 > Project: Oozie > Issue Type: Sub-task > Components: client >Reporter: Daniel Becker >Assignee: Daniel Becker >Priority: Major > Fix For: 5.1.0 > > Original Estimate: 32h > Remaining Estimate: 32h > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (OOZIE-3270) Upgrade Derby to 10.14.1.0
[ https://issues.apache.org/jira/browse/OOZIE-3270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Cseh updated OOZIE-3270: -- Attachment: OOZIE-3270.01.patch > Upgrade Derby to 10.14.1.0 > --- > > Key: OOZIE-3270 > URL: https://issues.apache.org/jira/browse/OOZIE-3270 > Project: Oozie > Issue Type: Improvement >Reporter: Peter Cseh >Assignee: Peter Cseh >Priority: Major > Attachments: OOZIE-3270.01.patch > > > We should upgrade Derby to 10.14.1.0 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (OOZIE-3015) Documentation: public part (TWiki)
[ https://issues.apache.org/jira/browse/OOZIE-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andras Piros resolved OOZIE-3015. - Resolution: Fixed Fix Version/s: 5.1.0 > Documentation: public part (TWiki) > -- > > Key: OOZIE-3015 > URL: https://issues.apache.org/jira/browse/OOZIE-3015 > Project: Oozie > Issue Type: Sub-task > Components: client >Reporter: Daniel Becker >Assignee: Daniel Becker >Priority: Major > Fix For: 5.1.0 > > Original Estimate: 32h > Remaining Estimate: 32h > -- This message was sent by Atlassian JIRA (v7.6.3#76005)