[
https://issues.apache.org/jira/browse/OOZIE-1976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14804733#comment-14804733
]
Purshotam Shah commented on OOZIE-1976:
---------------------------------------
I uploaded a patch to RB. Some refactoring and naming changes are still
pending, but the patch contains the core logic.
There are three components in this patch:
1. User interface
A new tag, input-check, is added to coordinator.xml.
Example:
<input-check>
  <or name="test">
    <and>
      <data-in dataset="A"/>
      <data-in dataset="B"/>
    </and>
    <and>
      <data-in dataset="C"/>
      <data-in dataset="D"/>
    </and>
  </or>
</input-check>
input-check can contain nested and/or/combine operations. min and wait can be
specified on an operator or on an individual data-in.
If the input-check tag is missing, the old approach applies, where all data
dependencies are required.
2. Processing
input-check is converted into a logical expression, e.g.
(A && B) || (C && D)
We use JEXL to parse the logical expression.
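To make the conversion concrete, here is a minimal sketch in Python of how the
example input-check could be flattened into that expression string. It is not
the patch's code (the patch is Java and uses JEXL). The function names
to_expression and input_check_to_expression are hypothetical, and the sketch
assumes only and, or and data-in elements; combine, min and wait are not
handled.

```python
import xml.etree.ElementTree as ET

# The example from the comment above, with well-formed closing tags.
INPUT_CHECK = """
<input-check>
  <or name="test">
    <and>
      <data-in dataset="A"/>
      <data-in dataset="B"/>
    </and>
    <and>
      <data-in dataset="C"/>
      <data-in dataset="D"/>
    </and>
  </or>
</input-check>
"""

# Assumption: only these two operators are mapped; combine is left out.
OPERATORS = {"and": " && ", "or": " || "}


def to_expression(node, top_level=True):
    """Recursively turn an and/or/data-in element into an expression string."""
    if node.tag == "data-in":
        return node.attrib["dataset"]
    if node.tag not in OPERATORS:
        raise ValueError("unsupported element: " + node.tag)
    parts = [to_expression(child, top_level=False) for child in node]
    joined = OPERATORS[node.tag].join(parts)
    # Parenthesize nested groups so operator precedence stays explicit.
    if top_level or len(parts) == 1:
        return joined
    return "(" + joined + ")"


def input_check_to_expression(xml_text):
    """Parse an input-check document and return its logical expression."""
    root = ET.fromstring(xml_text)
    if root.tag != "input-check" or len(root) != 1:
        raise ValueError("expected <input-check> with one top-level operator")
    return to_expression(root[0])


print(input_check_to_expression(INPUT_CHECK))  # (A && B) || (C && D)
```

The resulting string is the kind of input an expression engine such as JEXL
would then evaluate, with each dataset name bound to its availability.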
There are three phases in parsing.
phase 1 : Only resolved datasets are parsed (current only).
phase 2 : Once all current instances are resolved, future/latest are parsed.
phase 3 : Doesn't do any file check; it just returns what was parsed by phase 1
and phase 2. It is used for EL functions.
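One way to read the three phases is sketched below in Python. This is an
interpretation, not the patch's logic: the Phase enum, the evaluate function,
the tuple-based expression tree, and the "current"/"latest" labels are all
hypothetical, and a plain recursive evaluator stands in for JEXL. The caller is
assumed to move to phase 2 only after the current instances are resolved.

```python
from enum import Enum


class Phase(Enum):
    CURRENT_ONLY = 1    # phase 1: check only current instances
    FUTURE_LATEST = 2   # phase 2: also check future/latest instances
    CACHED_ONLY = 3     # phase 3: no file check, reuse earlier results


def evaluate(node, phase, kinds, exists, cache):
    """Evaluate a nested ("and"|"or"|"data", ...) tuple for the given phase.

    kinds maps dataset name -> "current" or "latest" (assumed labels).
    exists is the file-check callback; cache holds results of earlier checks.
    """
    op = node[0]
    if op == "data":
        name = node[1]
        if phase is Phase.CACHED_ONLY:
            return cache.get(name, False)
        if phase is Phase.CURRENT_ONLY and kinds[name] != "current":
            return cache.get(name, False)
        # Re-check only datasets that have not been found yet.
        if not cache.get(name, False):
            cache[name] = exists(name)
        return cache[name]
    # Evaluate every child (no short-circuit) so the cache records each result.
    results = [evaluate(c, phase, kinds, exists, cache) for c in node[1:]]
    return all(results) if op == "and" else any(results)


# (A && B) || (C && D), with B assumed to be a latest() instance.
TREE = ("or",
        ("and", ("data", "A"), ("data", "B")),
        ("and", ("data", "C"), ("data", "D")))
KINDS = {"A": "current", "B": "latest", "C": "current", "D": "current"}
AVAILABLE = {"A", "B", "C"}  # made-up availability for the demo

cache = {}
print(evaluate(TREE, Phase.CURRENT_ONLY, KINDS, AVAILABLE.__contains__, cache))
print(evaluate(TREE, Phase.FUTURE_LATEST, KINDS, AVAILABLE.__contains__, cache))
# Phase 3 never calls the file check, so no callback is needed.
print(evaluate(TREE, Phase.CACHED_ONLY, KINDS, None, cache))
```

In this sketch phase 1 reports the expression as unsatisfied because B is
skipped, phase 2 satisfies it once B is checked, and phase 3 returns the same
answer purely from the cached results.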
3. Storage
If input-check is enabled, push_missing_dependencies and missing_dependencies
are serialized and stored in the DB.
If not, the old approach is used, where they are stored in plain text. This is
backward compatible.
> Specifying coordinator input datasets in more logical ways
> ----------------------------------------------------------
>
> Key: OOZIE-1976
> URL: https://issues.apache.org/jira/browse/OOZIE-1976
> Project: Oozie
> Issue Type: New Feature
> Components: coordinator
> Affects Versions: trunk
> Reporter: Mona Chitnis
> Assignee: Purshotam Shah
> Fix For: trunk
>
> Attachments: Input-check.docx, OOZIE-1976-WIP.patch,
> OOZIE-1976-rough-design-2.pdf, OOZIE-1976-rough-design.pdf
>
>
> All dataset instances specified as input to coordinator, currently work on
> AND logic i.e. ALL of them should be available for workflow to start. We
> should enhance this to include more logical ways of specifying availability
> criteria e.g.
> * OR between instances
> * minimum N out of K instances
> * delta datasets (process data incrementally)
> Use-cases for this:
> * Different datasets are BCP, and workflow can run with either, whichever
> arrives earlier.
> * Data is not guaranteed, and while $coord:latest allows skipping to
> available ones, workflow will never trigger unless mentioned number of
> instances are found.
> * Workflow is like a ‘refining’ algorithm which should run after minimum
> required datasets are ready, and should only process the delta for efficiency.
> This JIRA is to discuss the design and then review the implementation for
> some or all of the above features.