[jira] [Commented] (OOZIE-1976) Specifying coordinator input datasets in more logical ways

Purshotam Shah (JIRA) Thu, 17 Sep 2015 17:20:06 -0700

    [ https://issues.apache.org/jira/browse/OOZIE-1976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14804733#comment-14804733 ]


Purshotam Shah commented on OOZIE-1976:
---------------------------------------

Uploaded the patch to RB. It contains the core logic; some refactoring and 
naming changes are still pending.

There are three components in this patch:

1. User interface
A new tag, input-check, is added to coordinator.xml.
For example:
<input-check>
    <or name="test">
        <and>
            <data-in dataset="A"/>
            <data-in dataset="B"/>
        </and>
        <and>
            <data-in dataset="C"/>
            <data-in dataset="D"/>
        </and>
    </or>
</input-check>


input-check can contain nested and/or/combine operations. min and wait can be 
set on an operator or on a data-in.
If the input-check tag is missing, the coordinator falls back to the old 
approach, where all data dependencies are required.

2. Processing
input-check is converted into a logical expression:
        (A && B) || (C && D)
We use JEXL to parse the logical expression.
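
As a rough illustration of that conversion (not the code in the patch, which 
is Java and uses JEXL), the Python sketch below walks the example input-check 
tree, renders it as an infix expression, and evaluates it against a set of 
available datasets. The function names to_expression and evaluate are 
hypothetical, and combine, min and wait are left out.

```python
import xml.etree.ElementTree as ET

# The example input-check block from the user interface section.
INPUT_CHECK = """
<input-check>
    <or name="test">
        <and>
            <data-in dataset="A"/>
            <data-in dataset="B"/>
        </and>
        <and>
            <data-in dataset="C"/>
            <data-in dataset="D"/>
        </and>
    </or>
</input-check>
"""

# Only and/or are handled here; combine, min and wait are omitted.
OPERATORS = {"and": " && ", "or": " || "}


def to_expression(node, top=True):
    """Render an and/or/data-in subtree as an infix logical expression."""
    if node.tag == "data-in":
        return node.attrib["dataset"]
    parts = [to_expression(child, top=False) for child in node]
    joined = OPERATORS[node.tag].join(parts)
    # Nested operators are parenthesised; the outermost one is not.
    return joined if top else "(" + joined + ")"


def evaluate(node, available):
    """Evaluate the same subtree against a set of available dataset names."""
    if node.tag == "data-in":
        return node.attrib["dataset"] in available
    results = [evaluate(child, available) for child in node]
    return all(results) if node.tag == "and" else any(results)


# The single operator element directly under <input-check>.
root = ET.fromstring(INPUT_CHECK)[0]
expression = to_expression(root)
```

With this input, expression is "(A && B) || (C && D)", and 
evaluate(root, {"C", "D"}) is true while evaluate(root, {"A", "C"}) is false.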

Parsing happens in three phases:
phase 1 : Only resolved datasets are parsed (only current).
phase 2 : Once all current datasets are resolved, future/latest are parsed.
phase 3 : Doesn't do any file check; it just returns what was parsed by 
phase 1 and phase 2. It is used for EL functions.
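
One possible reading of the three phases, as a Python sketch. It assumes that 
"resolved" in phase 2 means every current instance has been found, and it 
ignores the and/or logic for brevity. The function names, the exists callback 
and the paths are all hypothetical and not taken from the patch.

```python
def phase1(current_uris, exists):
    """Phase 1: check only the already-resolved current instances."""
    return {uri for uri in current_uris if exists(uri)}


def phase2(current_uris, found_current, later_uris, exists):
    """Phase 2: check future/latest only once every current instance is found."""
    if found_current != set(current_uris):
        return set()
    return {uri for uri in later_uris if exists(uri)}


def phase3(found_current, found_later):
    """Phase 3: no file check; report what phases 1 and 2 already found."""
    return found_current | found_later


# Hypothetical file system: only these paths exist.
present = {"/data/A/2015-09-17", "/data/B/2015-09-17", "/data/C/latest"}
exists = present.__contains__

current = ["/data/A/2015-09-17", "/data/B/2015-09-17"]
later = ["/data/C/latest", "/data/D/latest"]

found_current = phase1(current, exists)
found_later = phase2(current, found_current, later, exists)
resolved = phase3(found_current, found_later)
```

Here phase 3 only merges earlier results, which matches the note that it does 
no file check and exists to serve EL functions.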


3. Storage
If input-check is enabled, push_missing_dependencies and missing_dependencies 
are serialized and stored in the DB.
If not, the old approach is used, where they are stored in plain text. This is 
backward compatible.

> Specifying coordinator input datasets in more logical ways
> ----------------------------------------------------------
>
>                 Key: OOZIE-1976
>                 URL: https://issues.apache.org/jira/browse/OOZIE-1976
>             Project: Oozie
>          Issue Type: New Feature
>          Components: coordinator
>    Affects Versions: trunk
>            Reporter: Mona Chitnis
>            Assignee: Purshotam Shah
>             Fix For: trunk
>
>         Attachments: Input-check.docx, OOZIE-1976-WIP.patch, 
> OOZIE-1976-rough-design-2.pdf, OOZIE-1976-rough-design.pdf
>
>
> Currently, all dataset instances specified as input to a coordinator use 
> AND logic, i.e. ALL of them must be available for the workflow to start. We 
> should enhance this to include more logical ways of specifying availability 
> criteria, e.g.
>  * OR between instances
>  * minimum N out of K instances
>  * delta datasets (process data incrementally)
> Use-cases for this:
>  * Different datasets are BCP, and workflow can run with either, whichever 
> arrives earlier.
>  * Data is not guaranteed, and while $coord:latest allows skipping to 
> available ones, workflow will never trigger unless mentioned number of 
> instances are found.
>  * Workflow is like a ‘refining’ algorithm which should run after minimum 
> required datasets are ready, and should only process the delta for efficiency.
> This JIRA is to discuss the design and then review the implementation for 
> some or all of the above features.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
