[
https://issues.apache.org/jira/browse/OOZIE-2216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14518199#comment-14518199
]
Robert Kanter commented on OOZIE-2216:
--------------------------------------
[~jaydeepvishwakarma], having this feature would be really great, and something
that's been on the "To Do" list for a long time. That said, it's not an easy
task (which is why it hasn't been done yet :)). The design document looks okay
at a high level, but I'm not sure I understand how Oozie will "know" when the
next data arrives. Will Oozie be doing very frequent polling of HDFS to check
for new data? If so, I'm not sure the NameNode (NN) is going to like that. One thing
that might be helpful is HDFS's iNotify feature, which, from my understanding,
will send out a notification so we don't have to poll HDFS. I had actually
created OOZIE-2179 for doing that regardless. Perhaps you can take advantage
of iNotify for checking for new data?
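
To make the polling-versus-notification point concrete, here is a minimal sketch of the notification-driven side. It does not use any real HDFS or Oozie API: the event stream is simulated with an in-memory list, and FileClosedEvent, NewDataTracker, and the /data/clicks/ path are hypothetical names chosen for illustration. The idea is that the scheduler only reacts to events pushed to it, so it never has to list directories on the NameNode.

```python
from dataclasses import dataclass, field


@dataclass
class FileClosedEvent:
    """Hypothetical stand-in for a 'file finished writing' notification."""
    path: str


@dataclass
class NewDataTracker:
    """Collects newly arrived paths for one dataset from pushed events."""
    dataset_prefix: str
    pending: list = field(default_factory=list)

    def on_event(self, event):
        # Record only paths under the watched dataset directory.
        if event.path.startswith(self.dataset_prefix):
            self.pending.append(event.path)

    def drain(self):
        # Hand over everything seen so far and reset the buffer.
        ready, self.pending = self.pending, []
        return ready


# Simulated notification stream (assumption: a real system would push these).
simulated_events = [
    FileClosedEvent("/data/clicks/part-0001"),
    FileClosedEvent("/tmp/scratch/file"),
    FileClosedEvent("/data/clicks/part-0002"),
]

tracker = NewDataTracker(dataset_prefix="/data/clicks/")
for ev in simulated_events:
    tracker.on_event(ev)

new_paths = tracker.drain()
print(new_paths)
```

With this shape, the cost of discovering new data scales with the number of events rather than with how often the dataset directory is scanned.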
> Aperiodic Data handling in oozie
> --------------------------------
>
> Key: OOZIE-2216
> URL: https://issues.apache.org/jira/browse/OOZIE-2216
> Project: Oozie
> Issue Type: New Feature
> Components: coordinator
> Reporter: Jaydeep Vishwakarma
> Assignee: Jaydeep Vishwakarma
> Attachments: Oozie_aperiodic_data_handling.pdf
>
>
> Currently, Oozie scheduling works only on periodic datasets. It has no
> mechanism to handle aperiodic datasets, which do not follow a fixed
> schedule/frequency.
> Use cases:
> * The incoming dataset arrives with no fixed schedule.
> * The job needs to be triggered on all data available since the last run, with
>   a possible cap on the maximum size to process in one run.
> * Avoid creating many instances when the input instances are known to be
>   very few.
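
The second use case (process everything that arrived since the last run, with an optional size cap) can be sketched as below. This is only an illustration under stated assumptions, not the design from the attached document: DataInstance, select_batch, and the rule that a run always takes at least one instance (so a single oversized instance cannot stall progress) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataInstance:
    path: str
    arrival_time: int  # e.g. epoch seconds (assumed unit)
    size_bytes: int


def select_batch(instances, last_run_time, max_bytes=None):
    """Pick instances that arrived after last_run_time, oldest first,
    stopping before the running total would exceed max_bytes."""
    new = sorted(
        (i for i in instances if i.arrival_time > last_run_time),
        key=lambda i: i.arrival_time,
    )
    batch, total = [], 0
    for inst in new:
        over_cap = max_bytes is not None and total + inst.size_bytes > max_bytes
        # Assumption: always accept the first instance, even if it alone
        # exceeds the cap, so the job keeps making progress.
        if over_cap and batch:
            break
        batch.append(inst)
        total += inst.size_bytes
    return batch


instances = [
    DataInstance("/data/in/old", arrival_time=5, size_bytes=10),
    DataInstance("/data/in/a", arrival_time=10, size_bytes=40),
    DataInstance("/data/in/b", arrival_time=20, size_bytes=50),
    DataInstance("/data/in/c", arrival_time=30, size_bytes=30),
]

first_run = select_batch(instances, last_run_time=5, max_bytes=100)
# The next run resumes from the newest instance already processed.
second_run = select_batch(
    instances, last_run_time=first_run[-1].arrival_time, max_bytes=100
)
print([i.path for i in first_run], [i.path for i in second_run])
```

Each run produces a single batch regardless of how many instances arrived, which also speaks to the third use case of not materializing many instances when inputs are sparse.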