Mikhail Yakshin (JIRA) Thu, 10 Mar 2011 10:05:26 -0800

     [ https://issues.apache.org/jira/browse/HDFS-1742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Mikhail Yakshin updated HDFS-1742:
----------------------------------

    Description: 
We're working on a system that runs various Hadoop jobs continuously, based on 
the data that appears in HDFS: for example, we have a job that works on a day's 
worth of data and creates output in {{/output/YYYY/MM/DD}}. For input, it 
should wait for a directory with externally uploaded data, 
{{/input/YYYY/MM/DD}}, to appear, and also wait for the previous day's output 
to appear, i.e. {{/output/YYYY/MM/DD-1}}.
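
To make the dependency rule above concrete, here is a minimal sketch of how the two prerequisite paths for a given day could be derived. The class {{DailyJobPaths}} and its methods are hypothetical names used only for illustration; they are not part of Hadoop. It reads {{DD-1}} as "the previous calendar day", which is an assumption about the intended meaning.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.List;

// Hypothetical helper (not a Hadoop class): derives the paths a daily job waits for.
final class DailyJobPaths {
    // Matches the YYYY/MM/DD layout used in the description.
    private static final DateTimeFormatter DAY = DateTimeFormatter.ofPattern("yyyy/MM/dd");

    static String input(LocalDate day) {
        return "/input/" + day.format(DAY);
    }

    static String output(LocalDate day) {
        return "/output/" + day.format(DAY);
    }

    // Paths that must exist before the job for `day` may start:
    // that day's uploaded input and the previous calendar day's output.
    static List<String> prerequisites(LocalDate day) {
        return List.of(input(day), output(day.minusDays(1)));
    }
}
```

Using {{LocalDate.minusDays}} rather than decrementing the day number keeps month and year boundaries correct, e.g. the job for 1 March depends on the output of the last day of February.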

Obviously, one possible solution is to poll once in a while for the 
files/directories we're waiting for, but that is generally a bad solution. A 
better one is something like a file alteration monitor or [inode activity 
notifiers|http://en.wikipedia.org/wiki/Inotify], such as the ones implemented in 
Linux filesystems.

The basic idea is that one can specify (inject) some code that will be executed 
on every major event happening in HDFS, such as:
* File created / opened
* File closed
* File deleted
* Directory created
* Directory deleted

I see a simplistic implementation as follows: the NameNode (NN) defines some 
interfaces that provide a callback/hook mechanism, i.e. something like:

{code}
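interface NameNodeCallback {
    // Each method is invoked by the NameNode when the corresponding event happens.
    public void onFileCreate(SomeFileInformation f);
    public void onFileClose(SomeFileInformation f);
    public void onFileDelete(SomeFileInformation f);
    // ... likewise for the remaining events (e.g. directory created / deleted)
}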
{code}

It might be possible to create a class that implements this interface and load 
it somehow (for example, using an extra jar in the classpath) into the 
NameNode's JVM. The NameNode would include a configuration option that 
specifies the names of such class(es); the NameNode then instantiates them and 
calls their methods (in a separate thread) on every valid event.
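
The loading and dispatch steps described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not NameNode code: {{PathEventCallback}}, {{CallbackDispatcher}} and {{RecordingCallback}} are hypothetical names, a plain {{String}} path stands in for {{SomeFileInformation}}, and the comma-separated class list stands in for the value of a configuration option whose name is not defined here.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Hypothetical callback interface; a String path stands in for SomeFileInformation.
interface PathEventCallback {
    void onFileCreate(String path);
    void onFileClose(String path);
    void onFileDelete(String path);
}

// Hypothetical dispatcher: instantiates callbacks by class name and runs them
// on one background thread, so the code raising the event never waits on a hook.
final class CallbackDispatcher {
    private final List<PathEventCallback> callbacks = new ArrayList<>();
    private final ExecutorService worker = Executors.newSingleThreadExecutor();

    // classNames: comma-separated list, as it might be read from configuration.
    CallbackDispatcher(String classNames) throws ReflectiveOperationException {
        for (String name : classNames.split(",")) {
            if (name.trim().isEmpty()) {
                continue; // tolerate empty entries
            }
            Class<?> cls = Class.forName(name.trim());
            callbacks.add((PathEventCallback) cls.getDeclaredConstructor().newInstance());
        }
    }

    void fileCreated(String path) { dispatch(cb -> cb.onFileCreate(path)); }
    void fileClosed(String path)  { dispatch(cb -> cb.onFileClose(path)); }
    void fileDeleted(String path) { dispatch(cb -> cb.onFileDelete(path)); }

    // Queues one task per callback; a failing hook is logged and does not
    // prevent the other hooks from running.
    private void dispatch(Consumer<PathEventCallback> event) {
        for (PathEventCallback cb : callbacks) {
            worker.submit(() -> {
                try {
                    event.accept(cb);
                } catch (RuntimeException e) {
                    System.err.println("callback failed: " + e);
                }
            });
        }
    }

    // Stops accepting events and waits for queued callbacks to finish.
    boolean shutdown(long timeoutMillis) throws InterruptedException {
        worker.shutdown();
        return worker.awaitTermination(timeoutMillis, TimeUnit.MILLISECONDS);
    }
}

// Example hook: remembers every closed path, e.g. to trigger a waiting job.
final class RecordingCallback implements PathEventCallback {
    static final List<String> CLOSED = Collections.synchronizedList(new ArrayList<>());

    public void onFileCreate(String path) { }
    public void onFileClose(String path) { CLOSED.add(path); }
    public void onFileDelete(String path) { }
}
```

A single worker thread preserves event order per dispatcher. Whether ordering, back-pressure or failure isolation should work this way in a real NameNode is an open design question that this sketch does not settle.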

There would be a couple of ready-made pluggable implementations of such a class 
that would most likely be distributed as contrib. The default NameNode process 
would stay the same, without any visible differences.

Hadoop's JobTracker already uses the same paradigm extensively with its 
pluggable Scheduler interfaces, such as [Fair 
Scheduler|https://github.com/apache/hadoop/tree/trunk/src/contrib/fairscheduler],
 [Capacity 
Scheduler|https://github.com/apache/hadoop/tree/trunk/src/contrib/capacity-scheduler],
 [Dynamic 
Scheduler|https://github.com/apache/hadoop/tree/trunk/src/contrib/dynamic-scheduler],
 etc. These are also classes that are loaded and run inside the JobTracker's 
context. A few relatively trusted varieties exist; they are distributed as 
contrib and are purely optional, to be enabled by the cluster admin.

This would allow systems such as the one I described at the beginning to be 
implemented without polling.

  was:
We're working on a system that runs various Hadoop job continuously, based on 
the data that appears in HDFS: for example, we have a job that works on day's 
worth of data and creates output in {{/output/YYYY/MM/DD}}. For input, it 
should wait for directory with externally uploaded data as 
{{/input/YYYY/MM/DD}} to appear, and also wait for previous day's data to 
appear, i.e. {{/output/YYYY/MM/DD-1}}.

Obviously, one of the possible solutions is polling once in a while for 
files/directories we're waiting for, but generally it's a bad solution. The 
better one is something like file alteration monitor or [inode activity 
notifiers|http://en.wikipedia.org/wiki/Inotify], such as ones implemented in 
Linux filesystems.

Basic idea is that one can specify (inject) some code that will be executed on 
every major event happening in HDFS, such as:
* File created / open
* File closed
* File deleted
* Directory created
* Directory deleted

I see simplistic implementation as following: NN defines some interfaces that 
implement callback/hook mechanism - i.e. something like:

{code}
interface NameNodeCallback {
    public void onFileCreate(SomeFileInformation f);
    public void onFileClose(SomeFileInformation f);
    public void onFileDelete(SomeFileInformation f);
    ...
}
{code}

It might be possible to creates a class that implements this method and load it 
somehow (for example, using an extra jar in classpath) in NameNode's JVM. 
NameNode includes a configuration option that specifies names of such class(es) 
- then NameNode instantiates them and calls methods from them (in a separate 
thread) on every valid event happening.

There would be a couple of ready-made pluggable implementations of such a class 
that would be most likely distributed as contrib. Default NameNode's process 
would stay the same without any visible differences.

Hadoop's JobTracker already extensively uses the same paradigm with pluggable 
Scheduler interfaces, such as [Fair 
Scheduler|https://github.com/apache/hadoop/tree/trunk/src/contrib/fairscheduler],
 [Capacity 
Scheduler|https://github.com/apache/hadoop/tree/trunk/src/contrib/capacity-scheduler],
 [Dynamic 
Scheduler|https://github.com/apache/hadoop/tree/trunk/src/contrib/dynamic-scheduler],
 etc. It also uses a class(es) that loads and runs inside JobTracker's context 

This would allow systems such as I've described in the beginning to be 
implemented without polling.


> Provide hooks / callbacks to execute some code based on events happening in 
> HDFS (file / directory creation, opening, closing, etc)
> -----------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-1742
>                 URL: https://issues.apache.org/jira/browse/HDFS-1742
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: name-node
>            Reporter: Mikhail Yakshin
>              Labels: features, polling

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
