[ https://issues.apache.org/jira/browse/HDFS-1742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13005327#comment-13005327 ]
Mikhail Yakshin commented on HDFS-1742: --------------------------------------- I seriously doubt that making pubsub-like event transmission as the *only* available option is the way to go. Pubsub model is a cool thing, but proper implementation of it requires full-blown messaging subsystem akin to ones that implement [JMS|http://en.wikipedia.org/wiki/Java_Message_Service], such as [ActiveMQ|http://activemq.apache.org/]. In turn, it means a whole other system, matching Hadoop by complexity (it includes demons, at least a JMS broker, and it requires non-trivial configuration and deployment), being installed and made mandatory by Hadoop. The only thing I try to argue about is making this thing *modular* - i.e. making JMS pubsub producer *an option*, but not the *only* option. Other options might be simple local file logging, sending them across the network, plugging some local workflow management system, etc, etc. > Provide hooks / callbacks to execute some code based on events happening in > HDFS (file / directory creation, opening, closing, etc) > ----------------------------------------------------------------------------------------------------------------------------------- > > Key: HDFS-1742 > URL: https://issues.apache.org/jira/browse/HDFS-1742 > Project: Hadoop HDFS > Issue Type: New Feature > Components: name-node > Reporter: Mikhail Yakshin > Labels: features, polling > > We're working on a system that runs various Hadoop job continuously, based on > the data that appears in HDFS: for example, we have a job that works on day's > worth of data and creates output in {{/output/YYYY/MM/DD}}. For input, it > should wait for directory with externally uploaded data as > {{/input/YYYY/MM/DD}} to appear, and also wait for previous day's data to > appear, i.e. {{/output/YYYY/MM/DD-1}}. > Obviously, one of the possible solutions is polling once in a while for > files/directories we're waiting for, but generally it's a bad solution. The > better one is something like file alteration monitor or [inode activity > notifiers|http://en.wikipedia.org/wiki/Inotify], such as ones implemented in > Linux filesystems. > Basic idea is that one can specify (inject) some code that will be executed > on every major event happening in HDFS, such as: > * File created / open > * File closed > * File deleted > * Directory created > * Directory deleted > I see simplistic implementation as following: NN defines some interfaces that > implement callback/hook mechanism - i.e. something like: > {code} > interface NameNodeCallback { > public void onFileCreate(SomeFileInformation f); > public void onFileClose(SomeFileInformation f); > public void onFileDelete(SomeFileInformation f); > ... > } > {code} > It might be possible to creates a class that implements this method and load > it somehow (for example, using an extra jar in classpath) in NameNode's JVM. > NameNode includes a configuration option that specifies names of such > class(es) - then NameNode instantiates them and calls methods from them (in a > separate thread) on every valid event happening. > There would be a couple of ready-made pluggable implementations of such a > class that would be most likely distributed as contrib. Default NameNode's > process would stay the same without any visible differences. > Hadoop's JobTracker already extensively uses the same paradigm with pluggable > Scheduler interfaces, such as [Fair > Scheduler|https://github.com/apache/hadoop/tree/trunk/src/contrib/fairscheduler], > [Capacity > Scheduler|https://github.com/apache/hadoop/tree/trunk/src/contrib/capacity-scheduler], > [Dynamic > Scheduler|https://github.com/apache/hadoop/tree/trunk/src/contrib/dynamic-scheduler], > etc. It also uses a class(es) that loads and runs inside JobTracker's > context, few relatively trustued varieties exist, they're distributed as > contrib and purely optional to be enabled by cluster admin. > This would allow systems such as I've described in the beginning to be > implemented without polling. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira