[ http://issues.apache.org/jira/browse/NUTCH-368?page=comments#action_12435710 ]

Andrzej Bialecki commented on NUTCH-368:
-----------------------------------------

> IMO a place for stuff like this is in hadoop more than nutch and i would like
> to see this implemented there.

Agreed. I needed this to support certain Nutch extensions (e.g. gracefully stopping long-running jobs, adjusting bandwidth throttling on a running fetcher, etc.), and I didn't want to wait until Nutch catches up with that version of Hadoop (if it were ever accepted there).

> Also have you considered using something readily available instead of
> implementing (well that part is done already :) and taking the burden of
> maintaining it.

I'd gladly do so, but I couldn't find anything like that which was not also a JMS-compliant stack (with one exception, which was GPL-ed). I didn't want to bring in the whole weight and complexity of J2EE, and I didn't want to require a separate database for persistence (yet another point of failure). This API uses the persistence, redundancy, scalability and communication mechanisms of Hadoop, so I'm getting the most complex parts of JMS for free ;) and the rest is relatively simple.

> Message queueing system
> -----------------------
>
>          Key: NUTCH-368
>          URL: http://issues.apache.org/jira/browse/NUTCH-368
>      Project: Nutch
>   Issue Type: New Feature
> Affects Versions: 0.9.0
>     Reporter: Andrzej Bialecki
>  Assigned To: Andrzej Bialecki
>  Attachments: msg.tgz
>
>
> This is an implementation of a filesystem-based message queueing system. The
> motivation for this functionality is explained in HADOOP-490 - there is
> nothing Nutch-specific in this implementation, so if it's considered
> generally useful it could be moved there.
> Below are excerpts from the included javadocs.
> The model of the system is as follows:
> * applications (including map-reduce jobs) may create their own separate
>   message queueing area. Alternatively, they can specifically ask for a named
>   message queue, belonging to a different application or existing as a
>   system-wide queue.
>   Message queues are created under "/mq", followed by the message queue id
>   (for map-reduce jobs this is the job id, or it can be any other name
>   passed as the job id to the constructor).
>   Please see the example for more information.
> * a single unit of information passing through queues is a Msg, which has
>   a unique identifier (consisting of creation time and publisher name), a
>   string subject, and content (Writable).
> * a single MsgQueue in fact consists of any number of topics. There are
>   four predefined ones: in, out, err, and ctrl.
> * messages are published to topics, which present a sequential view of
>   messages, sorted by msgId (which corresponds to their order of arrival).
> * each message queue may periodically poll for changes
>   (MsgQueue.startPolling()), using a separate thread. Polling updates the
>   list of topics and messages. The poll interval is configurable, and
>   defaults to 5 sec.
> * each detected change in the queue (add/remove topic, add/remove
>   message) may be communicated to registered listeners. Out-of-band messages
>   are not supported in this version, but it's not too complicated to add
>   them. Applications can create listeners watching queues for newly added
>   messages, deleted messages, added topics, deleted topics, etc.
> * each instance of MsgQueue using the same physical queue maintains its
>   own view of the queue, keeping track of topics and messages that it
>   considers "processed and discarded". In other words, multiple readers and
>   creators may modify queues, and each knows which messages it has already
>   processed and which ones are new. In a similar fashion, instances may
>   willfully "remove" certain topics from their view, even though these
>   topics still physically exist and are available to other instances (and
>   later on they can "add" them to their view again).
>   This somewhat complicated feature was implemented in order to support
>   multiple readers for the same message (e.g. many tasks per one mapred job).
>   Each task needs to register for the same queue, and if they didn't have
>   their own views of the queue, messages would be consumed by the first task
>   that got to them. As it is implemented now, each task may consume messages
>   at its own pace. At the end of the job, applications may elect to keep the
>   queue around or to destroy it (and thus remove all topics and messages in
>   it).
> * messages, topics and queues may be destroyed by any user, at which
>   point they are physically removed from the filesystem. All users will
>   gradually update their views during their next poll operation.
> * there is a command-line tool to examine and modify queues, and also to
>   retrieve and send simple text messages. You can run it like this:
>     bin/nutch org.apache.nutch.util.msg.MsgQueueTool ...many options...

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

_______________________________________________
Nutch-developers mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-developers
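
The per-instance views described in the issue text (each reader tracks what it has already processed, so several tasks can consume the same message at their own pace) could be sketched roughly as follows. This is only an illustration of the idea, not the code in msg.tgz: the names QueueViewSketch, SharedTopic and TopicView are hypothetical, an in-memory sorted map stands in for a topic directory on the Hadoop filesystem, and the msgId layout (zero-padded creation time plus publisher name) is an assumption chosen so that sorted order matches arrival order.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentSkipListMap;

/** Hypothetical sketch of per-reader queue views; not the API from msg.tgz. */
class QueueViewSketch {

    /** In-memory stand-in for one topic shared by all readers. */
    static class SharedTopic {
        // Keys model msgId = creation time + publisher name (assumed layout);
        // the sorted map keeps messages in msgId order.
        final ConcurrentSkipListMap<String, String> messages =
                new ConcurrentSkipListMap<>();

        void publish(long creationTime, String publisher, String content) {
            String msgId = String.format("%020d-%s", creationTime, publisher);
            messages.put(msgId, content);
        }

        /** Physical removal, visible to every reader on its next poll. */
        void destroy(String msgId) {
            messages.remove(msgId);
        }
    }

    /** One reader's private view of a shared topic. */
    static class TopicView {
        private final SharedTopic topic;
        // Ids this reader considers "processed and discarded".
        private final Set<String> processed = new HashSet<>();

        TopicView(SharedTopic topic) {
            this.topic = topic;
        }

        /** One poll cycle: return unseen messages in msgId order and mark them processed. */
        List<String> poll() {
            // Forget ids of messages that were physically destroyed since the last poll.
            processed.retainAll(topic.messages.keySet());
            List<String> fresh = new ArrayList<>();
            for (Map.Entry<String, String> e : topic.messages.entrySet()) {
                if (processed.add(e.getKey())) {
                    fresh.add(e.getValue());
                }
            }
            return fresh;
        }
    }

    public static void main(String[] args) {
        SharedTopic ctrl = new SharedTopic();
        TopicView task1 = new TopicView(ctrl);
        TopicView task2 = new TopicView(ctrl);

        ctrl.publish(1L, "admin", "throttle=50");
        System.out.println(task1.poll()); // [throttle=50]

        ctrl.publish(2L, "admin", "stop");
        System.out.println(task1.poll()); // [stop]
        System.out.println(task2.poll()); // [throttle=50, stop]
    }
}
```

Because the processed set lives in the view rather than in the shared topic, a slow task still sees every message that a faster task has already handled, which is the behaviour the description asks for. A real implementation would also need the periodic polling thread and listener callbacks, which are left out here.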
