[ http://issues.apache.org/jira/browse/NUTCH-368?page=comments#action_12435555 ]

Sami Siren commented on NUTCH-368:
----------------------------------

IMO this belongs in Hadoop more than in Nutch, and I would like to see it implemented there, mainly because I see it as part of the distributed architecture (which Hadoop provides) rather than as search-engine-specific functionality (which Nutch provides).

Also, have you considered using something readily available instead of implementing your own (well, that part is already done :) and taking on the burden of maintaining it?

> Message queueing system
> -----------------------
>
>                 Key: NUTCH-368
>                 URL: http://issues.apache.org/jira/browse/NUTCH-368
>             Project: Nutch
>          Issue Type: New Feature
>    Affects Versions: 0.9.0
>            Reporter: Andrzej Bialecki
>         Assigned To: Andrzej Bialecki
>         Attachments: msg.tgz
>
>
> This is an implementation of a filesystem-based message queueing system. The
> motivation for this functionality is explained in HADOOP-490 - there is
> nothing Nutch-specific in this implementation, so if it's considered
> generally useful it could be moved there.
> Below are excerpts from the included javadocs.
> The model of the system is as follows:
> * applications (including map-reduce jobs) may create their own separate
>   message queueing area. Alternatively, they can specifically ask for a named
>   message queue, belonging to a different application or existing as a
>   system-wide queue. Message queues are created under "/mq", followed by the
>   message queue id (for map-reduce jobs this is a job id, or it can be any
>   other name passed as job id to the constructor).
>   Please see the example for more information.
> * a single unit of information passing through queues is a Msg, which has
>   a unique identifier (consisting of creation time and publisher name), a
>   string subject, and content (Writable).
> * a single MsgQueue in fact consists of any number of topics. There are
>   four predefined ones: in, out, err, and ctrl.
> * messages are published to topics, which present a sequential view of
>   messages, sorted by msgId (which corresponds to their order of arrival).
> * each message queue may periodically poll for changes
>   (MsgQueue.startPolling()), using a separate thread. Polling updates the list
>   of topics and messages. Poll interval is configurable, and defaults to 5 sec.
> * each detected change in the queue (add/remove topic, add/remove
>   message) may be communicated to registered listeners. Out-of-band messages
>   are not supported in this version, but it's not too complicated to add them.
>   Applications can create listeners watching queues for newly added messages,
>   or deleted messages, added topics or deleted topics, etc.
> * each instance of MsgQueue using the same physical queue maintains its
>   own view of the queue, keeping track of topics and messages that it considers
>   "processed and discarded". In other words, multiple readers and creators may
>   modify queues, and each knows which messages it already processed and which
>   ones are new. In a similar fashion, instances may willfully "remove" certain
>   topics from their view, even though these topics still physically exist and
>   are available for other instances (and later on they can "add" them to their
>   view again).
>   This somewhat complicated feature was implemented in order to support
>   multiple readers for the same message (e.g. many tasks per one mapred job).
>   Each task needs to register for the same queue, and if they didn't have their
>   own views of the queue, messages would be consumed by the first task that got
>   to them. As it is implemented now, each task may consume messages at its own
>   pace. At the end of the job applications may elect to keep the queue around
>   or to destroy it (and thus remove all topics and messages in it).
> * messages, topics and queues may be destroyed by any user, at which
>   point they are physically removed from the filesystem.
>   All users will gradually update their views, during the next poll
>   operation.
> * there is a command-line tool to examine and modify queues, and also to
>   retrieve and send simple text messages. You can run it like this:
>
>   bin/nutch org.apache.nutch.util.msg.MsgQueueTool ...many options...
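
To make the "own view of the queue" idea from the description concrete, here is a minimal sketch of several readers sharing one physical queue directory while each tracks what it has already processed. It is not the code from msg.tgz: the class name QueueView, its methods, the on-disk layout below "mq", and the sequence counter added to the message id are all assumptions made for illustration, and it uses plain java.nio.file on a local directory rather than the Hadoop filesystem API.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of a per-reader view over a shared, file-based queue. */
class QueueView {
    private final Path queueDir;                            // <root>/mq/<queueId>
    private final String name;                              // publisher/reader name
    private final Set<String> processed = new HashSet<>();  // this instance's private view
    private long seq = 0;                                   // assumption: tie-breaker within one millisecond

    QueueView(Path root, String queueId, String name) throws IOException {
        this.queueDir = root.resolve("mq").resolve(queueId);
        this.name = name;
        Files.createDirectories(queueDir);
    }

    /** Writes one message file into a topic; the id combines creation time and publisher name. */
    String publish(String topic, String text) throws IOException {
        Path topicDir = Files.createDirectories(queueDir.resolve(topic));
        // Zero-padded so that lexical order of ids matches creation order.
        String msgId = String.format("%020d-%06d-%s", System.currentTimeMillis(), seq++, name);
        Files.write(topicDir.resolve(msgId), text.getBytes(StandardCharsets.UTF_8));
        return msgId;
    }

    /** Returns messages this instance has not seen yet, in msgId order, and marks them processed. */
    List<String> poll(String topic) throws IOException {
        Path topicDir = queueDir.resolve(topic);
        List<String> ids = new ArrayList<>();
        if (Files.isDirectory(topicDir)) {
            try (DirectoryStream<Path> files = Files.newDirectoryStream(topicDir)) {
                for (Path f : files) {
                    ids.add(f.getFileName().toString());
                }
            }
        }
        Collections.sort(ids);
        List<String> fresh = new ArrayList<>();
        for (String id : ids) {
            // Only the local view changes; the file stays in place for other readers.
            if (processed.add(topic + "/" + id)) {
                fresh.add(new String(Files.readAllBytes(topicDir.resolve(id)), StandardCharsets.UTF_8));
            }
        }
        return fresh;
    }

    public static void main(String[] args) throws IOException {
        Path root = Files.createTempDirectory("mq-sketch");
        QueueView producer = new QueueView(root, "job_0001", "producer");
        QueueView task1 = new QueueView(root, "job_0001", "task1");
        QueueView task2 = new QueueView(root, "job_0001", "task2");
        producer.publish("in", "first");
        producer.publish("in", "second");
        System.out.println("task1, first poll:  " + task1.poll("in"));
        System.out.println("task1, second poll: " + task1.poll("in"));
        System.out.println("task2, first poll:  " + task2.poll("in"));
    }
}
```

In this sketch task1 and task2 each receive both messages once, because consuming a message only updates the reader's own "processed" set and never deletes the file. A periodic poller as described above could, under the same assumptions, simply call poll() from a background thread every few seconds and hand the result to listeners.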
