[ http://issues.apache.org/jira/browse/NUTCH-368?page=all ]
Andrzej Bialecki updated NUTCH-368:
------------------------------------
Attachment: Fetcher-ctrl.patch
This patch uses the message queueing framework to implement the following
functionality in Fetcher:
* ability to gracefully stop fetching the current segment. This is different
from simply killing the job in that the partial results (partially fetched
segment) are available and can be further processed. This is especially useful
for fetching large segments with long "tails", i.e. pages which are fetched
very slowly, either because of politeness settings or the target site's
bandwidth limitations.
* ability to dynamically adjust the number of fetcher threads. For a
long-running fetch job it makes sense to decrease the number of fetcher threads
during the day, and increase it during the night. This can now be done from a
cron script, using the MsgQueueTool command-line tool.
It's worthwhile to note that the patch itself is trivial, and most of the work
is done by the MQ framework.
After you apply this patch you can start a long-running fetcher job, note its
<job_id>, and control the fetcher this way:
bin/nutch org.apache.nutch.util.msg.MsgQueueTool -createMsg <job_id> ctrl THREADS 50
This adjusts the number of threads to 50 (starting more threads or stopping
some threads as necessary).
To stop the fetcher, run:
bin/nutch org.apache.nutch.util.msg.MsgQueueTool -createMsg <job_id> ctrl HALT
This gracefully shuts down all threads after they finish fetching their
current URL, and then finishes the job, keeping the partial segment data intact.
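
To illustrate the two control commands above, here is a minimal Java sketch of
how a receiver might interpret them. It is not the code from
Fetcher-ctrl.patch: the class FetcherControlSketch, its methods, and the
assumption that a control message arrives as the plain text "THREADS 50" or
"HALT" are all hypothetical.

```java
import java.util.Locale;

/** Hypothetical sketch; NOT the code from Fetcher-ctrl.patch. */
class FetcherControlSketch {
    private int targetThreads;
    private boolean halted = false;

    FetcherControlSketch(int initialThreads) {
        this.targetThreads = initialThreads;
    }

    /**
     * Applies one control message, assumed to be plain text such as
     * "THREADS 50" or "HALT". Returns true if it was recognized and applied.
     */
    boolean apply(String command) {
        String[] parts = command.trim().split("\\s+");
        String verb = parts[0].toUpperCase(Locale.ROOT);
        if (verb.equals("HALT") && parts.length == 1) {
            halted = true;       // threads finish their current URL, then stop
            targetThreads = 0;
            return true;
        }
        if (verb.equals("THREADS") && parts.length == 2) {
            try {
                int n = Integer.parseInt(parts[1]);
                if (n < 1 || halted) {
                    return false; // ignore bad counts and anything after HALT
                }
                targetThreads = n;
                return true;
            } catch (NumberFormatException e) {
                return false;
            }
        }
        return false;
    }

    /** Positive: threads to start. Negative: threads to let finish and stop. */
    int delta(int runningThreads) {
        return targetThreads - runningThreads;
    }

    boolean isHalted() {
        return halted;
    }

    int getTargetThreads() {
        return targetThreads;
    }

    public static void main(String[] args) {
        FetcherControlSketch ctrl = new FetcherControlSketch(10);
        ctrl.apply("THREADS 50");
        System.out.println("threads to start: " + ctrl.delta(10));
        ctrl.apply("HALT");
        System.out.println("halted: " + ctrl.isHalted());
    }
}
```

The sketch only tracks a target thread count. Starting threads, or letting
surplus threads exit after their current URL, is left to the caller.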
> Message queueing system
> -----------------------
>
> Key: NUTCH-368
> URL: http://issues.apache.org/jira/browse/NUTCH-368
> Project: Nutch
> Issue Type: New Feature
> Affects Versions: 0.9.0
> Reporter: Andrzej Bialecki
> Assigned To: Andrzej Bialecki
> Attachments: Fetcher-ctrl.patch, msg.tgz
>
>
> This is an implementation of a filesystem-based message queueing system. The
> motivation for this functionality is explained in HADOOP-490. There is
> nothing Nutch-specific in this implementation, so if it's considered
> generally useful it could be moved to Hadoop.
> Below are excerpts from the included javadocs.
> The model of the system is as follows:
> * applications (including map-reduce jobs) may create their own separate
> message queueing area. Alternatively, they can specifically ask for a named
> message queue, belonging to a different application or existing as a
> system-wide queue. Message queues are created under "/mq", followed by the
> message queue id (for map-reduce jobs this is the job id; otherwise it can be
> any other name passed as the job id to the constructor).
> Please see the example for more information.
> * a single unit of information passing through queues is a Msg, which has
> a unique identifier (consisting of creation time and publisher name), string
> subject, and content (Writable).
> * a single MsgQueue in fact consists of any number of topics. There are
> four predefined ones: in, out, err, and ctrl.
> * messages are published to topics, which present a sequential view of
> messages, sorted by msgId (which corresponds to their order of arrival).
> * each message queue may periodically poll for changes
> (MsgQueue.startPolling()), using a separate thread. Polling updates the list
> of topics and messages. The poll interval is configurable and defaults to
> 5 seconds.
> * each detected change in the queue (add/remove topic, add/remove
> message) may be communicated to registered listeners. Out-of-band messages
> are not supported in this version, but it's not too complicated to add them.
> Applications can create listeners watching queues for newly added messages,
> or deleted messages, added topics or deleted topics, etc.
> * each instance of MsgQueue using the same physical queue maintains its
> own view of the queue, keeping track of topics and messages that it considers
> "processed and discarded". In other words, multiple readers and creators may
> modify queues, and each knows which messages it already processed and which
> ones are new. In a similar fashion, instances may willfully "remove" certain
> topics from their view, even though these topics still physically exist and
> are available for other instances (and later on they can "add" them to their
> view again).
> This somewhat complicated feature was implemented in order to support
> multiple readers for the same message (e.g. many tasks per one mapred job).
> Each task needs to register for the same queue, and if they didn't have their
> own views of the queue, messages would be consumed by the first task that got
> to them. As it is implemented now, each task may consume messages at its own
> pace. At the end of the job applications may elect to keep the queue around
> or to destroy it (and thus remove all topics and messages in it).
> * messages, topics and queues may be destroyed by any user, at which
> point they are physically removed from the filesystem. All users will
> gradually update their views during their next poll operation.
> * there is a command-line tool to examine and modify queues, and also to
> retrieve and send simple text messages. You can run it like this:
> bin/nutch org.apache.nutch.util.msg.MsgQueueTool ...many options...
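
The model above (message ids built from creation time and publisher name,
topics ordered by msgId, and a private view per MsgQueue instance) can be
illustrated with a small in-memory Java sketch. This is not the MsgQueue API
from msg.tgz: every class and method name below is hypothetical, and an
in-memory sorted set stands in for the topic directory on the filesystem.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Objects;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical: one reader's private view of a shared topic. */
final class QueueViewSketch {
    private final SharedTopicSketch topic;
    private final Set<MsgIdSketch> processed = new HashSet<>();

    QueueViewSketch(SharedTopicSketch topic) {
        this.topic = topic;
    }

    /** One poll: returns messages this view has not seen yet, in msgId order. */
    List<MsgIdSketch> poll() {
        List<MsgIdSketch> current = topic.list();
        // Forget ids that another user has physically removed.
        processed.retainAll(new HashSet<>(current));
        List<MsgIdSketch> fresh = new ArrayList<>();
        for (MsgIdSketch id : current) {
            if (processed.add(id)) {
                fresh.add(id);
            }
        }
        return fresh;
    }

    public static void main(String[] args) {
        SharedTopicSketch ctrl = new SharedTopicSketch();
        QueueViewSketch taskA = new QueueViewSketch(ctrl);
        QueueViewSketch taskB = new QueueViewSketch(ctrl);
        ctrl.publish(new MsgIdSketch(System.currentTimeMillis(), "admin"));
        System.out.println("task A sees: " + taskA.poll());
        System.out.println("task B sees: " + taskB.poll());
    }
}

/** Hypothetical stand-in for one topic shared by all readers. */
final class SharedTopicSketch {
    private final TreeSet<MsgIdSketch> ids = new TreeSet<>();

    synchronized void publish(MsgIdSketch id) { ids.add(id); }

    synchronized void destroy(MsgIdSketch id) { ids.remove(id); }

    synchronized List<MsgIdSketch> list() { return new ArrayList<>(ids); }
}

/** Hypothetical message id: creation time plus publisher name. */
final class MsgIdSketch implements Comparable<MsgIdSketch> {
    final long created;
    final String publisher;

    MsgIdSketch(long created, String publisher) {
        this.created = created;
        this.publisher = publisher;
    }

    @Override
    public int compareTo(MsgIdSketch o) {
        int c = Long.compare(created, o.created);
        return c != 0 ? c : publisher.compareTo(o.publisher);
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof MsgIdSketch && compareTo((MsgIdSketch) o) == 0;
    }

    @Override
    public int hashCode() { return Objects.hash(created, publisher); }

    @Override
    public String toString() { return created + "-" + publisher; }
}
```

Because each view keeps its own "processed" set, a message consumed by one
task remains visible to the others, which is the property the description
above relies on for multiple tasks per mapred job.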