Hi all, for NUTCH-251:
Judging by the votes, NUTCH-251 seems to be a relatively significant issue.
Stefan has written a good plugin for the admin GUI, and I have updated it
to work with nutch-0.8 and hadoop 0.4.
Some of the features in the patch are not appropriate for our use cases,
and it requires hadoop changes. I am therefore working on an
alternative implementation of the administration GUI, which consists of a
hadoop server (like JobTracker) that listens for submitted jobs, a web GUI
to submit and track jobs from the browser, and a job runner.
The architecture of the patch is as follows:
- AdminJob, an abstract class representing a job in Nutch
- various classes extending AdminJob, e.g. FetchAdminJob, IndexAdminJob
- a queue which sorts the jobs in priority order, using a modified
topological sort (jobs can depend on each other)
- an interface to submit jobs
- an RPC server to listen for job submissions
- an extension point (basically the same as the previous one)
- a web server to serve the plugin JSPs
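To make the queue idea a bit more concrete, here is a rough sketch of how the ordering could work. This is not the code from the patch: the fields name, priority and dependsOn, the constructors, and the order() method are all assumptions made up for illustration, and the "modified topological sort" is approximated here by Kahn's algorithm with a priority queue holding the jobs whose dependencies are already satisfied.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

class AdminJobQueueSketch {

    /** Hypothetical base class; the real AdminJob in the patch may look different. */
    public abstract static class AdminJob {
        public final String name;
        public final int priority; // assumption: higher value runs earlier
        public final List<AdminJob> dependsOn = new ArrayList<>();

        protected AdminJob(String name, int priority, AdminJob... deps) {
            this.name = name;
            this.priority = priority;
            this.dependsOn.addAll(Arrays.asList(deps));
        }

        /** Placeholder for the actual work of the job. */
        public abstract void run();
    }

    public static class FetchAdminJob extends AdminJob {
        public FetchAdminJob(String name, int priority, AdminJob... deps) {
            super(name, priority, deps);
        }
        @Override public void run() { /* would start a fetch here */ }
    }

    public static class IndexAdminJob extends AdminJob {
        public IndexAdminJob(String name, int priority, AdminJob... deps) {
            super(name, priority, deps);
        }
        @Override public void run() { /* would start indexing here */ }
    }

    /**
     * Returns the jobs in an order where every job comes after its
     * dependencies; among jobs that are ready, higher priority goes first
     * (ties are broken by name so the result is deterministic).
     */
    public static List<AdminJob> order(List<AdminJob> jobs) {
        Map<AdminJob, Integer> pending = new HashMap<>();
        Map<AdminJob, List<AdminJob>> dependents = new HashMap<>();
        for (AdminJob job : jobs) {
            pending.put(job, job.dependsOn.size());
            for (AdminJob dep : job.dependsOn) {
                dependents.computeIfAbsent(dep, k -> new ArrayList<>()).add(job);
            }
        }

        PriorityQueue<AdminJob> ready = new PriorityQueue<>(
            Comparator.comparingInt((AdminJob j) -> j.priority).reversed()
                      .thenComparing(j -> j.name));
        for (AdminJob job : jobs) {
            if (pending.get(job) == 0) {
                ready.add(job);
            }
        }

        List<AdminJob> result = new ArrayList<>();
        while (!ready.isEmpty()) {
            AdminJob next = ready.poll();
            result.add(next);
            for (AdminJob d : dependents.getOrDefault(next, new ArrayList<>())) {
                // One more dependency of d is done; d is ready when none remain.
                if (pending.merge(d, -1, Integer::sum) == 0) {
                    ready.add(d);
                }
            }
        }
        if (result.size() != jobs.size()) {
            throw new IllegalStateException("cyclic or missing dependency");
        }
        return result;
    }
}
```

With this sketch, an index job that depends on a fetch job is always placed after that fetch job, even if the index job has the higher priority; priority only decides the order among jobs that are ready to run.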
On top of this, the features will be:
- submitting jobs from code, command line or web interface,
- tracking jobs from the command line or web interface
- scheduling jobs
I could send the code or details if anyone is interested in pretesting.
I would appreciate any comments and suggestions on this. I am
planning to complete the patch and submit it to Jira ASAP.
Sami Siren wrote:
Hello,
It has been a while since the previous release (0.8.1), and looking at the
great fixes done in trunk, I'd like to start thinking about baking a new
release soon.
Looking at the Jira roadmaps, there is 1 blocking issue (fixing the
license headers) for 0.8.2 and two other blocking issues for 0.9.0, of
which I think NUTCH-233 is safe to put in.
The top 10 voted issues are currently:
NUTCH-61 Adaptive re-fetch interval. Detecting unmodified content
NUTCH-48 "Did you mean" query enhancement/refinement feature
NUTCH-251 Administration GUI
NUTCH-289 CrawlDatum should store IP address
NUTCH-36 Chinese in Nutch
NUTCH-185 XMLParser is configurable xml parser plugin.
NUTCH-59 meta data support in webdb
NUTCH-92 DistributedSearch incorrectly scores results
NUTCH-68 A tool to generate arbitrary fetchlists
NUTCH-87 Efficient site-specific crawling for a large number of sites
Are there any opinions about issues that should go in before the next
release? (Answering yes means that you are willing to provide a patch for
it.)
--
Sami Siren