Hi guys,
I want to extend Nutch to support real-time indexing of a local file system.
I have been through the source code looking for ways to modify the values
stored in the CrawlDB. The idea is simple:
I have an external program (or a script) which checks for changes in a
directory (whose URL has been injected into the crawldb). When new changes
are recorded, the program will update the status in the crawldb and generate
a new fetch list for the fetcher to fetch. I do not want to make major
changes to the Nutch source code, as I want the program to remain compatible
with future releases. I know the CrawlDatum is saved in the crawldb along
with the URL. I am not too sure, but I think the URL is the key used to
retrieve the CrawlDatum. For my program to work, I need to know the
following:
* How to read data from the crawldb: what data structure does it use,
and how do I reference it?
* How to write back to the crawldb: either updating entries in place or
creating a new crawldb containing both the changed and unchanged values.
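For the read side, one low-intrusion option is to leave the Nutch code alone and shell out to the stock `readdb` tool. The sketch below assumes that `bin/nutch readdb <crawldb> -url <url>` prints the entry for a single URL; the class name, method names and paths are hypothetical. It does not cover writing back to the crawldb.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: looks up one crawldb entry by running the Nutch CLI.
class CrawlDbLookup {

    // Builds the command line "<nutchHome>/bin/nutch readdb <crawlDb> -url <url>".
    // Assumption: the installed Nutch release supports the -url option of readdb.
    static List<String> readDbCommand(String nutchHome, String crawlDb, String url) {
        return Arrays.asList(nutchHome + "/bin/nutch", "readdb", crawlDb, "-url", url);
    }

    // Runs the command and returns every line it prints (stdout and stderr merged).
    static List<String> lookup(String nutchHome, String crawlDb, String url)
            throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(readDbCommand(nutchHome, crawlDb, url));
        pb.redirectErrorStream(true);
        Process process = pb.start();
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        int exit = process.waitFor();
        if (exit != 0) {
            throw new IOException("nutch readdb exited with status " + exit);
        }
        return lines;
    }
}
```

A call such as `CrawlDbLookup.lookup("/opt/nutch", "crawl/crawldb", "http://some-url.com/")` would then return the printed entry as a list of lines; both paths here are placeholders.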
This is an extract from the crawldb:
http://some-url.com/ Version: 4
Status: 2 (DB_fetched)
Fetch time: Thu Feb 22 12:44:05 GMT 2007
Modified time: Thu Jan 01 01:00:00 GMT 1970
Retries since fetch: 0
Retry interval: 30.0 days
Score: 1.0323955
Signature: f4c14c46074b66aad8829b8aa84cd636
Metadata: null
How can I get this information from an external program and modify/update it?
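As a starting point for the external program, a text record like the extract above can be parsed into named fields. This is only a sketch: it assumes the record layout is exactly as shown (URL and "Version:" on the first line, then one "Name: value" pair per line), and the class and the "Url" key are made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical parser for one crawldb record in the text form shown above.
class CrawlDbRecord {

    // Returns the fields in the order they appear; the URL is stored under "Url".
    static Map<String, String> parse(String record) {
        Map<String, String> fields = new LinkedHashMap<>();
        String[] lines = record.trim().split("\\r?\\n");

        // First line: "<url> Version: <n>". Use the last "Version:" in case
        // the URL itself happens to contain that text.
        String first = lines[0].trim();
        int versionAt = first.lastIndexOf("Version:");
        if (versionAt > 0) {
            fields.put("Url", first.substring(0, versionAt).trim());
            fields.put("Version", first.substring(versionAt + "Version:".length()).trim());
        } else {
            fields.put("Url", first);
        }

        // Remaining lines: split at the first colon only, because values such
        // as "Thu Feb 22 12:44:05 GMT 2007" contain colons of their own.
        for (int i = 1; i < lines.length; i++) {
            String line = lines[i].trim();
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // not a "Name: value" line, skip it
            }
            fields.put(line.substring(0, colon).trim(), line.substring(colon + 1).trim());
        }
        return fields;
    }
}
```

Values are kept as plain strings here; converting "Fetch time" to a date or "Score" to a float is left to the caller.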
Once I know how to implement that part, I can call Nutch in the usual way
(generate - fetch - updatedb - updatelinkdb - index - etc.), so that generate
picks up the entries that I want re-indexed. This will stop the fetcher from
fetching a long list of URLs (changed or unchanged, but fetched anyway
because their next fetch time is due). The program gets its updates from the
underlying OS, which notifies it of any changes to the files and folders
being monitored. Once the program is working and sufficiently tested, I will
be willing to share the source code; it is written in Java and doesn't need
any script to launch Nutch.
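The monitoring side could look roughly like the sketch below. It assumes Java 7 or later, where `java.nio.file.WatchService` exposes the OS change notifications; the class and method names are hypothetical. It watches one directory (not its subdirectories) and turns each reported change into a `file:` URL, which would still have to match the form of the URLs injected into the crawldb.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.StandardWatchEventKinds;
import java.nio.file.WatchEvent;
import java.nio.file.WatchKey;
import java.nio.file.WatchService;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical watcher: reports changed entries of one directory as file: URLs.
class DirectoryChangeWatcher {

    // Converts an entry name (relative to the watched directory) to a file: URL.
    static String toFileUrl(Path watchedDir, Path relative) {
        return watchedDir.resolve(relative).toAbsolutePath().normalize().toUri().toString();
    }

    // Waits up to timeoutMillis for the first batch of change events and
    // returns the affected entries as URLs (empty if nothing changed in time).
    static List<String> pollChanges(Path dir, long timeoutMillis)
            throws IOException, InterruptedException {
        List<String> urls = new ArrayList<>();
        try (WatchService watcher = dir.getFileSystem().newWatchService()) {
            dir.register(watcher,
                    StandardWatchEventKinds.ENTRY_CREATE,
                    StandardWatchEventKinds.ENTRY_MODIFY,
                    StandardWatchEventKinds.ENTRY_DELETE);
            WatchKey key = watcher.poll(timeoutMillis, TimeUnit.MILLISECONDS);
            if (key == null) {
                return urls; // timed out without any event
            }
            for (WatchEvent<?> event : key.pollEvents()) {
                if (event.kind() == StandardWatchEventKinds.OVERFLOW) {
                    continue; // events were lost; a full rescan would be needed
                }
                urls.add(toFileUrl(dir, (Path) event.context()));
            }
            key.reset();
        }
        return urls;
    }
}
```

A long-running program would keep the `WatchService` open and loop on it instead of creating one per call; the single-shot form is only meant to keep the sketch short.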
I will be looking forward to your kind support.
Armel
-------------------------------------------------
Armel T. Nene
iDNA Solutions
Tel: +44 (207) 257 6124
Mobile: +44 (788) 695 0483
http://blog.idna-solutions.com
_______________________________________________
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers