Hi,

Armel T. Nene wrote:
> Hi guys,
>
> I want to extend Nutch to use real-time indexing on local file system. I
> have been through the source code to find out ways to modify values stored
> in CrawlDB. The idea is simple:
>
> I have an external program (or a script) which checks for changes in a
> directory (url injected in the crawldb). When there are new changes
> recorded, the program will update the status in the crawldb and generate a
> new fetch list for the fetcher to fetch. I do not want to make great changes
> to the nutch source code as I want the program to be compatible with future
> releases. Now, I know the crawldatum is saved in the crawldb with the url. I
> am not too sure but I think the url is the key to retrieve the crawldatum.
> For my program to work successfully, I need to know the following:
>
> * How to read data from the crawldb; what data structure does it use
>   and how to referenced to it?
>
Crawldb is essentially a list of <url, CrawlDatum> pairs and is stored as a
MapFile. So you can look up the CrawlDatum for a url with MapFile.Reader.get.

> * How to write back to the crawldb; updating information back to the
>   crawldb or probably creating a new with changed and unchanged values.
>

The current FS implementation is write-once, so you can't modify an existing
crawldb in place. But you can read it entry by entry (possibly with
MapFile.Reader.next) and then write a new one with MapFile.Writer.

> This is an extract from the crawldb:
>
> http://some-url.com/ Version: 4
> Status: 2 (DB_fetched)
> Fetch time: Thu Feb 22 12:44:05 GMT 2007
> Modified time: Thu Jan 01 01:00:00 GMT 1970
> Retries since fetch: 0
> Retry interval: 30.0 days
> Score: 1.0323955
> Signature: f4c14c46074b66aad8829b8aa84cd636
> Metadata: null
>
> How can get this information with an external program and modify/ update it.
> Once I know how to implement that part, I can call nutch in the usual way of
> generate - fetch - updatedb - updatelinkdb -index -etc.. so generate will
> have the new value that I want re-indexed. This will stop the fetcher from
> fetching a long list of urls (changed or unchanged but need fetching because
> of their next_fetch_time is due). The program gets its update from the
> underlying OS to know notify about any changes to files and folders being
> monitored. Once the program is working with sufficient tests, I will be
> willing to share the source code; it's written in java and doesn't need any
> script to launch nutch.
>
> I will be looking forward to your kind support.
>
> Armel
>
> -------------------------------------------------
> Armel T. Nene
> iDNA Solutions
> Tel: +44 (207) 257 6124
> Mobile: +44 (788) 695 0483
> <http://blog.idna-solutions.com/> http://blog.idna-solutions.com
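
To make the read-everything-then-write-a-new-one idea concrete, here is a rough sketch in plain Java. It does not touch the Nutch or Hadoop APIs: a TreeMap stands in for the MapFile, and Datum is a hypothetical stand-in for CrawlDatum reduced to two fields. The class name, the rewrite method, and the choice to reset only the fetch time (on the assumption that generate selects entries whose fetch time has passed) are all illustrative, not taken from the Nutch source. In a real version the loop would presumably be driven by MapFile.Reader.next and each entry written with MapFile.Writer.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Set;
import java.util.SortedMap;
import java.util.TreeMap;

public class CrawlDbRewriteSketch {

    /** Hypothetical stand-in for CrawlDatum, reduced to two fields. */
    static final class Datum {
        final int status;      // e.g. 2 = DB_fetched, as in the dump above
        final long fetchTime;  // next fetch time, in milliseconds

        Datum(int status, long fetchTime) {
            this.status = status;
            this.fetchTime = fetchTime;
        }
    }

    /**
     * Builds a NEW store from the old one instead of modifying it.
     * Entries are visited in key order; entries whose url is in
     * changedUrls get their fetch time set to now, all others are
     * copied over unchanged. The old store is never written to.
     */
    static SortedMap<String, Datum> rewrite(SortedMap<String, Datum> oldDb,
                                            Set<String> changedUrls,
                                            long now) {
        SortedMap<String, Datum> newDb = new TreeMap<>();
        for (Map.Entry<String, Datum> entry : oldDb.entrySet()) {
            Datum datum = entry.getValue();
            if (changedUrls.contains(entry.getKey())) {
                // Assumption: making the entry due is enough for it to be picked up.
                datum = new Datum(datum.status, now);
            }
            newDb.put(entry.getKey(), datum);
        }
        return newDb;
    }

    public static void main(String[] args) {
        long farFuture = System.currentTimeMillis() + 30L * 24 * 60 * 60 * 1000;
        SortedMap<String, Datum> oldDb = new TreeMap<>();
        oldDb.put("file:///data/a.txt", new Datum(2, farFuture));
        oldDb.put("file:///data/b.txt", new Datum(2, farFuture));

        // Pretend the file watcher reported a change to b.txt only.
        Set<String> changed = Collections.singleton("file:///data/b.txt");
        SortedMap<String, Datum> newDb =
                rewrite(oldDb, changed, System.currentTimeMillis());

        for (Map.Entry<String, Datum> entry : newDb.entrySet()) {
            System.out.println(entry.getKey() + " fetchTime=" + entry.getValue().fetchTime);
        }
    }
}
```

Walking the old entries in order and writing them out in that same order matters for the real thing too, because a MapFile expects its keys to be appended in sorted order.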
_______________________________________________
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers