On Sat, May 15, 2004 at 10:00:33AM -0400, Luke Baker wrote:
> Hey all,
>
> I'm trying to figure out how one could add functionality to Nutch which
> would allow users to specify a substitution/replace using regex for
> URLs. This is more useful in an intranet crawl than a large scale
> internet crawl. The main use for this might be for stripping off
> session IDs of URLs. They are unnecessary and can result in many
> "duplicate" URLs (URLs that are identical with the exception of the
> session ID).

It boils down to how two URLs are considered "equivalent". A small
regex-based class may be a reasonable approach. The "equivalence" should
be definable by the user via a config file.

>
> Where would be the best place to put such functionality?
> I've thought of a few places, but I wonder about the scalability of some
> of them.
> * create it as some sort of tool that analyzes the WebDB before
> generating the fetch list
> * do the replace as we discover URLs
> * do the replace before the actual fetch URLs

Based on my understanding of the Nutch code, it might be needed in all
of these places. Again, it would be a good idea to make it configurable,
so the user can switch it on/off.

John
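
To make the idea concrete, here is a rough sketch of a regex-based
"equivalence" helper. It is written in Python only for brevity and is
not Nutch code; the names (RULES, load_rules, normalize, equivalent),
the tab-separated rule format, and the session-ID patterns are all
assumptions made up for illustration.

```python
import re

# Hypothetical default rules: (compiled pattern, replacement), applied in order.
RULES = [
    # Path-style session IDs such as ";jsessionid=ABC123"
    (re.compile(r";jsessionid=[^?#]*", re.IGNORECASE), ""),
    # Query-style session IDs such as "?sid=42" or "&PHPSESSID=abc"
    (re.compile(r"([?&])(?:PHPSESSID|sid)=[^&#]*&?", re.IGNORECASE), r"\1"),
    # Clean up a dangling "?" or "&" left behind by the rule above
    (re.compile(r"[?&](?=#|$)"), ""),
]


def load_rules(lines):
    """Parse user rules from config lines of the form 'pattern<TAB>replacement'.

    Blank lines and lines starting with '#' are skipped (an assumed format).
    """
    rules = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip() or line.startswith("#"):
            continue
        pattern, _, replacement = line.partition("\t")
        rules.append((re.compile(pattern), replacement))
    return rules


def normalize(url, rules=RULES, enabled=True):
    """Apply each substitution rule in order; 'enabled' is the on/off switch."""
    if not enabled:
        return url
    for pattern, replacement in rules:
        url = pattern.sub(replacement, url)
    return url


def equivalent(a, b, rules=RULES):
    """Two URLs are 'equivalent' if they normalize to the same string."""
    return normalize(a, rules) == normalize(b, rules)
```

Since normalize() is a pure string function, the same call could be
made at any of the three places listed above (over the WebDB, at URL
discovery, or just before fetching), with the enabled flag driven by
the config.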
