On Sat, May 15, 2004 at 10:00:33AM -0400, Luke Baker wrote:
> Hey all,
> 
> I'm trying to figure out how one could add functionality to Nutch which 
> would allow users to specify a substitution/replace using regex for 
> URLs.  This is more useful in an intranet crawl than a large scale 
> internet crawl.  The main use for this might be for stripping off 
> session IDs of URLs.  They are unnecessary and can result in many 
> "duplicate" URLs (URLs that are identical with the exception of the 
> session ID).

It boils down to how two URLs are considered "equivalent".
A small regex-based class may be a reasonable way to express that.
The "equivalence" should be user-definable via a config file.
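
A minimal sketch of what such a class could look like. The class name
UrlRegexNormalizer, its methods, and the jsessionid pattern below are
all hypothetical illustrations, not existing Nutch code. Two URLs count
as equivalent when they normalize to the same string after an ordered
list of regex substitutions has been applied.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper (not part of Nutch): applies ordered regex
// substitutions so that "equivalent" URLs collapse to one string.
public class UrlRegexNormalizer {

    // One substitution rule: a compiled pattern plus its replacement.
    private static final class Rule {
        final Pattern pattern;
        final String replacement;

        Rule(String regex, String replacement) {
            this.pattern = Pattern.compile(regex);
            this.replacement = replacement;
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    // Rules are applied in the order they were added.
    public UrlRegexNormalizer addRule(String regex, String replacement) {
        rules.add(new Rule(regex, replacement));
        return this;
    }

    // Runs every rule over the URL and returns the rewritten form.
    public String normalize(String url) {
        String result = url;
        for (Rule rule : rules) {
            result = rule.pattern.matcher(result).replaceAll(rule.replacement);
        }
        return result;
    }

    // Two URLs are "equivalent" if they normalize to the same string.
    public boolean equivalent(String a, String b) {
        return normalize(a).equals(normalize(b));
    }
}
```

For the session ID case, a rule such as
addRule("(?i);jsessionid=[^?#]*", "") would strip a ;jsessionid=...
path parameter while leaving the query string alone. In practice the
rules would be read from the config file rather than hard-coded.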

> 
> Where would be the best place to put such functionality?
> I've thought of a few places, but I wonder about the scalability of some 
> of them.
> * create it as some sort of tool that analyzes the WebDB before 
> generating the fetch list
> * do the replace as we discover URLs
> * do the replace just before the URLs are actually fetched

Based on my understanding of the Nutch code, it might be needed in all of
these places. Again, it would be a good idea to make it configurable, so the
user can switch it on or off.
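
A rough sketch of the on/off switch, assuming the settings arrive as a
java.util.Properties object. The property keys (url.regex.enabled,
url.regex.pattern, url.regex.replacement) and the class name are made up
for illustration and are not actual Nutch configuration names. The same
object could then be called from each of the places listed above.

```java
import java.util.Properties;
import java.util.regex.Pattern;

// Hypothetical wrapper (not part of Nutch): a single regex substitution
// that can be switched on or off through configuration.
public class SwitchableUrlNormalizer {
    private final boolean enabled;
    private final Pattern pattern;      // null when no pattern is configured
    private final String replacement;

    public SwitchableUrlNormalizer(Properties conf) {
        // Assumed keys; the feature is off unless explicitly enabled.
        this.enabled = Boolean.parseBoolean(
                conf.getProperty("url.regex.enabled", "false"));
        String regex = conf.getProperty("url.regex.pattern");
        this.pattern = (regex == null) ? null : Pattern.compile(regex);
        this.replacement = conf.getProperty("url.regex.replacement", "");
    }

    // Returns the URL unchanged when disabled or when no pattern is set.
    public String normalize(String url) {
        if (!enabled || pattern == null) {
            return url;
        }
        return pattern.matcher(url).replaceAll(replacement);
    }
}
```

Defaulting to "off" keeps existing crawls unchanged unless the user
opts in.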

John


_______________________________________________
Nutch-developers mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-developers