Hi, I have been thinking about this problem (automatic detection of different URLs with the same content) lately, and the only kind of solution I've found so far is to take a "selective hash" of each URL's content to detect when two "close" URLs produce the same content.
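To make the idea concrete, here is a minimal sketch of one possible "selective hash". It assumes that "selective" means hashing only the visible text, after dropping markup (whose links may embed the session ID) and normalizing whitespace and case. The names `selective_hash` and `same_content` are hypothetical and not part of Nutch.

```python
import hashlib
import re

# Assumption: markup is discarded because link attributes inside tags are
# where session IDs usually leak into the page body.
TAG_RE = re.compile(r"<[^>]+>")
WS_RE = re.compile(r"\s+")


def selective_hash(html: str) -> str:
    """Hash only the visible text of a page, ignoring tags, case and spacing."""
    text = TAG_RE.sub(" ", html)                  # drop tags and their attributes
    text = WS_RE.sub(" ", text).strip().lower()   # normalize whitespace and case
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def same_content(html_a: str, html_b: str) -> bool:
    """Treat two pages as duplicates when their selective hashes match."""
    return selective_hash(html_a) == selective_hash(html_b)


# Two fetches of the same page that differ only in the session ID in a link.
page_a = '<a href="/p?sid=111">Next</a> <p>Hello  world</p>'
page_b = '<a href="/p?sid=222">Next</a><p>Hello world</p>'
page_c = '<a href="/p?sid=333">Next</a><p>Another page</p>'
```

This only catches pages whose visible text is identical after normalization. Content inside script or style elements is not removed by the simple tag pattern, so a real implementation would probably need an HTML parser and a choice of which parts of the page to hash.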
Regards,

--- Luke Baker <[EMAIL PROTECTED]> wrote:
> Hey all,
>
> I'm trying to figure out how one could add functionality to Nutch which
> would allow users to specify a substitution/replace using regex for
> URLs. This is more useful in an intranet crawl than a large scale
> internet crawl. The main use for this might be for stripping off
> session IDs of URLs. They are unnecessary and can result in many
> "duplicate" URLs (URLs that are identical with the exception of the
> session ID).
>
> Where would be the best place to put such functionality?
> I've thought of a few places, but I wonder about the scalability of some
> of them.
> * create it as some sort of tool that analyzes the WebDB before
>   generating the fetch list
> * do the replace as we discover URLs
> * do the replace before the actual fetch URLs
>
> Also, I'm curious about automatic detection of things like session IDs.
> Does anyone have some good ideas about that? My only idea is doing
> extra fetches for each page (taking off different parameters as we go)
> and comparing the pages' content with one another. Again this extra
> work is probably not scalable to past an intranet crawl. The reason I
> mention this now is that if we want to allow the possibility of
> automatic detection, then it might affect where we want the URL regex
> functionality to go.
>
> Thanks for the pointers,
>
> Luke Baker
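The regex substitution described in the quoted message could look roughly like the sketch below. The rule list, the parameter names (`jsessionid`, `phpsessid`, `sid`) and the function `normalize_url` are assumptions for illustration only. This is not an existing Nutch API, and it does not settle where in the crawl the replace should run.

```python
import re

# Hypothetical user-supplied rules: (pattern, replacement) pairs applied in order.
RULES = [
    # Path-style session ID, e.g. /page.jsp;jsessionid=ABC123
    (re.compile(r";jsessionid=[^?#]*", re.IGNORECASE), ""),
    # Query-style session ID; keep the leading '?' or '&' so other
    # parameters stay correctly separated.
    (re.compile(r"([?&])(?:phpsessid|sid)=[^&#]*&?", re.IGNORECASE), r"\1"),
    # Clean up a '?' or '&' left dangling at the end of the query.
    (re.compile(r"[?&](?=#|$)"), ""),
]


def normalize_url(url: str) -> str:
    """Apply every substitution rule to the URL, in order."""
    for pattern, replacement in RULES:
        url = pattern.sub(replacement, url)
    return url
```

With these rules, `http://intranet/a?sid=42&x=1` and `http://intranet/a?x=1&sid=42` both normalize to `http://intranet/a?x=1`, so they would be treated as one URL regardless of which of the three places the function is called from.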
_______________________________________________
Nutch-developers mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
