Hi,

I have been thinking about this problem (automatic detection
of different URLs with the same content) lately, and
the only kind of solution I've found so far is to
take a "selective hash" of the contents of each URL to
detect when two "close" URLs produce the same content.
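
To make the "selective hash" idea a bit more concrete, here is a minimal
sketch in Python. It is only an illustration under my own assumptions: the
names selective_hash and VOLATILE_PATTERNS are hypothetical, the list of
"volatile" patterns is a guess at what tends to differ between otherwise
identical pages, and none of this is Nutch code.

```python
import hashlib
import re

# Hypothetical patterns for page parts that may differ between two fetches
# of what is really the same page. Tune these for the site being crawled.
VOLATILE_PATTERNS = [
    re.compile(r"<script\b.*?</script>", re.IGNORECASE | re.DOTALL),
    re.compile(r"<!--.*?-->", re.DOTALL),
    re.compile(r"\b(?:jsessionid|sessionid|sid)=[0-9A-Za-z]+", re.IGNORECASE),
]


def selective_hash(html):
    """Hash only the parts of a page that are expected to be stable."""
    text = html
    for pattern in VOLATILE_PATTERNS:
        text = pattern.sub("", text)
    text = re.sub(r"<[^>]+>", " ", text)   # drop the remaining markup
    text = " ".join(text.split()).lower()  # normalize whitespace and case
    return hashlib.sha256(text.encode("utf-8")).hexdigest()
```

Two "close" URLs whose pages give the same selective_hash value would then
be treated as candidates for being the same document.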

Regards,

--- Luke Baker <[EMAIL PROTECTED]> wrote:
> Hey all,
> 
> I'm trying to figure out how one could add functionality to Nutch which
> would allow users to specify a substitution/replace using regex for
> URLs.  This is more useful in an intranet crawl than a large scale
> internet crawl.  The main use for this might be for stripping off
> session IDs of URLs.  They are unnecessary and can result in many
> "duplicate" URLs (URLs that are identical with the exception of the
> session ID).
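
A rough sketch of the regex substitution described above, in Python rather
than inside Nutch itself. The rule list and the name normalize_url are
hypothetical, and the parameter names (jsessionid, PHPSESSID, sid) are just
common examples, not something taken from a real crawl.

```python
import re

# Hypothetical (pattern, replacement) rules, applied in order.
RULES = [
    # ";jsessionid=..." path parameter, up to the query string or fragment
    (re.compile(r";jsessionid=[^?#]*", re.IGNORECASE), ""),
    # "PHPSESSID=..." or "sid=..." query parameter, keeping the separator
    (re.compile(r"([?&])(?:PHPSESSID|sid)=[^&#]*&?", re.IGNORECASE), r"\1"),
    # dangling "?" or "&" left at the end of the query string
    (re.compile(r"[?&]+(?=#|$)"), ""),
]


def normalize_url(url):
    """Apply every substitution rule to the URL and return the result."""
    for pattern, replacement in RULES:
        url = pattern.sub(replacement, url)
    return url
```

With rules like these, URLs that differ only in their session ID collapse to
one string, so the usual duplicate-URL check can catch them.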
> 
> Where would be the best place to put such functionality?
> I've thought of a few places, but I wonder about the scalability of some
> of them.
> * create it as some sort of tool that analyzes the WebDB before
>   generating the fetch list
> * do the replace as we discover URLs
> * do the replace just before the URLs are actually fetched
> Also, I'm curious about automatic detection of things like session IDs.
> Does anyone have some good ideas about that?  My only idea is doing
> extra fetches for each page (taking off different parameters as we go)
> and comparing the pages' content with one another.  Again, this extra
> work probably does not scale beyond an intranet crawl.  The reason I
> mention this now is that if we want to allow the possibility of
> automatic detection, then it might affect where we want the URL regex
> functionality to go.
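
The "extra fetches" idea could look roughly like the sketch below. It is only
an illustration: find_redundant_params is a hypothetical name, and the fetch
and fingerprint callables are assumed to be supplied by the caller (for
example a real HTTP fetch and some content hash), so nothing here is Nutch
API.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def find_redundant_params(url, fetch, fingerprint):
    """Return query parameter names whose removal leaves the content unchanged.

    fetch(url) returns the page body; fingerprint(body) returns a value that
    is equal for pages considered identical. Both are caller-supplied.
    """
    parts = urlsplit(url)
    params = parse_qsl(parts.query, keep_blank_values=True)
    baseline = fingerprint(fetch(url))
    redundant = []
    for i, (name, _value) in enumerate(params):
        # Rebuild the URL with exactly one parameter taken off.
        remaining = params[:i] + params[i + 1:]
        candidate = urlunsplit(parts._replace(query=urlencode(remaining)))
        if fingerprint(fetch(candidate)) == baseline:
            redundant.append(name)
    return redundant
```

This costs one extra fetch per query parameter on every page tested, which is
the scalability concern raised above.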
> 
> Thanks for the pointers,
> 
> Luke Baker


_______________________________________________
Nutch-developers mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
