Hi Andrzej,
And even with deduping, we run into problems, especially for top-level pages.
These often change slightly between crawls (so an exact content
hash won't match), and if
http://example.com is found during one pass, and a different
http://www.example.com is found at a later crawl, you wind up with
two hits for the same result. What's worse is that typically the summary
is exactly the same (from the body of the page), so to a user it's
painfully obvious that there are (near) duplicates in the index.
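(For concreteness, collapsing those two host variants onto a single key
might look roughly like the toy Java sketch below; the UrlKey name and
the bare strip-a-leading-"www." rule are illustrative assumptions, not
actual Nutch code.)

  import java.net.URL;

  // Toy example only: collapse host variants like example.com and
  // www.example.com onto a single key.
  public class UrlKey {
    public static String keyFor(String url) throws Exception {
      URL u = new URL(url);
      String host = u.getHost().toLowerCase();
      if (host.startsWith("www.")) {
        host = host.substring(4);
      }
      String path = u.getPath().isEmpty() ? "/" : u.getPath();
      return u.getProtocol() + "://" + host + path;
    }

    public static void main(String[] args) throws Exception {
      // Both lines print http://example.com/
      System.out.println(keyFor("http://example.com"));
      System.out.println(keyFor("http://www.example.com"));
    }
  }
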
To solve this, I think a near-duplicate detector would need to be
used when collapsing similar URLs. If you did this only when two
URLs appear to refer to the same page, I think it would be OK, as
that's the most common case. And since it would only run in that
narrow case, it could afford to be somewhat computationally
expensive (e.g. winnowing, a la
http://theory.stanford.edu/~aiken/publications/papers/sigmod03.pdf).
Interesting paper, thanks for the pointer - I always wondered what
criteria to use to reduce the number of shingles, and this winnowing
approach is a simple enough recipe for creating page signatures. I may be
tempted to implement it ;)
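In case it's useful, here is roughly what the scheme from the paper
boils down to, as a Java sketch. The Winnow class name, the k/w
parameters, and the use of String.hashCode() for the k-gram hash are
all illustrative assumptions, not anything that exists in Nutch:

  import java.util.LinkedHashSet;
  import java.util.Set;

  // Minimal sketch of winnowing (Schleimer/Wilkerson/Aiken, SIGMOD '03):
  // hash every k-gram, slide a window of w hashes, and keep the minimum
  // hash of each window as part of the page's fingerprint.
  public class Winnow {

    public static Set<Integer> fingerprint(String text, int k, int w) {
      // Crude normalization: lower-case and drop non-alphanumerics.
      String s = text.toLowerCase().replaceAll("[^a-z0-9]", "");
      Set<Integer> selected = new LinkedHashSet<Integer>();
      if (s.length() < k) {
        return selected;
      }

      // Hash every k-gram (the paper uses a rolling Karp-Rabin hash to
      // make this cheap; hashCode() just keeps the sketch short).
      int n = s.length() - k + 1;
      int[] hashes = new int[n];
      for (int i = 0; i < n; i++) {
        hashes[i] = s.substring(i, i + k).hashCode();
      }

      if (n < w) {
        // Very short page: just keep the overall minimum hash.
        int min = hashes[0];
        for (int h : hashes) {
          min = Math.min(min, h);
        }
        selected.add(min);
        return selected;
      }

      // Slide the window; take the rightmost minimum in each window so
      // that adjacent windows tend to select the same position.
      int lastPos = -1;
      for (int start = 0; start + w <= n; start++) {
        int minPos = start;
        for (int i = start; i < start + w; i++) {
          if (hashes[i] <= hashes[minPos]) {
            minPos = i;
          }
        }
        if (minPos != lastPos) {
          selected.add(hashes[minPos]);
          lastPos = minPos;
        }
      }
      return selected;
    }
  }

Two pages could then be compared by the overlap of their fingerprint
sets (e.g. a Jaccard ratio) and collapsed when that overlap is above
some threshold.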
I took a quick scan through the public code and didn't find anything
that looked appropriate for this. One more potentially useful paper
is here:
http://theory.stanford.edu/~aiken/publications/papers/sigmod03.pdf
There is a Signature implementation in Nutch that allows for small
differences in text (TextProfileSignature), but I guess it's not
sufficient in your case?
I thought we were using that, but I just double-checked and we're
not. So I'll try to switch over to that for the next crawl/index, to
see how well it works.
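If I'm remembering the Nutch config right, that switch should just be
a matter of overriding db.signature.class in nutch-site.xml, along
these lines (worth double-checking against nutch-default.xml):

  <property>
    <name>db.signature.class</name>
    <value>org.apache.nutch.crawl.TextProfileSignature</value>
  </property>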
Thanks,
-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"If you can't find it, you can't fix it"