Kashif:
 
Unfortunately, it seems that a subtle change in the page (in this case the bold title at the top) results in a duplicate page, since the MD5s will no longer match. You can end up with 50-100 pages of such junk.
 
This goes back to what I brought up a few weeks ago -- we may need to incorporate some form of Bayesian analysis to detect similar pages.
This is a long project that requires some thought -- it took Google a few years to get something like this in place.
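To make the problem concrete: exact MD5 matching fails on near-duplicates, but a standard similarity technique such as word shingling catches them. This is only a minimal sketch of the idea (not the Bayesian approach mentioned above, and not Nutch code); the page texts are abbreviated from the example later in this thread.

```python
import hashlib

def shingles(text, k=3):
    """Set of k-word shingles of the page text."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity of two shingle sets (1.0 = identical)."""
    return len(a & b) / len(a | b) if a or b else 0.0

body = ("FinancialConsultant.info FicoScore.info BankruptcyProtection.info "
        "TaxLoan.info Preapproval.info SmallBusinessLoans.info")
page1 = "Tax Litigation Guide " + body   # two pages differing only in title
page2 = "Tax Lien Guide " + body

# An exact MD5 changes completely on a one-word title edit...
exact_match = hashlib.md5(page1.encode()).digest() == hashlib.md5(page2.encode()).digest()
# ...while shingle overlap stays high, flagging a near-duplicate.
similarity = jaccard(shingles(page1), shingles(page2))
```

In practice the shingle sets would be hashed and sampled (min-hashing) so the comparison scales, but the threshold and shingle size here are illustrative assumptions.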
 
A simpler way to clean up is to run a periodic check on the WebDB.
 
There are a few ways to figure out which pages might be "spammy" or contain link farms.
 
1 - Look at the ratio of inlinks to outlinks (few inlinks with many outlinks is a dead giveaway of a link farm).
2 - Remove links from a page to pages in the same domain (not the problem here, but such links do exist), then examine the remaining links.
3 - Get your DNS server involved: if multiple links all resolve to the same IP address, and the content sizes of the pages are within 1% or X bytes of each other, the pages are likely similar.
4 - Link unrolling: if a page's inlinks are a subset of its outlinks (or vice versa), remove the forward links (or the inlinks, respectively).
5 - And lastly, a low content-to-HTML ratio (SpamAssassin has functions to give you this information). Little content with heavy HTML markup also indicates a page that is less useful. CAUTION: a lot of home/index pages also fit this category (e.g. msn.com, cnn.com), so this should not be the main determining factor.
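Heuristics 1 and 5 above are easy to sketch. This is a standalone illustration, not SpamAssassin's actual API; the thresholds in the comments are assumptions, not tested cutoffs.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects only the visible text of a page."""
    def __init__(self):
        super().__init__()
        self.parts = []
    def handle_data(self, data):
        self.parts.append(data)

def content_to_html_ratio(html):
    """Heuristic 5: visible-text bytes divided by total page bytes."""
    p = TextExtractor()
    p.feed(html)
    return len("".join(p.parts)) / max(len(html), 1)

def outlink_inlink_ratio(inlinks, outlinks):
    """Heuristic 1: few inlinks with many outlinks suggests a link farm."""
    return outlinks / max(inlinks, 1)

# A page that is almost entirely links scores low on content-to-HTML...
farm = "<html><body>" + "<a href='http://x.info/'>x.info</a>" * 50 + "</body></html>"
ratio = content_to_html_ratio(farm)
# ...and a page with 2 inlinks and 100 outlinks scores high on heuristic 1.
score = outlink_inlink_ratio(2, 100)
```

As the CAUTION above notes, portal home pages also score low on content-to-HTML, so these signals should be combined rather than used alone.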
  
 


From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Kashif Khadim
Sent: Monday, January 10, 2005 6:07 PM
To: [EMAIL PROTECTED]
Subject: RE: [Nutch-dev] Exploding number links due to bad sites

Duplicate pages are a big problem, and much of it is spam. As my index grows, these duplicate pages grow with it, and I am tired of seeing the same content spammed across hundreds of sites. They show up at the top of the search results and fill page after page of results with duplicate content.
 
One example is given below; I still have not been able to remove it.
 

Tax Litigation Guide; Discuss Experts, Taxes, Judgment,...
... FinancialConsultant.info FicoScore.info BankruptcyProtection.info BankruptcyPrevention.info TaxLoan.info Preapproval.info SmallBusinessLoans.info CommercialCollections.info CapitalGainsTax.info InterestOnlyLoans.info TaxAccountants ...
http://taxlitigation.info/

Tax Lien Guide; Discuss Taxes, Liens, Lien, Taxation, I...
... FinancialConsultant.info FicoScore.info BankruptcyProtection.info BankruptcyPrevention.info TaxLoan.info Preapproval.info SmallBusinessLoans.info CommercialCollections.info CapitalGainsTax.info InterestOnlyLoans.info TaxAccountants ...
http://taxlien.info/
 
Thanks.
Kashif



Chirag Chaman <[EMAIL PROTECTED]> wrote:
Doug:

Well, sites that point to the same content are mirrors perhaps 90% of the time.

An example is www.cricinfo.org (mirrored across 8 countries): while the main
page is always the same, the links you traverse into may be very different
(look at it while a cricket match is in session and all the mirror sites
carry the same info). I also remember reading about a similar case where, a
few years back, IBM had near-identical home pages for the UK and the US, but
the pages they linked to were different (different products for different markets).

Another BIG problem is spammers, who add virtual URLs across a few sites to
increase their rank.
Here's an example:

Site A and Site C are virtual domains of the same site.
Site B and Site D are virtual domains of the same site.

Site A has a few links pointing to Site B, which in turn points to C, and so on:
Site A --> Site B --> Site C --> Site D --> Site A

Each time a visitor hits any of the sites, the site generates links that
point to a new virtual domain.
If each of A, B, C, and D has 10 virtual domains to choose from, you could
get stuck in this loop hundreds if not thousands of times.
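One way to escape such a loop is to refuse to fetch a page whose content hash has already been seen, regardless of hostname. The toy crawl below models the A->B->C->D ring with fresh virtual domains on every visit; the site names and content are made up for illustration, and this is not Nutch's crawler.

```python
import hashlib

# Toy web: each virtual domain serves the same content as its siblings and
# links to a fresh virtual domain of the next site in the ring.
def content_of(url):
    site = url[0]          # 'A', 'B', 'C', or 'D'
    return f"boilerplate for site {site}"

def links_of(url, hop):
    nxt = {"A": "B", "B": "C", "C": "D", "D": "A"}[url[0]]
    return [f"{nxt}{hop}.example"]   # a new virtual domain every visit

def crawl(start, max_pages=1000):
    seen_hashes, fetched = set(), []
    frontier, hop = [start], 0
    while frontier and len(fetched) < max_pages:
        url = frontier.pop()
        digest = hashlib.md5(content_of(url).encode()).hexdigest()
        if digest in seen_hashes:    # same content under a new name: skip it
            continue
        seen_hashes.add(digest)
        fetched.append(url)
        hop += 1
        frontier.extend(links_of(url, hop))
    return fetched
```

Without the hash check, the ring spins through virtual domains until the page budget runs out; with it, the crawl stops after the four distinct pages.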

I guess the final implementation decision depends on how you plan to crawl
and use the search engine -- if it's to crawl and index a known corpus (an
internal LAN), then Doug's way is definitely the way to go, as in that case
you want the two sites to show up independently.

But if an Internet crawl is required, I think that means a little more
advanced work.

I've been looking at this issue over the weekend, and I think there is a
feature we can add here that gives the user some options. We did some work on
the SegmentMergeTool while it was removing duplicate content MD5s -- we added
an extension that keeps a log explaining the reason code for why a page
was/is/will be deleted: matching URL or matching MD5.

(This function could easily be run against the WebDB instead of another
segment, either between successive calls to generate fetchlists or while
generating the fetchlist.)

I think with Doug's approach and this extension we can solve both the issues
and give the end user options -- I just can't think clearly where this will
go right now, but I know the combined solution should do the work.

Can anyone make sense of my rambling and put it to good use?







-----Original Message-----
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of Doug
Cutting
Sent: Monday, January 10, 2005 2:52 PM
To: [EMAIL PROTECTED]
Subject: Re: [Nutch-dev] Exploding number links due to bad sites

Chirag Chaman wrote:
> - Do a breadth-first crawl (which Nutch does)
> - For each page fetched, generate an MD5 hash
> - IF the MD5 hash is already in the "WebDB":
>     do not add the data to the segment
>     mark the link for deletion
[ ... ]
> The above will AT MOST add one bad page; all the others will be
> ignored. This does require that the content hashes be available in
> memory, so you may have to partition them based on the MD5 for faster
> access.

Nutch permits multiple URLs with the same MD5 in the database. In many
cases this is a feature. It allows Nutch to figure out which is the most
popular URL. It also allows it to deal with a site that changes its urls
but not all of its contents. Later, when we perform duplicate elimination,
we keep the highest scoring, most-recent version of pages.
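That duplicate-elimination policy (keep the highest-scoring, most-recent version per content hash) can be sketched as follows. The tuple layout and field order here are illustrative, not Nutch's actual page schema.

```python
def dedup(pages):
    """Keep, per content MD5, the highest-scoring page, breaking score
    ties by recency (larger fetch time wins)."""
    best = {}
    for page in pages:  # page: (url, md5, score, fetch_time)
        url, md5, score, fetch_time = page
        if md5 not in best or (score, fetch_time) > (best[md5][2], best[md5][3]):
            best[md5] = page
    return list(best.values())

pages = [
    ("http://taxlien.info/", "abc", 0.9, 100),
    ("http://taxlitigation.info/", "abc", 0.9, 200),  # same content, newer
    ("http://example.org/", "def", 0.5, 150),
]
kept = dedup(pages)
```

Of the two pages sharing hash "abc", only the more recent one survives; the page with a unique hash is untouched.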

But I think a related approach might work. What if we added to the fetchlist
the MD5 of each page that points to a URL? (This is efficient, since we
already include anchor text in fetchlists, and the source MD5 is stored along
with the link data.) Then, when we fetch a page, if a page which points to it
has the same MD5, we ignore the page. In other words, we look for tight
loops which can, in the presence of relative URLs, expand into exponential
numbers of unique URLs.
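The check proposed above reduces to a one-line comparison at fetch time. The function and variable names below are illustrative, not Nutch's API; the only assumption carried over from the proposal is that each fetchlist entry ships with the MD5s of the pages linking to it.

```python
import hashlib

def md5_hex(content):
    return hashlib.md5(content.encode()).hexdigest()

def should_fetch_target(target_content, source_md5s):
    """If the fetched page hashes to the same value as one of the pages
    that link to it, treat it as a tight identical-content loop and skip."""
    return md5_hex(target_content) not in source_md5s

boiler = "identical boilerplate served under two hostnames"
sources = {md5_hex(boiler)}       # source MD5s shipped with the fetchlist entry
loop_page_kept = should_fetch_target(boiler, sources)        # loop: skipped
fresh_page_kept = should_fetch_target("a genuinely new page", sources)
```

A page whose content matches its referrer is dropped; anything with different content passes through, which is exactly the question posed below about legitimate identical pages that link to one another.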

So can anyone think of legitimate pages with different URLs but identical
content that link to one another? Would there be harm in ignoring the
target of such links?

Doug


-------------------------------------------------------
The SF.Net email is sponsored by: Beat the post-holiday blues Get a FREE
limited edition SourceForge.net t-shirt from ThinkGeek.
It's fun and FREE -- well, almost....http://www.thinkgeek.com/sfshirt
_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers





