In a previous post I made "to the wrong list", I asked this question which 
Kir politely answered:

How does one clean things up? Here's my example of real data:

ASPseek database statistics

   Status    Expired      Total
  -----------------------------
        0        211        211 Not indexed yet
      200          0       4738 OK
      301          0        129 Moved Permanently
      302          0        311 Moved Temporarily
      403          0          5 Forbidden
      404          0       2902 Not found
  -----------------------------
    Total        211       8296

Kir's answer:

If you want to index not-indexed-yet URLs (status 0), use
index -s 0

OK, I can understand this, and it does indeed work for the reindexing. But 
now I have another question along these same lines. You'll notice that the 
URLs with a non-200 status add up to 3,558 of the 8,296 total, a bit under 
half. OK, so they don't take up much space, but....
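
As a quick check on that proportion, here is a minimal Python sketch that 
adds up the "Total" column from the statistics above. The counts are copied 
from the table; the projection to 4 million URLs simply assumes the same 
ratio would hold for the full index, which is only a guess on my part.

```python
# Counts copied from the "Total" column of the ASPseek statistics table.
totals = {0: 211, 200: 4738, 301: 129, 302: 311, 403: 5, 404: 2902}

all_urls = sum(totals.values())
non_200 = all_urls - totals[200]
share = non_200 / all_urls

print(f"{non_200} of {all_urls} URLs are non-200 ({share:.1%})")

# Assumption: the full index keeps the same non-200 ratio as this sample.
projected = round(4_000_000 * share)
print(f"projected non-200 URLs in a 4 million URL index: {projected}")
```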

Most likely all those 404 Not Found URLs (2,902 of them) will never be found 
because the site owners have removed the pages from their servers. These are 
all dead links. The way I see it, aspseek (index) will try to fetch them 
again when their reindex time is due. Why go through all this if these pages 
don't exist anyway? There is no sense in asking for something we know isn't 
there. That MUST take unnecessary resources.

So my question is: can I do this without fear of breaking aspseek?

index -C -s 404
index -C -s 403
index -C -s 301
index -C -s 302

And if I don't want to keep retrying the status 0 URLs (probably DNS 
timeouts, which I don't want to wait around for anyway):

index -C -s 0

That should leave me with only status 200 URLs.

If the above works, do I then need to run this:

index -X1
index -X2
index -H

and then, from a mysql prompt, do:

OPTIMIZE TABLE urlword;

Will this effectively remove all these URLs without breaking aspseek? Is 
the order of operations above correct?

My total index will be about 4 million URLs when done. If roughly 50% of 
them have a non-200 status, I can't see trying to reindex 2 million URLs 
that will never be fetched. I don't care if these non-200 URLs ever make it 
into the database.

Thanks a million for your help!
