I know that if you are a big user (several dedicated machines in a data
center with a fast connection...) you probably don't care about this: your
crawler will run over any website with 50-500 threads and the default
three retries, and the problem will solve itself. But can something be
done for the rest of us, please?
No, I don't think so. Some web designers put up a "URL redirector" as an obstacle to search engines. It is common in China. And you cannot
get the content of these websites at all.
Maybe I wasn't totally clear: with a 10-second timeout, the fetcher will skip over a bunch of pages on the same host. Any obstacles will be pretty much ignored, because these pages won't be fetched and the pages leading from them won't be fetched either. On a large scale, search engine traps or not, the fetcher will play rough and get over them in 3 runs (actually a bit more, since some pages will be fetched). This is of course only the case if you don't need 100% of the pages, just as many as you can fetch.
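
For reference, the timeout and retry behaviour above maps to a handful of crawler settings. The snippet below is only a sketch assuming the crawler is Apache Nutch (which is where the 10-second timeout and three-retry defaults would come from); the values are just examples to tune in nutch-site.xml:

    <configuration>
      <!-- per-request timeout, in milliseconds -->
      <property>
        <name>http.timeout</name>
        <value>10000</value>
      </property>
      <!-- how many fetch rounds a failing page is retried before it is given up on -->
      <property>
        <name>db.fetch.retry.max</name>
        <value>3</value>
      </property>
      <!-- number of fetcher threads -->
      <property>
        <name>fetcher.threads.fetch</name>
        <value>50</value>
      </property>
    </configuration>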

People who are technically able to set up search engine traps should be technically able to put up a robots.txt. Of course, with both sides not obeying the rules, it's a bit of a mess lately and everyone is paying the price.
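
For what it's worth, keeping a well-behaved crawler out of a trap area takes only a couple of lines of robots.txt; the paths below are made up, just to show the idea:

    # Hypothetical robots.txt for a site with a trap/redirector area
    User-agent: *
    Disallow: /trap/
    Disallow: /redirect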

I've encountered cases where spam was the issue and not search engine traps. There's this website that has mod_rewrite or something like that set up so that ANY RANDOM link you can type on his page is valid, and it will show you a bunch of unrelated random advertisements. This is a static page, by the way. Now, if I had 100 Mbps my fetcher would run over his website without blinking; being limited to 2 Mbps, the effect is noticeable. No matter how many times I ran the fetcher, the number of instructions left wasn't decreasing ;) Since I keep running into cases like this, instead of manually typing regexes to clean them off (which takes time) I'd strongly prefer an automated solution if possible.
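
Just to show what the manual workaround looks like, here is the kind of rule I mean, written for a Nutch-style regex-urlfilter.txt (the hostname is made up, and rules are applied top to bottom with the first match winning):

    # reject everything from the spammy catch-all host
    -^http://spam\.example\.com/
    # accept anything else
    +.

An automated solution would presumably have to notice the pattern itself, i.e. a host that keeps producing fresh, never-seen URLs faster than its fetched-page count grows, and then cap or drop that host.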

Regards,
EM

