I know that if you are a big user (several dedicated machines in a data
center with a fast connection...) you probably don't care about this: your
crawler will run over any website with 50-500 threads and the default
three retries, and the problem will solve itself. But can something be
done for the rest of us, please?
No, I don't think so. Some web designers put up a "URL redirector" as an obstacle to search engines. It is common in China. And you cannot
get the content of these websites at all.
Maybe I wasn't totally clear: with a 10-second timeout, the fetcher will skip over a bunch of pages on the same host. Any obstacles will be pretty much ignored, because these pages won't be fetched and the pages leading from them won't be fetched either. On a large scale, search engine traps or not, the fetcher will play rough and get over them in 3 runs (actually a bit more, since some pages will be fetched). This is of course only the case if you don't need 100% of the pages, just as many as you can fetch.
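
For reference, the timeout and retry behaviour above maps to a handful of crawler settings. The snippet below is only a sketch assuming the crawler is Apache Nutch (which is where the 10-second timeout and three-retry defaults would come from); the values are just examples to tune in nutch-site.xml:

    <configuration>
      <!-- per-request timeout, in milliseconds -->
      <property>
        <name>http.timeout</name>
        <value>10000</value>
      </property>
      <!-- how many fetch rounds a failing page is retried before it is given up on -->
      <property>
        <name>db.fetch.retry.max</name>
        <value>3</value>
      </property>
      <!-- number of fetcher threads -->
      <property>
        <name>fetcher.threads.fetch</name>
        <value>50</value>
      </property>
    </configuration>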

People who are technically able to set up search engine traps should be technically able to put up a robots.txt. Of course, with both sides not obeying the rules, it's a bit of a mess lately and everyone is paying the price.
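
For what it's worth, keeping a well-behaved crawler out of a trap area takes only a couple of lines of robots.txt; the paths below are made up, just to show the idea:

    # Hypothetical robots.txt for a site with a trap/redirector area
    User-agent: *
    Disallow: /trap/
    Disallow: /redirect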

I've encountered cases where spam was the issue and not search engine traps. There's this website that has mod_rewrite or something like that set up so that ANY RANDOM link you can type on his page is valid, and it will show you a bunch of unrelated random advertisements. This is a static page, by the way. Now, if I had 100 Mbps my fetcher would run over his website without blinking; being limited to 2 Mbps, the effect is noticeable. No matter how many times I ran the fetcher, the number of instructions left wasn't decreasing ;) Since I keep running into cases like this, instead of manually typing regexes to clean them off (which takes time) I'd strongly prefer an automated solution if possible.
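
Just to show what the manual workaround looks like, here is the kind of rule I mean, written for a Nutch-style regex-urlfilter.txt (the hostname is made up, and rules are applied top to bottom with the first match winning):

    # reject everything from the spammy catch-all host
    -^http://spam\.example\.com/
    # accept anything else
    +.

An automated solution would presumably have to notice the pattern itself, i.e. a host that keeps producing fresh, never-seen URLs faster than its fetched-page count grows, and then cap or drop that host.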

Regards,
EM

