Hi,

I'm new to this list.

I have some questions about Nutch to see if it suits my needs.

First of all, I have a database that contains 50,000 URLs classified by category and sub-category, and I want to fully crawl the 50,000 sites behind those URLs. Providing the URLs to Nutch is not a problem.

I want to use the category information in searches to restrict results. For example, a user could search for all sites in the "pet" category that contain the word "cat". Is this possible with Nutch? I have seen that plugins can be added; could it be done with a plugin?


Second part: hardware requirements.

Let's say that each website has at most 1,000 pages, so I must store the index for up to 50,000,000 pages. How much disk space do I need?
I have seen that Mozdex runs on 10 servers for 100,000,000 pages, but I don't know how many requests it serves. Is there any way to reduce the number of servers?
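
To make the numbers concrete, here is a rough sketch of the arithmetic I have in mind. The site and page counts come from above; the per-page index size is only a placeholder I made up, not a measured Nutch figure, so please correct it if you know a realistic value.

```python
# Rough capacity arithmetic for the crawl described above.
SITES = 50_000               # URLs in my database
MAX_PAGES_PER_SITE = 1_000   # assumed upper bound per site

# Hypothetical placeholder, NOT a measured Nutch figure.
ASSUMED_INDEX_BYTES_PER_PAGE = 10 * 1024


def total_pages(sites, pages_per_site):
    """Upper bound on the number of pages to index."""
    return sites * pages_per_site


def index_size_gib(pages, bytes_per_page):
    """Index size in GiB for a given average index cost per page."""
    return pages * bytes_per_page / 1024**3


if __name__ == "__main__":
    pages = total_pages(SITES, MAX_PAGES_PER_SITE)
    print("pages to index:", pages)
    print("estimated index size (GiB):",
          round(index_size_gib(pages, ASSUMED_INDEX_BYTES_PER_PAGE), 1))
```

With a real bytes-per-page figure in place of the placeholder, the same calculation should give the disk space I need to plan for.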


Thanks for your replies.

PS: Sorry for my poor English.
