Hi,
I'm new to this list.
I have some questions about Nutch to see if it suits my needs.
First of all, I have a database containing 50,000 URLs classified by categories and sub-categories, and I wish to fully crawl the 50,000 sites behind those URLs. That part is no problem: I can feed the URLs to Nutch.
I want to use the category information in searches to restrict results; for example, a user could search for all sites that contain "cat" within the "pet" category. Is this possible with Nutch? I've seen that plugins can be added — perhaps this can be done with a plugin?
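To illustrate what I mean, here is a toy sketch (this is not Nutch's actual API — the data and function are hypothetical) of restricting full-text matches by a per-URL category field, which is the behavior I would hope a plugin could provide:

```python
# Toy model, NOT Nutch code: each URL carries a category alongside
# its page text, and a search matches both the text term and the
# category restriction.

# Hypothetical data: URL -> (category, page text)
pages = {
    "http://example-a.com": ("pet", "adopt a cat or dog today"),
    "http://example-b.com": ("cars", "cat litter is not sold here"),
    "http://example-c.com": ("pet", "fish and bird supplies"),
}

def search(term, category):
    """Return URLs whose text contains `term` and whose category matches."""
    return [url for url, (cat, text) in pages.items()
            if term in text and cat == category]

print(search("cat", "pet"))  # only the pet-category page matches
```

In Nutch terms, I imagine this would mean indexing the category as an extra field and adding it as a required clause in the query.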
Second part: hardware requirements.
Let's say each website has a maximum of 1,000 pages; then I must store an index for 50,000,000 pages. How much disk storage do I need?
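To give an idea of the scale, here is my rough back-of-envelope calculation — the per-page byte figures are my own assumptions, not measured Nutch numbers, so I would need to calibrate them on a sample crawl:

```python
# Back-of-envelope index sizing under assumed per-page costs.
sites = 50_000
pages_per_site = 1_000           # stated maximum per site
total_pages = sites * pages_per_site

index_bytes_per_page = 5_000     # assumption: ~5 KB of index data per page
segment_bytes_per_page = 25_000  # assumption: ~25 KB of stored/parsed content per page

index_gb = total_pages * index_bytes_per_page / 1e9
segment_gb = total_pages * segment_bytes_per_page / 1e9

print(total_pages)          # 50000000
print(index_gb, "GB index")
print(segment_gb, "GB stored content")
```

With these assumed figures that would be on the order of 250 GB of index plus roughly 1.25 TB of stored content, but the real answer depends entirely on the per-page costs.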
I've seen that Mozdex runs on 10 servers for 100,000,000 pages, but I don't know how many requests it serves. Is there anything that can be done to reduce the number of servers?
Thanks for your replies.
PS: sorry for my very bad English.
