I would like to build a search engine based on a handful of hand-picked sites from a specific domain. I have a few questions.
How many documents can I fit on a single-server setup (2-CPU Xeon)? Disk space aside, approximately how many documents can a single node hold while keeping respectable search performance?

My idea is to pick a handful of sites that I judge for quality and re-index them on a regular basis, maybe once a month, adding new sites over time. Does this sound feasible with Nutch, and which method would be best suited to this type of application?

I set up Nutch and crawled a very small sample using method 1 in the tutorial ("Intranet crawl"), but I was unable to get the whole-web crawl to work. What is the -dmozfile flag? I don't want to base this on DMOZ. If anyone could point me to documentation or a tutorial that better explains whole-web crawling, I would appreciate it. Thanks a lot.
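For reference, here is my rough understanding of the whole-web crawl steps from the tutorial, seeded from my own URL list instead of DMOZ. The command names are from the Nutch tutorial, but the directory layout (urls/, crawl/) is just my guess, so please correct me if this is the wrong approach:

```shell
# Sketch of a whole-web style crawl seeded from a hand-picked URL list
# rather than the DMOZ file. Directory names are placeholders.

# 1. Put the hand-picked seed URLs in a flat file, one per line.
mkdir -p urls
echo "http://example.com/" > urls/seed.txt

# (I assume the url filter config, e.g. regex-urlfilter.txt, is what
# restricts the crawl to only the chosen sites.)

# 2. Inject the seeds into a fresh crawl database.
bin/nutch inject crawl/crawldb urls

# 3. One fetch cycle: generate a segment, fetch it, update the db.
#    Re-running this cycle monthly would be my re-index step.
bin/nutch generate crawl/crawldb crawl/segments
segment=$(ls -d crawl/segments/* | tail -1)
bin/nutch fetch $segment
bin/nutch updatedb crawl/crawldb $segment

# 4. Build the link database and index the fetched segments.
bin/nutch invertlinks crawl/linkdb -dir crawl/segments
bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*
```

Adding a new site over time would then, I assume, just mean appending its URL to the seed file and the url filter, then injecting again.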
