I'll give this a shot, although I'm still kinda new with Nutch, so take my suggestions with a grain of salt.
First of all, you're not going to move on to site2 until site1 is finished crawling, so depending on your depth and the size of the site, it might take forever just to crawl one site.

What I have had success doing is setting up a crawl for just one site at a time. That way I can have a URL filter tailored just for that site; it's much easier to keep straight. I won't go into too much detail because I wouldn't want to provide incorrect info to the list, but I have multiple instances of "nutch crawl" running at the same time, each with its own set of URL filters designed specifically for just one site. When the crawls are done, I combine the segments and index all the information. If you would like me to go into further detail, I can.

This would work well in your situation, as you could create 20 instances of Nutch and just kick off all 20 crawls at once (if your machine can handle it). It would also make each URL filter much easier to develop and test. You could even write a very simple script that kicks off all the crawls at once and then combines them into one index when they finish; there's a rough sketch at the end of this message. If you are interested, and someone who knows more than I do says this method is acceptable, I will explain it in a future mailing.

P.S. As far as which architecture ("Whole Web" or "Intranet"): I would stick with what you are doing until you understand it better, and then decide whether the added flexibility of whole-web crawling is a benefit to you.
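To make the "very simple script" idea a bit more concrete, here is the kind of thing I mean. Treat it as a rough, untested sketch: the directory names (seeds/, conf-site*/, crawls/, logs/) are made up for illustration, and the merge/index commands at the end depend on your Nutch version, so check "bin/nutch" usage before trusting any of it.

  #!/bin/sh
  # Sketch only. Each site gets its own conf dir holding a
  # crawl-urlfilter.txt tailored to that one site, plus a seed
  # list in seeds/<site>.txt -- all hypothetical names.

  SITES="site1 site2 site3"        # ...up to all 20 of your sites

  mkdir -p logs

  for s in $SITES; do
    # NUTCH_CONF_DIR points bin/nutch at that site's config (and so
    # its URL filters). If your bin/nutch script doesn't honour this
    # variable, a separate copy of Nutch per site does the same job.
    NUTCH_CONF_DIR=conf-$s \
      bin/nutch crawl seeds/$s.txt -dir crawls/$s -depth 5 \
      > logs/$s.log 2>&1 &
  done

  wait   # block here until every backgrounded crawl has finished

  # Combine the per-site segments into one place. This mergesegs form
  # is from 0.8-style Nutch -- check "bin/nutch mergesegs" usage for
  # your version.
  bin/nutch mergesegs crawls/all/segments crawls/*/segments/*

  # Finally, build one index over the merged segments. The index
  # command's arguments differ between Nutch versions, so check
  # bin/nutch usage rather than copying anything I'd type here.

The "&" plus the "wait" is what makes the crawls run in parallel; drop the "&" and they'll run one after another instead, which is kinder to a small machine.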
