Hi,

I have changed the protocol-http plugin so that Nutch reads already-crawled pages from the local file system instead of from the Internet. (I tried to use the FILE:// protocol, but it seemed to me that the interconnection information among pages was lost.) I have it working now, but it is very slow: the "fetch" command took 10 minutes on 400 pages, on a 4-CPU box with 4 threads. I am wondering whether this is normal, because at that rate reading 1 million pages would take roughly 417 hours per box, which is more than 17 days.
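For reference, the extrapolation can be checked with a few lines of Python. This is only a sketch of the arithmetic, assuming throughput stays constant at the rate measured on the 400-page run; the function and variable names are illustrative and are not part of Nutch.

```python
def projected_hours(total_pages, sample_pages, sample_minutes):
    """Scale a measured fetch time linearly to a larger page count."""
    total_minutes = total_pages / sample_pages * sample_minutes
    return total_minutes / 60


# Measured run: 400 pages in 10 minutes.
sample_pages = 400
sample_minutes = 10

seconds_per_page = sample_minutes * 60 / sample_pages
hours = projected_hours(1_000_000, sample_pages, sample_minutes)
days = hours / 24

print(f"{seconds_per_page:.2f} s/page, {hours:.1f} hours, {days:.1f} days")
```

The linear scaling is the only assumption here; if the per-page cost is dominated by a fixed delay rather than by disk reads, the real figure could differ.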
Any suggestions would be appreciated.

Zhen

_______________________________________________
Nutch-developers mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-developers
