Hello,

Looking at the HttpProtocol plugin code, I see some possible improvements, but I am not sure whether they are feasible:
1. The HttpProtocol plugin always performs GET requests. In my mind, a crawler designed to crawl the web (i.e. one that frequently updates its index and documents) should use the HTTP HEAD method to find out whether the requested URL has been modified since the last crawl. Such an implementation drastically reduces the bandwidth needed to update a set of documents: it only downloads the documents that have changed. (I now better understand why some messages on the list give the monthly bandwidth needed for a set of pages as a constant value.) I don't yet have enough Nutch knowledge to see the implications for index/segment management, because I imagine such a mechanism implies that we can:
* preserve documents that were previously fetched but not re-fetched because of a "Not Modified" HEAD response;
* delete a document that no longer exists (the Protocol plugin must be able to return a code to the Nutch core that distinguishes "document no longer exists" from "document not changed since last time").

2. I think HTTP pipelining could be a good way to improve performance too.

What do you think about it?

Thanks
Jerome

--
http://motrech.free.fr/ - motrech [home]
http://motrech.blogspot.com/ - motrech [blog]
http://fr.groups.yahoo.com/group/motrech - motrech [liste]
http://fr.groups.yahoo.com/group/frutch - frutch [liste]
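To make the first point concrete, here is a minimal sketch of how a protocol plugin could classify the result of a HEAD request so the core can tell "not modified" apart from "gone". The enum, the helper method, and its name are hypothetical illustrations, not Nutch's actual plugin API; it assumes the server's HTTP status code and `Last-Modified` header (as epoch milliseconds, 0 if absent) have already been read from the HEAD response.

```java
// Hypothetical sketch: map a HEAD response to a fetch decision.
// This is NOT the real Nutch Protocol interface, just an illustration
// of the three outcomes the email argues the core would need.
public class HeadStatus {
    enum FetchDecision {
        NOT_MODIFIED, // keep the previously fetched copy
        GONE,         // document no longer exists: delete it from the index
        REFETCH       // changed (or unknown): issue a full GET
    }

    // status:       HTTP status code of the HEAD response
    // lastModified: Last-Modified header in epoch millis, 0 if absent
    // lastFetch:    when this URL was last fetched, epoch millis
    static FetchDecision classify(int status, long lastModified, long lastFetch) {
        if (status == 404 || status == 410) {
            return FetchDecision.GONE;
        }
        if (status == 304) { // server honoured a conditional request
            return FetchDecision.NOT_MODIFIED;
        }
        if (status == 200 && lastModified != 0 && lastModified <= lastFetch) {
            return FetchDecision.NOT_MODIFIED; // header says unchanged
        }
        return FetchDecision.REFETCH; // changed, or not enough info to skip
    }

    public static void main(String[] args) {
        System.out.println(classify(404, 0L, 0L));       // GONE
        System.out.println(classify(200, 1000L, 2000L)); // NOT_MODIFIED
        System.out.println(classify(200, 3000L, 2000L)); // REFETCH
    }
}
```

Note that many servers also support conditional GET (`If-Modified-Since`, answered with `304 Not Modified`), which gives the same bandwidth saving in a single round trip instead of HEAD-then-GET.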
