Doug: replying to email sent to [email protected] sets To to [EMAIL PROTECTED]
Jérôme:

One can provide If-Modified-Since information in GET requests, too. I think that's preferable to HEAD, because with HEAD one would first have to perform a HEAD request, and then another GET for each changed page. With a conditional GET, a single request is all that's needed, as long as the If-Modified-Since request header is provided.

Otis

--- Jérôme Charron <[EMAIL PROTECTED]> wrote:

> Hello,
>
> Looking at the HttpProtocol plugin code, I saw some ways of
> improvements, but not sure they are feasible:
>
> 1. The HttpProtocol plugin always performs some GET methods. In my
> mind, a crawler designed to crawl to web (ie that will frequently
> update its index and documents) need to use the http HEAD method in
> order to know if the requested URL has been modified since the last
> crawl. Such implementation drastically reduce the needed band-width
> needed to update a set of documents: It only downloads the changed
> documents (I better understand why some messages on the list express
> the monthly needed band-width for a set of page as a constant value).
> I don't have yet enough Nutch knowledge to see what are the
> implications on the index/segments management, because I imagine that
> such a mechanism implies that we can:
> * preserve the previously fetched documents but not re-fetched due to
> a "Not-Changed" HEAD response.
> * delete a document that no more exist (Protocol plugin must be able
> to return to the nutch core a return code that distinguish between
> "document no more exist" and "document not changed since last time").
>
> 2. I think the Http Pipelining could be a good way of performances
> improvements too. What do you think about it?
>
> Thanks
>
> Jerome
>
> --
> http://motrech.free.fr/ - motrech [home]
> http://motrech.blogspot.com/ - motrech [blog]
> http://fr.groups.yahoo.com/group/motrech - motrech [liste]
> http://fr.groups.yahoo.com/group/frutch - frutch [liste]
>
> _______________________________________________
> Nutch-developers mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/nutch-developers
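
For illustration, the conditional GET suggested in the reply could be sketched roughly as follows. This is only a minimal sketch using the JDK's `HttpURLConnection`, not the HttpProtocol plugin's actual code: the class `ConditionalGet`, the `Outcome` enum and both method names are made up for this example. It also assumes that 404 and 410 are the responses that should count as "document no more exist", which is a policy choice and not something settled in this thread.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

// Hypothetical helper, not part of Nutch: one conditional GET per URL.
class ConditionalGet {

    // The cases distinguished in the quoted message, plus a catch-all.
    enum Outcome { NOT_MODIFIED, MODIFIED, GONE, OTHER }

    // Maps an HTTP status code to an outcome.
    // Assumption: 404 and 410 both mean the document no longer exists.
    static Outcome classify(int status) {
        switch (status) {
            case HttpURLConnection.HTTP_NOT_MODIFIED: // 304: keep the stored copy
                return Outcome.NOT_MODIFIED;
            case HttpURLConnection.HTTP_OK:           // 200: new content follows
                return Outcome.MODIFIED;
            case HttpURLConnection.HTTP_NOT_FOUND:    // 404
            case HttpURLConnection.HTTP_GONE:         // 410
                return Outcome.GONE;
            default:
                return Outcome.OTHER;
        }
    }

    // Sends a single GET carrying If-Modified-Since set to the time of the
    // last fetch (milliseconds since the epoch). The body is only read when
    // the server reports that the page has changed.
    static Outcome fetchIfModified(URL url, long lastFetchMillis) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        try {
            conn.setRequestMethod("GET");
            conn.setIfModifiedSince(lastFetchMillis);
            Outcome outcome = classify(conn.getResponseCode());
            if (outcome == Outcome.MODIFIED) {
                try (InputStream in = conn.getInputStream()) {
                    byte[] buf = new byte[8192];
                    while (in.read(buf) != -1) {
                        // A real fetcher would store these bytes.
                    }
                }
            }
            return outcome;
        } finally {
            conn.disconnect();
        }
    }
}
```

With this shape, an unchanged page costs one request and a 304 response with no body, while a changed page costs the same single request plus the download, so no separate HEAD round trip is needed.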
