POIRIER David wrote:
Yoav,
You are right. With the help of the "protocol-httpclient" plugin you
will be able to use cookies when crawling. There is one thing that you
need to watch out for, though (quoting Susam Pal): "protocol-httpclient does
this for a single fetch cycle".
To be honest I don't exactly know how to define a "fetch cycle". Based
on my experience it seems that every time the fetcher goes one level
deeper into a web site it starts a new cycle... or if it doesn't, I lose
the cookie. It might be because of something else, but I don't think so.
If anybody has the answer to that, please let Yoav and me know.
This is correct. It comes from the fact that Nutch doesn't store cookies
(that's yet another potential use for the planned HostDB functionality).
This means that in order to accept and use cookies:
* you have to use protocol-httpclient. There is no support for cookies
in protocol-http.
* your fetchlist needs to have more than one URL from the same host: the
first request will presumably set the cookies, if you are lucky. ;)
* cookies are accumulated and kept in memory for the duration of the
current crawl task.
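
To illustrate the points above, here is a rough sketch of the behaviour. It is
not Nutch code: it uses the JDK's java.net.CookieManager as a stand-in for the
in-memory cookie state that protocol-httpclient keeps, and the class name
FetchTaskCookies, the method runTask, the example.com URLs and the
"session=abc123" cookie are all made up for the example. Each call to runTask
plays the role of one task with its own fetchlist.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.CookieManager;
import java.net.CookiePolicy;
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class FetchTaskCookies {

    // Hypothetical stand-in for one fetch task. The cookie jar lives only in
    // memory and is discarded when the method returns (nothing is persisted).
    // Returns the Cookie header value that would be sent with each request.
    static List<String> runTask(List<String> urls) {
        CookieManager jar = new CookieManager(null, CookiePolicy.ACCEPT_ALL);
        List<String> cookiesSent = new ArrayList<>();
        try {
            for (String url : urls) {
                URI uri = URI.create(url);

                // Ask the jar which cookies apply to this request.
                Map<String, List<String>> request =
                        jar.get(uri, new HashMap<String, List<String>>());
                List<String> values =
                        request.getOrDefault("Cookie", Collections.<String>emptyList());
                cookiesSent.add(String.join("; ", values));

                // Simulated server response that sets a session cookie
                // (assumption: the server sets it on every response).
                Map<String, List<String>> response = new HashMap<>();
                response.put("Set-Cookie", Arrays.asList("session=abc123; Path=/"));
                jar.put(uri, response);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return cookiesSent;
    }

    public static void main(String[] args) {
        // Two URLs from the same host in one task: the first request sets the
        // cookie, the second one can send it back.
        System.out.println(runTask(Arrays.asList(
                "http://example.com/login", "http://example.com/page")));

        // A later task starts with an empty jar, so the cookie is gone.
        System.out.println(runTask(Arrays.asList("http://example.com/deeper")));
    }
}
```

In this sketch the first request of every task goes out without a cookie, which
matches what David observed when the fetcher moved one level deeper: a new
task means a new, empty cookie jar.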
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com