Doğacan Güney wrote:
On 6/7/07, Emmanuel JOKE <[EMAIL PROTECTED]> wrote:
Hi Guys,

I have several websites that set a session cookie and then allow the user to
surf the site.
I would like to crawl those sites, but I don't know whether Nutch knows how to
manage session cookies.
Could you confirm?

I'm completely lost among the different plugins that are used to crawl over the
HTTP protocol.
Is it lib-http, protocol-http, or protocol-httpclient?
What is the difference between them?

I would appreciate your view; it will help me implement cookie management
in Nutch.

I forgot to answer your question:)

If you only need to remember a session cookie during one round of fetching,
it is pretty simple. In lib-http, when you get a cookie, put it in a
Map (from hosts to strings); then, when you fetch the next URL from
the same host, get the cookie and add it to your request.
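
As a rough illustration of the Map idea above, here is a minimal Java sketch.
HostCookieStore and its methods are hypothetical names, not part of lib-http
or any Nutch API, and the sketch assumes a single cookie per host is enough
(a later Set-Cookie from the same host overwrites the earlier one).

```java
import java.net.URI;
import java.util.Locale;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper (not a Nutch class): remembers one cookie string per host
// for the lifetime of the object, i.e. one round of fetching.
class HostCookieStore {
    // Host name (lower-cased) -> "name=value" cookie pair.
    private final Map<String, String> cookiesByHost = new ConcurrentHashMap<>();

    // Call after receiving a response. setCookieHeader is the raw value of a
    // Set-Cookie response header, or null if the response had none.
    void remember(URI url, String setCookieHeader) {
        String host = url.getHost();
        if (host == null || setCookieHeader == null) {
            return;
        }
        // Keep only the leading name=value pair; attributes such as Path or
        // Expires are dropped in this simplified sketch.
        int semi = setCookieHeader.indexOf(';');
        String pair = (semi >= 0 ? setCookieHeader.substring(0, semi) : setCookieHeader).trim();
        if (!pair.isEmpty()) {
            cookiesByHost.put(host.toLowerCase(Locale.ROOT), pair);
        }
    }

    // Call before sending a request. Returns the value to place in the Cookie
    // request header, or null if no cookie is known for this host.
    String cookieFor(URI url) {
        String host = url.getHost();
        return host == null ? null : cookiesByHost.get(host.toLowerCase(Locale.ROOT));
    }
}
```

The fetching code would call remember() after each response and cookieFor()
before each request to the same host. Because the map lives only in memory,
the cookies are lost when the fetch round ends.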

If you want to remember cookies across fetcher runs, well... I am not sure
how to do it :) Perhaps you can write an extra job that adds the
cookie to every datum from that host, then pick it up in the fetcher. Or
perhaps someone has a better idea :)

Actually, if you use protocol-httpclient, it handles cookies properly without any additional configuration.

However, the cookies are not stored anywhere, so they will be valid only for the duration of a single fetch.

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
