[
https://issues.apache.org/jira/browse/NUTCH-2363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15892562#comment-15892562
]
Sebastian Nagel commented on NUTCH-2363:
----------------------------------------
Hi Markus,
I'm a little bit concerned (and agree with Julien) about the complexity
introduced to handle cookies. But I know that it may be the only way to crawl
some sites. Nutch does not provide a way to persist per-domain information such
as cookies, and at the protocol level cookies are persisted only for a single
fetch task. Some remarks about the patch:
* are you sure that copying arbitrary metadata from a link or redirect source
to the target CrawlDatum can never do harm? Wouldn't it be better to use a
restricted and configurable set, or do this only for the "Cookie" metadata key?
* handleCookies(datum, content, queueId) is only called from 2 places: success
and redirect (no need for 404s, exceptions, etc.). Maybe call it directly
there: it's one more line of code, but one indirection less and a shorter
argument list for output(...), which makes it easier to understand what's
happening.
* CookieScoringFilter could be simplified by inheriting from
AbstractScoringFilter
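
To illustrate the first remark, here is a minimal sketch of a restricted,
configurable key set. It is not taken from the patch and does not use the Nutch
API: the class name MetadataAllowList, the comma-separated key list and the use
of plain java.util.Map in place of CrawlDatum metadata are all assumptions made
for illustration only.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

/**
 * Hypothetical helper (not part of the patch or of Nutch): copies only an
 * explicitly allowed set of metadata keys from a source to a target.
 */
final class MetadataAllowList {

  private final Set<String> allowedKeys;

  /** @param commaSeparatedKeys e.g. "Cookie", as it might be read from a config property */
  MetadataAllowList(String commaSeparatedKeys) {
    Set<String> keys = new LinkedHashSet<>();
    for (String key : commaSeparatedKeys.split(",")) {
      String trimmed = key.trim();
      if (!trimmed.isEmpty()) {
        keys.add(trimmed);
      }
    }
    this.allowedKeys = keys;
  }

  /** Copies allowed keys present in source into target; returns how many were copied. */
  int copyAllowed(Map<String, String> source, Map<String, String> target) {
    int copied = 0;
    for (String key : allowedKeys) {
      String value = source.get(key);
      if (value != null) {
        target.put(key, value);
        copied++;
      }
    }
    return copied;
  }

  public static void main(String[] args) {
    Map<String, String> source = new HashMap<>();
    source.put("Cookie", "JSESSIONID=abc");
    source.put("some.other.key", "should not be passed on");

    Map<String, String> target = new HashMap<>();
    new MetadataAllowList("Cookie").copyAllowed(source, target);
    System.out.println(target.keySet());
  }
}
```

With only "Cookie" in the list, any other key on the source is left behind, so
nothing unexpected leaks into the target datum.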
> Fetcher support for reading and setting cookies
> -----------------------------------------------
>
> Key: NUTCH-2363
> URL: https://issues.apache.org/jira/browse/NUTCH-2363
> Project: Nutch
> Issue Type: New Feature
> Components: fetcher
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Fix For: 1.13
>
> Attachments: NUTCH-2363.patch
>
>
> Patch adds basic support for cookies in the fetcher, and a scoring plugin
> that passes cookies to its outlinks, within the domain. Sub-domain or
> path-based passing is not supported.
> This is useful if you want to maintain sessions or need to get around a
> cookie wall.