[
https://issues.apache.org/jira/browse/NUTCH-685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14102092#comment-14102092
]
Sebastian Nagel commented on NUTCH-685:
---------------------------------------
Confirmed (for 1.x): content-level redirects (a.k.a. meta refresh) do not result
in a redirect status in CrawlDb (db_redir_perm or db_redir_temp).
In current trunk/1.x, they are not even recorded when Fetcher is parsing
(fetcher.parse==true):
* setting the status "redir_perm" was introduced with r492525 in Fetcher.java:
{code}
case ProtocolStatus.SUCCESS:                // got a page
  pstatus = output(url, datum, content, status,
      CrawlDatum.STATUS_FETCH_SUCCESS);
  if (pstatus != null && pstatus.isSuccess() &&
      pstatus.getMinorCode() == ParseStatus.SUCCESS_REDIRECT) {
    ...
    // record that we were redirected
    output(url, datum, null, status, CrawlDatum.STATUS_FETCH_REDIR_PERM);
{code}
* but lost with r593151 (since release 1.0 / NUTCH-547)
The problem is that pages containing a content-level redirect are indexed as
successfully fetched pages. But usually they contain only a note like "You will
be redirected in 10 seconds. If not click here." Possible solutions to exclude
those pages (for 1.x):
# mark meta-refresh redirects as such (the status is arguable):
** make Fetcher emit a CrawlDatum with redirect status again
** try this also for ParseOutput (if fetcher.parse==false): possible in
principle, but at the price of lost information. If we emit a redirect
CrawlDatum into crawl_parse, it overwrites the one from crawl_fetch. The status
is then redirect, but we lose the fetch time and metadata. The original fetch
datum is not available while parsing segments.
# skip and delete content-level redirects during indexing (similar to
robots=noindex)
** check for {{parseData.getStatus().getMinorCode() ==
ParseStatus.SUCCESS_REDIRECT}} in IndexerMapReduce
** additionally (it would do no harm), try to add the meta refresh to
CrawlDatum's metadata
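To illustrate the second option, here is a minimal, self-contained sketch of the check that could be applied during indexing. It is not the actual IndexerMapReduce code: the nested ParseStatus and ParseData types are simplified, hypothetical stand-ins for the Nutch classes of the same name (the numeric codes and the url field are assumptions), so that the example compiles without Nutch on the classpath.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: ParseStatus and ParseData below are simplified, hypothetical
// stand-ins for the Nutch classes of the same name.
class SkipMetaRefreshSketch {

    static final class ParseStatus {
        static final int SUCCESS = 1;             // major code (value assumed)
        static final int FAILED = 2;              // major code (value assumed)
        static final int SUCCESS_OK = 0;          // minor code (value assumed)
        static final int SUCCESS_REDIRECT = 100;  // minor code (value assumed)

        private final int majorCode;
        private final int minorCode;

        ParseStatus(int majorCode, int minorCode) {
            this.majorCode = majorCode;
            this.minorCode = minorCode;
        }

        boolean isSuccess() { return majorCode == SUCCESS; }
        int getMinorCode() { return minorCode; }
    }

    static final class ParseData {
        private final String url;          // hypothetical field, for the demo only
        private final ParseStatus status;

        ParseData(String url, ParseStatus status) {
            this.url = url;
            this.status = status;
        }

        String getUrl() { return url; }
        ParseStatus getStatus() { return status; }
    }

    /** True if the page parsed successfully but is only a content-level redirect. */
    static boolean isContentLevelRedirect(ParseData parseData) {
        ParseStatus status = parseData.getStatus();
        return status != null && status.isSuccess()
            && status.getMinorCode() == ParseStatus.SUCCESS_REDIRECT;
    }

    /** Drops content-level redirects; all other indexing checks are omitted here. */
    static List<String> urlsToIndex(List<ParseData> parsed) {
        List<String> keep = new ArrayList<>();
        for (ParseData parseData : parsed) {
            if (!isContentLevelRedirect(parseData)) {
                keep.add(parseData.getUrl());
            }
        }
        return keep;
    }

    public static void main(String[] args) {
        List<ParseData> parsed = new ArrayList<>();
        parsed.add(new ParseData("http://example.com/page",
            new ParseStatus(ParseStatus.SUCCESS, ParseStatus.SUCCESS_OK)));
        parsed.add(new ParseData("http://example.com/moved",
            new ParseStatus(ParseStatus.SUCCESS, ParseStatus.SUCCESS_REDIRECT)));
        System.out.println(urlsToIndex(parsed));
    }
}
```

In a real patch the same condition would sit next to the existing robots=noindex handling, and a skipped document would presumably also have to be deleted from the index if it had been indexed before.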
> Content-level redirect status lost in ParseSegment
> --------------------------------------------------
>
> Key: NUTCH-685
> URL: https://issues.apache.org/jira/browse/NUTCH-685
> Project: Nutch
> Issue Type: Bug
> Affects Versions: 1.0.0
> Reporter: Andrzej Bialecki
> Assignee: Julien Nioche
> Fix For: 1.10
>
>
> When Fetcher runs in parsing mode, content-level redirects (HTML meta tag
> "Refresh") are properly discovered and recorded in crawl_fetch under source
> URL and target URL. If Fetcher runs in non-parsing mode, and ParseSegment is
> run as a separate step, the content-level redirection data is used only to
> add the new (target) URL, but the status of the original URL is not reset to
> indicate a redirect. Consequently, status of the original URL will be
> different depending on the way you run Fetcher, whereas it should be the same.
--
This message was sent by Atlassian JIRA
(v6.2#6252)