[jira] [Commented] (NUTCH-685) Content-level redirect status lost in ParseSegment

Sebastian Nagel (JIRA) Tue, 19 Aug 2014 03:14:54 -0700

    [ 
https://issues.apache.org/jira/browse/NUTCH-685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14102092#comment-14102092
 ]


Sebastian Nagel commented on NUTCH-685:
---------------------------------------

Confirmed (for 1.x): content-level redirects (a.k.a. meta refresh) do not result 
in a redirect status in CrawlDb (db_redir_perm or db_redir_temp).

In current trunk/1.x, they are not even recorded if Fetcher is parsing 
(fetcher.parse==true):
* setting the status "redir_perm" was introduced with r492525 in Fetcher.java:
{code}
 case ProtocolStatus.SUCCESS:        // got a page
   pstatus = output(url, datum, content, status,
       CrawlDatum.STATUS_FETCH_SUCCESS);
   if (pstatus != null && pstatus.isSuccess() &&
       pstatus.getMinorCode() == ParseStatus.SUCCESS_REDIRECT) {
...
      // record that we were redirected
      output(url, datum, null, status, CrawlDatum.STATUS_FETCH_REDIR_PERM);
{code}
* but lost with r593151 (since release 1.0 / NUTCH-547)

The problem is that pages containing a content-level redirect are indexed as 
successfully fetched pages. But usually they contain only a note like "You will 
be redirected in 10 seconds. If not click here." Possible solutions to exclude 
those pages (for 1.x):
# mark meta-refresh redirects as such (the status is arguable):
** have Fetcher emit a CrawlDatum with redirect status again
** try this also for ParseOutput (if fetcher.parse==false): possible in 
principle, but at the price of lost information. If we emit a redirect 
CrawlDatum into crawl_parse, it overwrites the one from crawl_fetch. The status 
is then redirect, but we lose the fetch time and metadata. The original fetch 
datum is not available while parsing segments.
# skip and delete content-level redirects during indexing (similar to 
robots=noindex)
** check for {{parseData.getStatus().getMinorCode() == 
ParseStatus.SUCCESS_REDIRECT}} in IndexerMapReduce
** additionally (it should do no harm!), try to add the meta refresh to 
CrawlDatum's metadata
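
A rough sketch of the check proposed for indexing (solution 2), not a patch: it only illustrates the test on the parse status. The nested ParseStatus class is a stand-in for Nutch's own class so that the snippet compiles on its own, and its constant values are assumptions. ParsedPage, isContentLevelRedirect and selectIndexable are hypothetical names that do not exist in Nutch. Deleting already indexed redirect pages is not covered.

```java
import java.util.ArrayList;
import java.util.List;

class RedirectIndexFilterSketch {

    /** Stand-in for Nutch's ParseStatus; the constant values are assumed. */
    static final class ParseStatus {
        static final byte SUCCESS = 1;
        static final byte FAILED = 2;
        static final short SUCCESS_OK = 0;
        static final short SUCCESS_REDIRECT = 100;

        private final byte majorCode;
        private final short minorCode;

        ParseStatus(byte majorCode, short minorCode) {
            this.majorCode = majorCode;
            this.minorCode = minorCode;
        }

        boolean isSuccess() { return majorCode == SUCCESS; }
        int getMinorCode() { return minorCode; }
    }

    /** Hypothetical pairing of a URL with its parse status. */
    static final class ParsedPage {
        final String url;
        final ParseStatus status;

        ParsedPage(String url, ParseStatus status) {
            this.url = url;
            this.status = status;
        }
    }

    /** True if the page parsed successfully but only carries a content-level redirect. */
    static boolean isContentLevelRedirect(ParseStatus status) {
        return status != null
            && status.isSuccess()
            && status.getMinorCode() == ParseStatus.SUCCESS_REDIRECT;
    }

    /** Returns the URLs that would still be indexed, skipping meta-refresh pages. */
    static List<String> selectIndexable(List<ParsedPage> pages) {
        List<String> indexable = new ArrayList<>();
        for (ParsedPage page : pages) {
            if (isContentLevelRedirect(page.status)) {
                continue; // skipped, similar to robots=noindex
            }
            indexable.add(page.url);
        }
        return indexable;
    }

    public static void main(String[] args) {
        List<ParsedPage> pages = new ArrayList<>();
        pages.add(new ParsedPage("http://example.org/page",
            new ParseStatus(ParseStatus.SUCCESS, ParseStatus.SUCCESS_OK)));
        pages.add(new ParsedPage("http://example.org/moved",
            new ParseStatus(ParseStatus.SUCCESS, ParseStatus.SUCCESS_REDIRECT)));
        System.out.println(selectIndexable(pages));
    }
}
```

In IndexerMapReduce the same condition would be evaluated on parseData.getStatus(). Checking isSuccess() first keeps a failed parse that happens to carry the same minor code from being treated as a redirect.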

> Content-level redirect status lost in ParseSegment
> --------------------------------------------------
>
>                 Key: NUTCH-685
>                 URL: https://issues.apache.org/jira/browse/NUTCH-685
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 1.0.0
>            Reporter: Andrzej Bialecki 
>            Assignee: Julien Nioche
>             Fix For: 1.10
>
>
> When Fetcher runs in parsing mode, content-level redirects (HTML meta tag 
> "Refresh") are properly discovered and recorded in crawl_fetch under source 
> URL and target URL. If Fetcher runs in non-parsing mode, and ParseSegment is 
> run as a separate step, the content-level redirection data is used only to 
> add the new (target) URL, but the status of the original URL is not reset to 
> indicate a redirect. Consequently, status of the original URL will be 
> different depending on the way you run Fetcher, whereas it should be the same.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
