[jira] [Updated] (NUTCH-2748) Fetch status gone (redirect exceeded) not to overwrite existing items in CrawlDb

Sebastian Nagel (Jira) Fri, 18 Oct 2019 08:37:32 -0700


     [ 
https://issues.apache.org/jira/browse/NUTCH-2748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Sebastian Nagel updated NUTCH-2748:
-----------------------------------
    Description: 
If the fetcher is following redirects and the maximum number of redirects in a 
redirect chain (http.redirect.max) is reached, the fetcher stores a CrawlDatum 
item with status "fetch_gone" and protocol status "redir_exceeded". During the 
next CrawlDb update, the "gone" item sets the status of existing items 
(including "db_fetched") to "db_gone". It shouldn't, as there has been no fetch 
of the final redirect target and nothing is known about its status. A wrong 
db_gone may then cause a page to be deleted from the search index.

There are two possible solutions:
1. ignore protocol status "redir_exceeded" during CrawlDb update
2. when http.redirect.max is hit the fetcher stores nothing or a redirect 
status instead of a fetch_gone
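
Solution 1 could be sketched roughly as follows. This is a simplified model 
under stated assumptions, not Nutch's actual CrawlDb update code: the class 
CrawlDbUpdateSketch, its enums and the update() method are hypothetical names 
used only to illustrate that a "redir_exceeded" result leaves an existing 
entry untouched.

```java
import java.util.Optional;

/** Simplified model of a CrawlDb update; hypothetical, not Nutch's reducer code. */
final class CrawlDbUpdateSketch {

    // Hypothetical stand-ins for the statuses named in the issue description.
    enum DbStatus { DB_UNFETCHED, DB_FETCHED, DB_GONE }
    enum FetchStatus { FETCH_SUCCESS, FETCH_GONE }
    enum ProtocolStatus { SUCCESS, NOTFOUND, REDIR_EXCEEDED }

    /**
     * Returns the CrawlDb status after the update, or an empty Optional if
     * no entry exists and none should be created.
     */
    static Optional<DbStatus> update(Optional<DbStatus> existing,
                                     FetchStatus fetch,
                                     ProtocolStatus protocol) {
        // Solution 1: the redirect limit was hit, so nothing is known about
        // the final target. Ignore the fetch result and keep what is there.
        if (protocol == ProtocolStatus.REDIR_EXCEEDED) {
            return existing;
        }
        switch (fetch) {
            case FETCH_SUCCESS:
                return Optional.of(DbStatus.DB_FETCHED);
            case FETCH_GONE:
                return Optional.of(DbStatus.DB_GONE);
            default:
                throw new IllegalArgumentException("unhandled fetch status: " + fetch);
        }
    }
}
```

With this rule, an existing "db_fetched" entry stays "db_fetched" when the 
incoming item is "fetch_gone" with "redir_exceeded", while a genuine 
"fetch_gone" (for example a not-found response) still yields "db_gone".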

Solution 2 seems easier to implement, and it would be possible to make the 
behavior configurable:
- store the redirect target as outlink, i.e. same behavior as if 
http.redirect.max == 0
- store "fetch_gone" (current behavior)
- store nothing, i.e. ignore those redirects - this should be the default as 
it's close to the current behavior without the risk of accidentally setting 
successful fetches to db_gone
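
The three configurable behaviors listed above might look roughly like this in 
the fetcher. This is only a sketch: RedirectLimitSketch, the Policy and Kind 
enums, the Record class and onRedirectLimit() are hypothetical names, not part 
of Nutch, and the sketch assumes the record is written for the redirect target 
that was not followed.

```java
import java.util.Optional;

/** Hypothetical sketch of a configurable reaction to hitting http.redirect.max. */
final class RedirectLimitSketch {

    /** The three options from the description; the names are assumptions. */
    enum Policy { OUTLINK, GONE, IGNORE }

    /** What kind of record the fetcher would write. */
    enum Kind { OUTLINK, FETCH_GONE }

    /** Minimal stand-in for the item the fetcher stores. */
    static final class Record {
        final String url;
        final Kind kind;

        Record(String url, Kind kind) {
            this.url = url;
            this.kind = kind;
        }
    }

    /**
     * Decides what to store once the redirect limit is reached.
     * unfetchedTarget is the redirect target that was not followed (assumption).
     */
    static Optional<Record> onRedirectLimit(Policy policy, String unfetchedTarget) {
        switch (policy) {
            case OUTLINK:
                // Same effect as http.redirect.max == 0: queue the target as a link.
                return Optional.of(new Record(unfetchedTarget, Kind.OUTLINK));
            case GONE:
                // Current behavior: may later turn db_fetched into db_gone.
                return Optional.of(new Record(unfetchedTarget, Kind.FETCH_GONE));
            case IGNORE:
                // Proposed default: store nothing, so the CrawlDb is not touched.
                return Optional.empty();
            default:
                throw new IllegalArgumentException("unknown policy: " + policy);
        }
    }
}
```

Returning an empty Optional for the proposed default makes "store nothing" 
explicit, so the caller cannot accidentally write a "gone" item.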





  was:
If the fetcher is following redirects and the maximum number of redirects in a 
redirect chain (http.redirect.max) is reached, the fetcher stores a CrawlDatum 
item with status "fetch_gone" and protocol status "redir_exceeded". During the 
next CrawlDb update, the "gone" item sets the status of existing items 
(including "db_fetched") to "db_gone". It shouldn't, as there has been no fetch 
of the final redirect target and nothing is known about its status. A wrong 
db_gone may then cause a page to be deleted from the search index.

There are two possible solutions:
1. ignore protocol status "redir_exceeded" during CrawlDb update
2. when http.redirect.max is hit the fetcher stores nothing or a redirect 
status instead of a fetch_gone

Solution 2 seems easier to implement, and it would be possible to make the 
behavior configurable:
- store redirect (fetch_redir_temp or fetch_redir_perm)
- store "fetch_gone" (current behavior)
- store nothing, i.e. ignore those redirects - this should be the default as 
it's close to the current behavior without the risk of accidentally setting 
successful fetches to db_gone






> Fetch status gone (redirect exceeded) not to overwrite existing items in 
> CrawlDb
> --------------------------------------------------------------------------------
>
>                 Key: NUTCH-2748
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2748
>             Project: Nutch
>          Issue Type: Bug
>          Components: crawldb, fetcher
>    Affects Versions: 1.16
>            Reporter: Sebastian Nagel
>            Priority: Major
>             Fix For: 1.17
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
