Sebastian Nagel created NUTCH-1885:
--------------------------------------
Summary: Protocol-file should treat symbolic links as redirects
Key: NUTCH-1885
URL: https://issues.apache.org/jira/browse/NUTCH-1885
Project: Nutch
Issue Type: Sub-task
Components: protocol
Affects Versions: 2.2.1, 1.9
Reporter: Sebastian Nagel
Priority: Minor
Fix For: 2.3, 1.10
(reported by [~angela_wang], see NUTCH-1884,
[[1|https://www.mail-archive.com/[email protected]/msg15614.html]] and
[[2|https://www.mail-archive.com/[email protected]/msg15610.html]])
If a file is a symbolic link or contains a link on its path, protocol-file
follows the link immediately and returns a Content object with the canonical
path (all symbolic links resolved) in field "Location". This may cause:
- the Parse object not being available under its expected URL (see NUTCH-1884)
- dubious CrawlDatums (status fetched!) in CrawlDb (in the example below, the
first URL is a symbolic link to the second):
{noformat}
file:/var/www/redir_test.html Version: 7
Status: 2 (db_fetched)
...
Signature: null
Metadata:
Content-Type=text/html
_pst_=success(1), lastModified=0
file:/var/www/test.html Version: 7
Status: 2 (db_fetched)
...
Signature: 50fa8436398f0ecb6b15eaba0574ef23
Metadata:
Content-Type=text/html
_pst_=success(1), lastModified=0
{noformat}
Because the signature is null, these entries will never result in duplicates
in the index.
Protocol-file should instead explicitly redirect to the link target. This
should be the default; optionally, we could add a property to restore the old
behavior.
This should not be difficult to resolve: FileResponse already has status
"redirect" for symlinks, but File.getProtocolOutput() then resolves the links
internally. So we just need to return a redirect response before links are
resolved/followed.
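The proposed behavior can be sketched roughly as follows. This is only an
illustration, not the protocol-file code: the class SymlinkRedirectSketch, its
Result type, the method names, and the status values 200/300 are all
hypothetical, and Unix-style paths are assumed. The idea is to compare the
requested path with its canonical path and, if they differ, hand a redirect to
the caller instead of silently reading the link target.

```java
import java.io.File;
import java.io.IOException;

/** Hypothetical sketch only; none of these names come from Nutch. */
final class SymlinkRedirectSketch {

  /** Minimal stand-in for a protocol response. */
  static final class Result {
    final int code;        // OK or REDIRECT
    final String location; // redirect target URL, null when code == OK

    Result(int code, String location) {
      this.code = code;
      this.location = location;
    }
  }

  static final int OK = 200;       // assumed value for "content can be read"
  static final int REDIRECT = 300; // assumed value for "follow Location"

  /**
   * Pure decision step: if the requested path is not already canonical,
   * report a redirect to the canonical path instead of reading the content.
   */
  static Result decide(String requestedPath, String canonicalPath) {
    if (requestedPath.equals(canonicalPath)) {
      return new Result(OK, null);
    }
    // Assumes a Unix-style absolute path, giving e.g. file:/var/www/test.html
    return new Result(REDIRECT, "file:" + canonicalPath);
  }

  /**
   * File system step: getCanonicalPath() resolves symbolic links, but also
   * "." and ".." segments, so those would be reported as redirects as well.
   */
  static Result check(File f) throws IOException {
    return decide(f.getAbsolutePath(), f.getCanonicalPath());
  }
}
```

With the example from the CrawlDb dump above,
decide("/var/www/redir_test.html", "/var/www/test.html") yields a redirect to
file:/var/www/test.html, so the fetcher would record the first URL as a
redirect rather than as fetched.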