Sebastian Nagel created NUTCH-1885:
--------------------------------------

             Summary: Protocol-file should treat symbolic links as redirects
                 Key: NUTCH-1885
                 URL: https://issues.apache.org/jira/browse/NUTCH-1885
             Project: Nutch
          Issue Type: Sub-task
          Components: protocol
    Affects Versions: 2.2.1, 1.9
            Reporter: Sebastian Nagel
            Priority: Minor
             Fix For: 2.3, 1.10

(reported by [~angela_wang], see NUTCH-1884, [[1|https://www.mail-archive.com/dev@nutch.apache.org/msg15614.html]] and [[2|https://www.mail-archive.com/dev@nutch.apache.org/msg15610.html]])

If a file is a symbolic link or contains a link on its path, protocol-file follows the link immediately and returns a Content object with the canonical path (all symbolic links resolved) in the field "Location". This may cause:
- the Parse object to be unavailable under its expected URL (see NUTCH-1884)
- dubious CrawlDatums (status fetched!) in CrawlDb (the first URL is a symbolic link to the second item):
{noformat}
file:/var/www/redir_test.html
Version: 7
Status: 2 (db_fetched)
...
Signature: null
Metadata: Content-Type=text/html _pst_=success(1), lastModified=0

file:/var/www/test.html
Version: 7
Status: 2 (db_fetched)
...
Signature: 50fa8436398f0ecb6b15eaba0574ef23
Metadata: Content-Type=text/html _pst_=success(1), lastModified=0
{noformat}
Because the signature is null, these entries will never result in duplicates in the index.

Protocol-file should instead explicitly redirect to the link target. This should be the default; optionally, we could add a property to restore the old behavior.

Should not be difficult to resolve: FileResponse already has status "redirect" for symlinks, but File.getProtocolOutput() then resolves the links internally. So we just need to return a redirect response before links are resolved/followed.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
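
For illustration only, the idea described in the issue (return a redirect before any link is followed) might be sketched as below. This is not the Nutch code and not the actual patch: `SymlinkRedirectSketch`, `Result`, `fetch` and the two status constants are hypothetical names, and only standard `java.nio.file` calls are used. The sketch assumes that "is a symbolic link or contains a link on its path" can be detected by comparing the absolute path with the fully resolved one.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical sketch; none of these names exist in Nutch. */
final class SymlinkRedirectSketch {

  // Illustrative status values, not taken from the Nutch sources.
  static final int STATUS_OK = 200;
  static final int STATUS_REDIRECT = 300;

  /** Minimal stand-in for a protocol response. */
  static final class Result {
    final int code;
    final String location; // redirect target URL, or null if no redirect

    Result(int code, String location) {
      this.code = code;
      this.location = location;
    }
  }

  /**
   * Returns a redirect to the resolved target if the path is a symbolic
   * link or has a symbolic link among its parent directories. Otherwise
   * reports that the content can be read under the requested path.
   */
  static Result fetch(Path path) {
    try {
      Path absolute = path.toAbsolutePath().normalize();
      Path resolved = path.toRealPath(); // follows every symbolic link
      if (!absolute.equals(resolved)) {
        // Do not read the content here: hand the target back to the caller
        // so that it is fetched and stored under its own URL.
        return new Result(STATUS_REDIRECT, resolved.toUri().toString());
      }
      return new Result(STATUS_OK, null);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    Result r = fetch(Paths.get(args.length > 0 ? args[0] : "."));
    System.out.println(r.code + (r.location != null ? " -> " + r.location : ""));
  }
}
```

With a redirect returned at this point, the link itself would presumably not end up as db_fetched with a null signature, and the content would only be stored under the target URL. Whether that matches the eventual fix depends on how File.getProtocolOutput() is changed.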