Sebastian Nagel created NUTCH-1885:
--------------------------------------

             Summary: Protocol-file should treat symbolic links as redirects
                 Key: NUTCH-1885
                 URL: https://issues.apache.org/jira/browse/NUTCH-1885
             Project: Nutch
          Issue Type: Sub-task
          Components: protocol
    Affects Versions: 2.2.1, 1.9
            Reporter: Sebastian Nagel
            Priority: Minor
             Fix For: 2.3, 1.10


(reported by [~angela_wang], see NUTCH-1884, 
[[1|https://www.mail-archive.com/dev@nutch.apache.org/msg15614.html]] and 
[[2|https://www.mail-archive.com/dev@nutch.apache.org/msg15610.html]])

If a file is a symbolic link or contains a link on its path, protocol-file 
follows the link immediately and returns a Content object with the canonical 
path (all symbolic links resolved) in field "Location". This may cause
- the Parse object to be unavailable under its expected URL (see NUTCH-1884)
- dubious CrawlDatums (status fetched!) in CrawlDb (the first URL is a symbolic 
link to the second item):
{noformat}
file:/var/www/redir_test.html   Version: 7
Status: 2 (db_fetched)
...
Signature: null
Metadata: 
        Content-Type=text/html
        _pst_=success(1), lastModified=0

file:/var/www/test.html Version: 7
Status: 2 (db_fetched)
...
Signature: 50fa8436398f0ecb6b15eaba0574ef23
Metadata: 
        Content-Type=text/html
        _pst_=success(1), lastModified=0
{noformat}
Because the signature is null, these will never result in duplicates in the index.

Protocol-file should instead explicitly redirect to the link target. This 
should be the default; optionally, we could add a property to restore the old 
behavior.

Should not be difficult to resolve: FileResponse already has status "redirect" 
for symlinks, but File.getProtocolOutput() then resolves the links internally. 
So we just need to return a redirect response before links are 
resolved/followed.
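
To illustrate the idea of redirecting before links are resolved, here is a 
minimal sketch using only java.nio.file. It is not the Nutch code: the class 
SymlinkRedirectSketch and the method redirectTarget are hypothetical names, 
and a real fix would have to wrap the result in protocol-file's own redirect 
response. The sketch walks the requested path from the root, stops at the 
first symbolic link, and returns the path with only that one link replaced by 
its target.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Optional;

// Hypothetical helper, not part of Nutch: finds the first symbolic link on a
// path and computes where a redirect for that single hop should point.
public class SymlinkRedirectSketch {

  /**
   * Returns the redirect target if the requested path is, or passes through,
   * a symbolic link; returns empty if no component is a symbolic link.
   * Only the first link is replaced, so a chain of links yields one
   * redirect per hop instead of a silently resolved canonical path.
   */
  public static Optional<Path> redirectTarget(Path requested) throws IOException {
    Path absolute = requested.toAbsolutePath();
    Path current = absolute.getRoot();
    int count = absolute.getNameCount();
    for (int i = 0; i < count; i++) {
      current = current.resolve(absolute.getName(i));
      if (Files.isSymbolicLink(current)) {
        Path linkValue = Files.readSymbolicLink(current);
        // A relative link value is interpreted relative to the directory
        // containing the link; an absolute one replaces the path so far.
        // normalize() is purely lexical (it does not touch the file system).
        Path resolved = current.resolveSibling(linkValue).normalize();
        if (i + 1 < count) {
          // Re-append the components that followed the link.
          resolved = resolved.resolve(absolute.subpath(i + 1, count));
        }
        return Optional.of(resolved);
      }
    }
    return Optional.empty();
  }
}
```

With the example from the CrawlDb dump above, calling redirectTarget on 
/var/www/redir_test.html would be expected to yield /var/www/test.html, which 
the caller could then report as a redirect instead of returning the content 
of the target under the link's URL.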



