[ https://issues.apache.org/jira/browse/NUTCH-1885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sebastian Nagel updated NUTCH-1885:
-----------------------------------
    Attachment: NUTCH-1885-trunk-v1.patch

Patch for trunk (probably also works for 2.x): the old behavior can be restored by setting the property "file.crawl.redirect_noncanonical" to false.

> Protocol-file should treat symbolic links as redirects
> ------------------------------------------------------
>
>                 Key: NUTCH-1885
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1885
>             Project: Nutch
>          Issue Type: Sub-task
>          Components: protocol
>    Affects Versions: 1.9, 2.2.1
>            Reporter: Sebastian Nagel
>            Priority: Minor
>             Fix For: 2.3, 1.10
>
>         Attachments: NUTCH-1885-trunk-v1.patch
>
>
> (reported by [~angela_wang], see NUTCH-1884,
> [[1|https://www.mail-archive.com/dev@nutch.apache.org/msg15614.html]] and
> [[2|https://www.mail-archive.com/dev@nutch.apache.org/msg15610.html]])
> If a file is a symbolic link or contains a link on its path, protocol-file
> follows the link immediately and returns a Content object with the canonical
> path (all symbolic links resolved) in the field "Location". This may cause
> - the Parse object not being available under its expected URL (see NUTCH-1884)
> - dubious CrawlDatums (status fetched!) in CrawlDb (the first URL is a symbolic
> link to the second item):
> {noformat}
> file:/var/www/redir_test.html   Version: 7
> Status: 2 (db_fetched)
> ...
> Signature: null
> Metadata:
>         Content-Type=text/html
>         _pst_=success(1), lastModified=0
> file:/var/www/test.html Version: 7
> Status: 2 (db_fetched)
> ...
> Signature: 50fa8436398f0ecb6b15eaba0574ef23
> Metadata:
>         Content-Type=text/html
>         _pst_=success(1), lastModified=0
> {noformat}
> Because the signature is null, these will never result in duplicates in the index.
> Protocol-file should instead explicitly redirect to the link target. This
> should be the default; optionally, we could add a property to restore the old
> behavior.
> Should not be difficult to resolve: FileResponse already has status
> "redirect" for symlinks, but File.getProtocolOutput() then resolves the links
> internally. So we just need to return a redirect response before links are
> resolved/followed.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
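
For illustration only, here is a minimal, self-contained sketch of the idea described in the issue: report a redirect to the canonical path instead of silently following symbolic links. It is not the attached patch and does not use Nutch classes. The class SymlinkRedirectSketch, its Response type, the fetch() method and the status codes 200/300/404 are all assumptions made for this example. The boolean parameter only mimics the role of the "file.crawl.redirect_noncanonical" property.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical stand-in for illustration; this is not Nutch's FileResponse. */
public class SymlinkRedirectSketch {

  /** Minimal response: an HTTP-like status code plus an optional redirect target. */
  public static final class Response {
    public final int code;
    public final String location; // null unless the response is a redirect

    Response(int code, String location) {
      this.code = code;
      this.location = location;
    }
  }

  /**
   * Returns a redirect (assumed code 300) when the requested path is not
   * canonical, i.e. it is a symbolic link or has one on its path.
   * With redirectNoncanonical == false, links are followed silently,
   * which corresponds to the old behavior described in the issue.
   */
  public static Response fetch(Path requested, boolean redirectNoncanonical)
      throws IOException {
    Path absolute = requested.toAbsolutePath();
    if (!Files.exists(absolute)) {
      return new Response(404, null); // also covers dangling symbolic links
    }
    // toRealPath() resolves every symbolic link on the path.
    Path canonical = absolute.toRealPath();
    if (redirectNoncanonical && !canonical.equals(absolute)) {
      // Stop here and let the caller schedule the link target as a new URL.
      return new Response(300, canonical.toUri().toString());
    }
    return new Response(200, null); // the content would be read at this point
  }
}
```

With this approach the link URL and the target URL stay separate items for the caller: the link yields only a redirect, and the content is fetched once, under the canonical URL.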