sebastian-nagel commented on PR #845:
URL: https://github.com/apache/nutch/pull/845#issuecomment-3054084983

> intranets or users interested to index 'their' content, be it on local or remote servers will need authority data to be preserved

@HiranChaudhuri, I understand your argument. Thanks!

However, let's keep it simple here. If the authority parts are really required, the simple solution is to disable the basic URL normalizer by removing its plugin from `plugin.includes`.

When crawling an intranet or a specific site, strict URL normalization is less of a requirement than for a broad web crawl. Of course, this means you need a separate configuration for the intranet crawl. But in my experience, this is often already necessary because of other specific configuration options, e.g. a different revisit schedule.

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@nutch.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
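
For illustration, the suggestion in the comment above (removing the basic URL normalizer from `plugin.includes`) might look roughly like the following override in a separate `nutch-site.xml` for the intranet crawl. This is only a sketch: the property value is an assumption modelled on a typical default and differs between Nutch versions, so the actual value should be copied from your own `nutch-default.xml`, with only the `basic` alternative dropped from the `urlnormalizer-(...)` group.

```xml
<!-- nutch-site.xml for the intranet crawl (sketch, not a verified configuration) -->
<configuration>
  <property>
    <name>plugin.includes</name>
    <!-- Assumed starting value; check nutch-default.xml of your Nutch version.
         The urlnormalizer group here lists only "pass" and "regex", so the
         plugin "urlnormalizer-basic" no longer matches and is not loaded. -->
    <value>protocol-http|urlfilter-(regex|validator)|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex)</value>
  </property>
</configuration>
```

Keeping this override in a configuration directory used only for the intranet crawl leaves the broad web crawl with its stricter normalization unchanged.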