sebastian-nagel commented on PR #845:
URL: https://github.com/apache/nutch/pull/845#issuecomment-3054084983

   > intranets or users interested to index 'their' content, be it on local or remote servers will need authority data to be preserved
   
   @HiranChaudhuri, I understand your argument. Thanks!
   
   However, let's keep it simple here. If the authority parts are really 
required, the simple solution is to disable the basic URL normalizer by 
removing the `urlnormalizer-basic` plugin from `plugin.includes`. When 
crawling an intranet or a specific site, strict URL normalization is less of 
a requirement than for a broad web crawl. Of course, this means you need a 
separate configuration for the intranet crawl. But in my experience, this is 
often necessary anyway because of other crawl-specific settings, e.g. a 
different revisit schedule.
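
As a rough sketch of what such an override could look like in the 
`nutch-site.xml` of the intranet crawl: the value below is only illustrative 
and assumes a setup close to a typical default plugin list. Keep whatever 
other plugins your crawl already uses and only drop `basic` from the 
`urlnormalizer-(...)` group.

```xml
<!-- nutch-site.xml of the intranet crawl.
     Illustrative value only: start from your own plugin.includes. -->
<property>
  <name>plugin.includes</name>
  <!-- urlnormalizer-basic is left out; the pass and regex normalizers stay enabled -->
  <value>protocol-http|urlfilter-(regex|validator)|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex)</value>
</property>
```

Since `plugin.includes` is a regular expression matched against plugin IDs, 
removing `basic` from the alternation should be enough to keep the plugin 
from being loaded.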


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@nutch.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
