HiranChaudhuri commented on PR #845: URL: https://github.com/apache/nutch/pull/845#issuecomment-2522247755
Does it make sense to decide whether to strip authority data based on the protocol? I acknowledge that most users want to crawl the internet anonymously. But intranet operators, or users who want to index 'their' content on local or remote servers, will need the authority data preserved, and they have no control over the protocol. So I suspect it will sometimes be required even when https is used. How about making it configurable, maybe via a regexp? That would let Nutch users define the protocol, or the site, or ... for which the authority is preserved.
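To make the regexp idea concrete, here is a minimal sketch in plain Java. It assumes "authority data" means the user-info part of the URL (`user:password@`). The class name `UserInfoFilter` and the preserve pattern are hypothetical; none of this is existing Nutch API or an existing Nutch property. URLs whose stripped form matches the pattern keep their user info; all others lose it.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: strips user info from a URL
// unless the URL (without user info) matches a configurable regexp.
class UserInfoFilter {
    // Assumed to come from configuration, e.g. a crawl property.
    private final Pattern preserve;

    UserInfoFilter(String preserveRegex) {
        this.preserve = Pattern.compile(preserveRegex);
    }

    String normalize(String url) {
        URI uri;
        try {
            uri = new URI(url);
        } catch (URISyntaxException e) {
            return url; // leave unparsable URLs untouched
        }
        String rawUserInfo = uri.getRawUserInfo();
        if (rawUserInfo == null) {
            return url; // nothing to strip
        }
        // Cut "userinfo@" out of the original string so that the rest
        // of the URL keeps its exact encoding.
        int at = url.indexOf(rawUserInfo + "@");
        String stripped = url.substring(0, at)
                + url.substring(at + rawUserInfo.length() + 1);
        // Match against the stripped form so patterns can be written
        // without knowing the credentials.
        return preserve.matcher(stripped).find() ? url : stripped;
    }
}
```

With the pattern `^https?://intranet\.example\.org/`, a URL such as `https://crawler:secret@intranet.example.org/docs` would be left as-is, while `https://user:secret@www.example.com/page?x=1` would lose its user info regardless of the protocol.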