Author: snagel
Date: Wed Oct 10 21:16:09 2012
New Revision: 1396801
URL: http://svn.apache.org/viewvc?rev=1396801&view=rev
Log:
NUTCH-1344 BasicURLNormalizer to normalize https same as http
Modified:
nutch/trunk/CHANGES.txt
nutch/trunk/src/plugin/urlnormalizer-basic/src/java/org/apache/nutch/net/urlnormalizer/basic/BasicURLNormalizer.java
Modified: nutch/trunk/CHANGES.txt
URL: http://svn.apache.org/viewvc/nutch/trunk/CHANGES.txt?rev=1396801&r1=1396800&r2=1396801&view=diff
==============================================================================
--- nutch/trunk/CHANGES.txt (original)
+++ nutch/trunk/CHANGES.txt Wed Oct 10 21:16:09 2012
@@ -2,6 +2,8 @@ Nutch Change Log
(trunk) Current Development:
+* NUTCH-1344 BasicURLNormalizer to normalize https same as http
+
* NUTCH-706 Url regex normalizer: pattern for session id removal not to match
"newsId" (Meghna Kukreja via snagel)
* NUTCH-1415 release packages to contain top level folder apache-nutch-x.x
(snagel)
Modified: nutch/trunk/src/plugin/urlnormalizer-basic/src/java/org/apache/nutch/net/urlnormalizer/basic/BasicURLNormalizer.java
URL: http://svn.apache.org/viewvc/nutch/trunk/src/plugin/urlnormalizer-basic/src/java/org/apache/nutch/net/urlnormalizer/basic/BasicURLNormalizer.java?rev=1396801&r1=1396800&r2=1396801&view=diff
==============================================================================
--- nutch/trunk/src/plugin/urlnormalizer-basic/src/java/org/apache/nutch/net/urlnormalizer/basic/BasicURLNormalizer.java (original)
+++ nutch/trunk/src/plugin/urlnormalizer-basic/src/java/org/apache/nutch/net/urlnormalizer/basic/BasicURLNormalizer.java Wed Oct 10 21:16:09 2012
@@ -104,7 +104,7 @@ public class BasicURLNormalizer extends
if (!urlString.startsWith(protocol)) // protocol was lowercased
changed = true;
- if ("http".equals(protocol) || "ftp".equals(protocol)) {
+ if ("http".equals(protocol) || "https".equals(protocol) || "ftp".equals(protocol)) {
if (host != null) {
String newHost = host.toLowerCase(); // lowercase host
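
For illustration only, the effect of the widened condition can be sketched outside of Nutch. The class SchemeNormalizeSketch and its normalize method below are hypothetical names and are not part of BasicURLNormalizer; the sketch only mirrors the scheme check and the host lowercasing visible in the hunk above, and it assumes java.net.URL for parsing. It omits whatever else the real normalizer does, and because it rebuilds the URL from getFile() it drops any fragment.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.Locale;

public class SchemeNormalizeSketch {

    // Hypothetical helper (not the Nutch method): lowercases the host for
    // the three schemes accepted by the patched condition, and returns the
    // input unchanged for any other scheme.
    static String normalize(String urlString) {
        try {
            URL url = new URL(urlString);
            String protocol = url.getProtocol();
            String host = url.getHost();

            // Same check as the patched line: https is now treated like http and ftp.
            if ("http".equals(protocol) || "https".equals(protocol)
                    || "ftp".equals(protocol)) {
                if (host != null) {
                    String newHost = host.toLowerCase(Locale.ROOT); // lowercase host
                    // Rebuild from the parsed parts; getFile() is path plus query.
                    return new URL(protocol, newHost, url.getPort(), url.getFile())
                            .toString();
                }
            }
            return urlString;
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("not a parseable URL: " + urlString, e);
        }
    }

    public static void main(String[] args) {
        // Both schemes should now take the same branch.
        System.out.println(normalize("http://WWW.Example.COM/Path"));
        System.out.println(normalize("https://WWW.Example.COM/Path"));
    }
}
```

Before this revision an https URL would have skipped the branch entirely, so its host would have been left in its original case.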