Author: kwright
Date: Wed Jan  9 12:31:04 2013
New Revision: 1430825

URL: http://svn.apache.org/viewvc?rev=1430825&view=rev
Log:
Fix for CONNECTORS-601.  Redefine what a 'strange' character is so that it
is more compatible with CJK characters.

Modified:
    manifoldcf/trunk/CHANGES.txt
    manifoldcf/trunk/connectors/webcrawler/connector/src/main/java/org/apache/manifoldcf/crawler/connectors/webcrawler/WebcrawlerConnector.java

Modified: manifoldcf/trunk/CHANGES.txt
URL: http://svn.apache.org/viewvc/manifoldcf/trunk/CHANGES.txt?rev=1430825&r1=1430824&r2=1430825&view=diff
==============================================================================
--- manifoldcf/trunk/CHANGES.txt (original)
+++ manifoldcf/trunk/CHANGES.txt Wed Jan  9 12:31:04 2013
@@ -3,6 +3,12 @@ $Id$
 
 ======================= 1.1-dev =====================
 
+CONNECTORS-601: Revise algorithm for screening out documents that
+are not text in the web connector.  Since CJK characters mess up the
+old definition of "strange" character, use the more-limited definition of
+all non-whitespace characters less than 32.
+(Shinichiro Abe, Karl Wright)
+
 CONNECTORS-600: Add a field to the RSS connector that contains
 document origination date in ISO 8601 format.
 (David Morana, Karl Wright)

Modified: manifoldcf/trunk/connectors/webcrawler/connector/src/main/java/org/apache/manifoldcf/crawler/connectors/webcrawler/WebcrawlerConnector.java
URL: http://svn.apache.org/viewvc/manifoldcf/trunk/connectors/webcrawler/connector/src/main/java/org/apache/manifoldcf/crawler/connectors/webcrawler/WebcrawlerConnector.java?rev=1430825&r1=1430824&r2=1430825&view=diff
==============================================================================
--- manifoldcf/trunk/connectors/webcrawler/connector/src/main/java/org/apache/manifoldcf/crawler/connectors/webcrawler/WebcrawlerConnector.java (original)
+++ manifoldcf/trunk/connectors/webcrawler/connector/src/main/java/org/apache/manifoldcf/crawler/connectors/webcrawler/WebcrawlerConnector.java Wed Jan  9 12:31:04 2013
@@ -6896,13 +6896,13 @@ public class WebcrawlerConnector extends
       if (isStrange(x))
         count++;
     }
-    return ((double)count)/((double)chunkLength) < 0.70;
+    return ((double)count)/((double)chunkLength) < 0.30;
   }
 
-  /** Check if character is not typical ASCII. */
+  /** Check if character is not typical ASCII or utf-8. */
   protected static boolean isStrange(byte x)
   {
-    return (x > 127 || x < 32) && (!isWhiteSpace(x));
+    return (x < 32) && (!isWhiteSpace(x));
   }
 
   /** Check if a byte is a whitespace character. */
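
The revised screening heuristic in the diff above can be sketched as a
standalone Java class. This is a minimal illustration, not the connector's
code: the class name TextScreenSketch, the method name looksLikeText, the
whitespace set (tab, LF, CR, space), and the handling of an empty chunk are
all assumptions, since the diff does not show them. Because Java's byte type
is signed, the sketch masks each byte to the range 0..255 before comparing,
so that UTF-8 lead and continuation bytes (0x80 and above) are not counted
as strange. That masking is an assumption about intent, not something the
diff shows.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical standalone sketch of the revised "is this text?" screen.
final class TextScreenSketch {

    // Assumed whitespace set; the real isWhiteSpace body is not in the diff.
    static boolean isWhiteSpace(int b) {
        return b == 0x09 || b == 0x0a || b == 0x0d || b == 0x20;
    }

    // A byte is "strange" if it is a non-whitespace control byte (< 32).
    // Masking with 0xff treats the signed Java byte as unsigned, so bytes
    // 0x80..0xff (common in UTF-8 encoded CJK text) are not flagged.
    static boolean isStrange(byte x) {
        int b = x & 0xff;
        return b < 32 && !isWhiteSpace(b);
    }

    // Accept the chunk as text when fewer than 30% of its bytes are strange.
    static boolean looksLikeText(byte[] chunk, int chunkLength) {
        if (chunkLength == 0) {
            return true; // assumption: an empty chunk is not rejected
        }
        int count = 0;
        for (int i = 0; i < chunkLength; i++) {
            if (isStrange(chunk[i])) {
                count++;
            }
        }
        return ((double) count) / ((double) chunkLength) < 0.30;
    }

    public static void main(String[] args) {
        // Three CJK characters plus a newline, encoded as UTF-8.
        byte[] cjk = "\u65e5\u672c\u8a9e\n".getBytes(StandardCharsets.UTF_8);
        // Eight control bytes, standing in for binary content.
        byte[] binary = new byte[] {0, 1, 2, 3, 4, 5, 6, 7};
        System.out.println(looksLikeText(cjk, cjk.length));
        System.out.println(looksLikeText(binary, binary.length));
    }
}
```

With the pre-change definition, every byte of a UTF-8 encoded CJK character
would have been intended to count as strange; under the narrower definition
only control bytes do, which is why the acceptance threshold could be
tightened from 0.70 to 0.30.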

