Author: kwright
Date: Wed Jan 9 12:31:04 2013
New Revision: 1430825
URL: http://svn.apache.org/viewvc?rev=1430825&view=rev
Log:
Fix for CONNECTORS-601. Redefine what a 'strange' character is to be better
compatible with CJK characters.
Modified:
manifoldcf/trunk/CHANGES.txt
manifoldcf/trunk/connectors/webcrawler/connector/src/main/java/org/apache/manifoldcf/crawler/connectors/webcrawler/WebcrawlerConnector.java
Modified: manifoldcf/trunk/CHANGES.txt
URL: http://svn.apache.org/viewvc/manifoldcf/trunk/CHANGES.txt?rev=1430825&r1=1430824&r2=1430825&view=diff
==============================================================================
--- manifoldcf/trunk/CHANGES.txt (original)
+++ manifoldcf/trunk/CHANGES.txt Wed Jan 9 12:31:04 2013
@@ -3,6 +3,12 @@ $Id$
======================= 1.1-dev =====================
+CONNECTORS-601: Revise algorithm for screening out documents that
+are not text in the web connector. Since CJK characters mess up the
+old definition of "strange" character, use the more-limited definition of
+all non-whitespace characters less than 32.
+(Shinichiro Abe, Karl Wright)
+
CONNECTORS-600: Add a field to the RSS connector that contains
document origination date in ISO 8601 format.
(David Morana, Karl Wright)
Modified: manifoldcf/trunk/connectors/webcrawler/connector/src/main/java/org/apache/manifoldcf/crawler/connectors/webcrawler/WebcrawlerConnector.java
URL: http://svn.apache.org/viewvc/manifoldcf/trunk/connectors/webcrawler/connector/src/main/java/org/apache/manifoldcf/crawler/connectors/webcrawler/WebcrawlerConnector.java?rev=1430825&r1=1430824&r2=1430825&view=diff
==============================================================================
--- manifoldcf/trunk/connectors/webcrawler/connector/src/main/java/org/apache/manifoldcf/crawler/connectors/webcrawler/WebcrawlerConnector.java (original)
+++ manifoldcf/trunk/connectors/webcrawler/connector/src/main/java/org/apache/manifoldcf/crawler/connectors/webcrawler/WebcrawlerConnector.java Wed Jan 9 12:31:04 2013
@@ -6896,13 +6896,13 @@ public class WebcrawlerConnector extends
       if (isStrange(x))
         count++;
     }
-    return ((double)count)/((double)chunkLength) < 0.70;
+    return ((double)count)/((double)chunkLength) < 0.30;
   }

-  /** Check if character is not typical ASCII. */
+  /** Check if character is not typical ASCII or utf-8. */
   protected static boolean isStrange(byte x)
   {
-    return (x > 127 || x < 32) && (!isWhiteSpace(x));
+    return (x < 32) && (!isWhiteSpace(x));
   }

   /** Check if a byte is a whitespace character. */
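
The screening heuristic touched by this change can be sketched in isolation as follows. This is a minimal illustration, not the connector's source: the class name StrangeByteSketch, the helpers strangeFraction and isStrangeMasked, and the whitespace set (tab, LF, CR, space) are assumptions, since the real isWhiteSpace body is outside the hunk. One point that may be worth checking against the real connector: a Java byte is signed (-128..127), so the bytes of a multi-byte UTF-8 sequence (0x80 to 0xFF) are negative and still satisfy x < 32. The masked variant shows one possible way to count only control bytes.

```java
import java.nio.charset.StandardCharsets;

/** Illustrative sketch only; not the WebcrawlerConnector source. */
class StrangeByteSketch
{
  /** Assumed whitespace set: tab, LF, CR, space. */
  static boolean isWhiteSpace(byte x)
  {
    return x == 0x09 || x == 0x0a || x == 0x0d || x == 0x20;
  }

  /** Same expression as the new isStrange in the diff, applied to a signed byte. */
  static boolean isStrangeSigned(byte x)
  {
    return (x < 32) && (!isWhiteSpace(x));
  }

  /** Hypothetical variant: widen to 0..255 first, so only bytes 0x00..0x1F can match. */
  static boolean isStrangeMasked(byte x)
  {
    int u = x & 0xff;
    return (u < 32) && (!isWhiteSpace(x));
  }

  /** Fraction of bytes in the chunk that the chosen test flags as strange. */
  static double strangeFraction(byte[] chunk, boolean masked)
  {
    if (chunk.length == 0)
      return 0.0;
    int count = 0;
    for (byte x : chunk)
    {
      if (masked ? isStrangeMasked(x) : isStrangeSigned(x))
        count++;
    }
    return ((double)count)/((double)chunk.length);
  }

  public static void main(String[] args)
  {
    // Three CJK characters; each encodes to three UTF-8 bytes, all >= 0x80.
    byte[] cjk = "\u65e5\u672c\u8a9e".getBytes(StandardCharsets.UTF_8);
    System.out.println("signed comparison: " + strangeFraction(cjk, false)); // prints 1.0
    System.out.println("masked comparison: " + strangeFraction(cjk, true));  // prints 0.0
  }
}
```

With the signed comparison every byte of the CJK sample is flagged, so the fraction is 1.0 and a "< 0.30" test would not pass; with the masked comparison the fraction is 0.0.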