Stop words and stemming always make literal searching less precise; the trade-off is greater matching power (more general matches) and a smaller index.
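That stemming side of the trade-off can be sketched with a toy suffix stripper (not the Porter stemmer Lucene actually ships; the suffix list here is purely illustrative):

```java
public class StemDemo {
    // Toy suffix stripper: conflating word forms boosts recall and shrinks
    // the term dictionary, but distinct words can collapse to one index term.
    static String stem(String word) {
        for (String suffix : new String[] {"ing", "ed", "s"}) {
            // only strip when enough of a stem remains
            if (word.length() > suffix.length() + 2 && word.endsWith(suffix)) {
                return word.substring(0, word.length() - suffix.length());
            }
        }
        return word;
    }

    public static void main(String[] args) {
        // "searching" and "searched" now match a query for "search" (recall up)...
        System.out.println(stem("searching")); // prints "search"
        System.out.println(stem("searched"));  // prints "search"
        // ...but "ironing" is no longer literally findable: it indexes as "iron"
        // (precision down).
        System.out.println(stem("ironing"));   // prints "iron"
    }
}
```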
Where did the English stop word list come from? I feel as if I don't have
enough info to judge if this is a good change or not.

-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server

On 8/5/06, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
Author: dnaber
Date: Sat Aug  5 06:11:09 2006
New Revision: 428998

URL: http://svn.apache.org/viewvc?rev=428998&view=rev
Log:
remove "s" and "t" as stopwords because they make searching less precise,
e.g. "t-online" gives the same results as "online" with "t" being a stopword

Modified:
    lucene/java/trunk/CHANGES.txt
    lucene/java/trunk/src/java/org/apache/lucene/analysis/StopAnalyzer.java
    lucene/java/trunk/src/test/org/apache/lucene/analysis/TestStandardAnalyzer.java

Modified: lucene/java/trunk/CHANGES.txt
URL: http://svn.apache.org/viewvc/lucene/java/trunk/CHANGES.txt?rev=428998&r1=428997&r2=428998&view=diff
==============================================================================
--- lucene/java/trunk/CHANGES.txt (original)
+++ lucene/java/trunk/CHANGES.txt Sat Aug  5 06:11:09 2006
@@ -4,6 +4,15 @@

 Trunk (not yet released)

+Changes in runtime behavior
+
+ 1. 's' and 't' have been removed from the list of default stopwords
+    in StopAnalyzer (also used by StandardAnalyzer). Having e.g. 's'
+    as a stopword meant that 's-class' led to the same results as 'class'.
+    Note that this problem still exists for 'a', e.g. in 'a-class' as
+    'a' continues to be a stopword.
+    (Daniel Naber)
+
 New features

  1. LUCENE-503: New ThaiAnalyzer and ThaiWordFilter in contrib/analyzers

Modified: lucene/java/trunk/src/java/org/apache/lucene/analysis/StopAnalyzer.java
URL: http://svn.apache.org/viewvc/lucene/java/trunk/src/java/org/apache/lucene/analysis/StopAnalyzer.java?rev=428998&r1=428997&r2=428998&view=diff
==============================================================================
--- lucene/java/trunk/src/java/org/apache/lucene/analysis/StopAnalyzer.java (original)
+++ lucene/java/trunk/src/java/org/apache/lucene/analysis/StopAnalyzer.java Sat Aug  5 06:11:09 2006
@@ -31,8 +31,8 @@
   public static final String[] ENGLISH_STOP_WORDS = {
     "a", "an", "and", "are", "as", "at", "be", "but", "by",
     "for", "if", "in", "into", "is", "it",
-    "no", "not", "of", "on", "or", "s", "such",
-    "t", "that", "the", "their", "then", "there", "these",
+    "no", "not", "of", "on", "or", "such",
+    "that", "the", "their", "then", "there", "these",
     "they", "this", "to", "was", "will", "with"
   };

Modified: lucene/java/trunk/src/test/org/apache/lucene/analysis/TestStandardAnalyzer.java
URL: http://svn.apache.org/viewvc/lucene/java/trunk/src/test/org/apache/lucene/analysis/TestStandardAnalyzer.java?rev=428998&r1=428997&r2=428998&view=diff
==============================================================================
--- lucene/java/trunk/src/test/org/apache/lucene/analysis/TestStandardAnalyzer.java (original)
+++ lucene/java/trunk/src/test/org/apache/lucene/analysis/TestStandardAnalyzer.java Sat Aug  5 06:11:09 2006
@@ -55,7 +55,17 @@
     // possessives are actually removed by StandardFilter, not the tokenizer
     assertAnalyzesTo(a, "O'Reilly", new String[]{"o'reilly"});
     assertAnalyzesTo(a, "you're", new String[]{"you're"});
+    assertAnalyzesTo(a, "she's", new String[]{"she"});
+    assertAnalyzesTo(a, "Jim's", new String[]{"jim"});
+    assertAnalyzesTo(a, "don't", new String[]{"don't"});
     assertAnalyzesTo(a, "O'Reilly's", new String[]{"o'reilly"});
+
+    // t and s had been stopwords in Lucene <= 2.0, which made it impossible
+    // to correctly search for these terms:
+    assertAnalyzesTo(a, "s-class", new String[]{"s", "class"});
+    assertAnalyzesTo(a, "t-com", new String[]{"t", "com"});
+    // 'a' is still a stopword:
+    assertAnalyzesTo(a, "a-class", new String[]{"class"});

     // company names
     assertAnalyzesTo(a, "AT&T", new String[]{"at&t"});
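The behavioral change in the diff can be illustrated outside Lucene. Below is a minimal sketch: the stop lists are copied from the patch, but the `analyze` method is a hypothetical stand-in for StandardAnalyzer's tokenizer-plus-StopFilter chain (lowercase, split on '-', drop stop words), not the real analysis pipeline:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class StopListChange {
    // Default English stop list before the patch (Lucene <= 2.0), from the diff:
    static final Set<String> OLD_STOPS = Set.of(
        "a", "an", "and", "are", "as", "at", "be", "but", "by",
        "for", "if", "in", "into", "is", "it",
        "no", "not", "of", "on", "or", "s", "such",
        "t", "that", "the", "their", "then", "there", "these",
        "they", "this", "to", "was", "will", "with");

    // The patched list: identical except "s" and "t" are removed.
    static final Set<String> NEW_STOPS = OLD_STOPS.stream()
        .filter(w -> !w.equals("s") && !w.equals("t"))
        .collect(Collectors.toUnmodifiableSet());

    // Hypothetical stand-in for tokenizer + stop filter:
    // lowercase, split on '-', then drop any token in the stop set.
    static List<String> analyze(String text, Set<String> stops) {
        return Arrays.stream(text.toLowerCase().split("-"))
                .filter(tok -> !stops.contains(tok))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Before: "s-class" collapses to the same terms as plain "class".
        System.out.println(analyze("s-class", OLD_STOPS)); // prints [class]
        // After: the "s" token survives, so the query is distinguishable.
        System.out.println(analyze("s-class", NEW_STOPS)); // prints [s, class]
        // 'a' remains a stop word, so "a-class" still loses its prefix.
        System.out.println(analyze("a-class", NEW_STOPS)); // prints [class]
    }
}
```

This mirrors the new test assertions: "s-class" now analyzes to {"s", "class"} while "a-class" still yields only {"class"}.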
