Stop words and stemming always make literal searching less precise.
The general benefit is greater matching power (queries match more
broadly) and a smaller index.
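
To make that trade-off concrete, here is a minimal sketch in plain Java.
It does not use Lucene at all: the StopWordSketch class and its analyze
method are hypothetical names, and splitting on non-alphanumeric
characters is only a rough stand-in for what the real tokenizer does.
The two stop sets mirror ENGLISH_STOP_WORDS before and after the commit
quoted below.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration only; this is NOT Lucene's StopAnalyzer.
public class StopWordSketch {
    // Stop list as it stood before the change (still contains "s" and "t").
    public static final Set<String> OLD_STOP_WORDS = new HashSet<>(Arrays.asList(
        "a", "an", "and", "are", "as", "at", "be", "but", "by",
        "for", "if", "in", "into", "is", "it",
        "no", "not", "of", "on", "or", "s", "such",
        "t", "that", "the", "their", "then", "there", "these",
        "they", "this", "to", "was", "will", "with"));

    // Stop list after the change: the same words minus "s" and "t".
    public static final Set<String> NEW_STOP_WORDS = new HashSet<>(OLD_STOP_WORDS);
    static {
        NEW_STOP_WORDS.remove("s");
        NEW_STOP_WORDS.remove("t");
    }

    // Assumption: lowercase, split on anything that is not a letter or digit,
    // then drop every token found in the stop set.
    public static List<String> analyze(String text, Set<String> stopWords) {
        List<String> terms = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty() && !stopWords.contains(token)) {
                terms.add(token);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(analyze("t-online", OLD_STOP_WORDS)); // [online]
        System.out.println(analyze("t-online", NEW_STOP_WORDS)); // [t, online]
        System.out.println(analyze("a-class", NEW_STOP_WORDS));  // [class]
    }
}
```

With the old list, "t-online" and "online" reduce to the same single
term, so they cannot be told apart at search time. With the new list the
"t" survives, at the cost of indexing one more term.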

Where did the English stop word list come from?  I feel as if I don't
have enough info to judge whether this is a good change.

-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server


On 8/5/06, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
Author: dnaber
Date: Sat Aug  5 06:11:09 2006
New Revision: 428998

URL: http://svn.apache.org/viewvc?rev=428998&view=rev
Log:
remove "s" and "t" as stopwords because they make searching less precise, e.g. "t-online" gives the 
same results as "online" with "t" being a stopword

Modified:
    lucene/java/trunk/CHANGES.txt
    lucene/java/trunk/src/java/org/apache/lucene/analysis/StopAnalyzer.java
    lucene/java/trunk/src/test/org/apache/lucene/analysis/TestStandardAnalyzer.java

Modified: lucene/java/trunk/CHANGES.txt
URL: http://svn.apache.org/viewvc/lucene/java/trunk/CHANGES.txt?rev=428998&r1=428997&r2=428998&view=diff
==============================================================================
--- lucene/java/trunk/CHANGES.txt (original)
+++ lucene/java/trunk/CHANGES.txt Sat Aug  5 06:11:09 2006
@@ -4,6 +4,15 @@

 Trunk (not yet released)

+Changes in runtime behavior
+
+ 1. 's' and 't' have been removed from the list of default stopwords
+    in StopAnalyzer (also used in by StandardAnalyzer). Having e.g. 's'
+    as a stopword meant that 's-class' led to the same results as 'class'.
+    Note that this problem still exists for 'a', e.g. in 'a-class' as
+    'a' continues to be a stopword.
+    (Daniel Naber)
+
 New features

  1. LUCENE-503: New ThaiAnalyzer and ThaiWordFilter in contrib/analyzers

Modified: lucene/java/trunk/src/java/org/apache/lucene/analysis/StopAnalyzer.java
URL: http://svn.apache.org/viewvc/lucene/java/trunk/src/java/org/apache/lucene/analysis/StopAnalyzer.java?rev=428998&r1=428997&r2=428998&view=diff
==============================================================================
--- lucene/java/trunk/src/java/org/apache/lucene/analysis/StopAnalyzer.java (original)
+++ lucene/java/trunk/src/java/org/apache/lucene/analysis/StopAnalyzer.java Sat Aug  5 06:11:09 2006
@@ -31,8 +31,8 @@
   public static final String[] ENGLISH_STOP_WORDS = {
     "a", "an", "and", "are", "as", "at", "be", "but", "by",
     "for", "if", "in", "into", "is", "it",
-    "no", "not", "of", "on", "or", "s", "such",
-    "t", "that", "the", "their", "then", "there", "these",
+    "no", "not", "of", "on", "or", "such",
+    "that", "the", "their", "then", "there", "these",
     "they", "this", "to", "was", "will", "with"
   };


Modified: lucene/java/trunk/src/test/org/apache/lucene/analysis/TestStandardAnalyzer.java
URL: http://svn.apache.org/viewvc/lucene/java/trunk/src/test/org/apache/lucene/analysis/TestStandardAnalyzer.java?rev=428998&r1=428997&r2=428998&view=diff
==============================================================================
--- lucene/java/trunk/src/test/org/apache/lucene/analysis/TestStandardAnalyzer.java (original)
+++ lucene/java/trunk/src/test/org/apache/lucene/analysis/TestStandardAnalyzer.java Sat Aug  5 06:11:09 2006
@@ -55,7 +55,17 @@
     // possessives are actually removed by StardardFilter, not the tokenizer
     assertAnalyzesTo(a, "O'Reilly", new String[]{"o'reilly"});
     assertAnalyzesTo(a, "you're", new String[]{"you're"});
+    assertAnalyzesTo(a, "she's", new String[]{"she"});
+    assertAnalyzesTo(a, "Jim's", new String[]{"jim"});
+    assertAnalyzesTo(a, "don't", new String[]{"don't"});
     assertAnalyzesTo(a, "O'Reilly's", new String[]{"o'reilly"});
+
+    // t and s had been stopwords in Lucene <= 2.0, which made it impossible
+    // to correctly search for these terms:
+    assertAnalyzesTo(a, "s-class", new String[]{"s", "class"});
+    assertAnalyzesTo(a, "t-com", new String[]{"t", "com"});
+    // 'a' is still a stopword:
+    assertAnalyzesTo(a, "a-class", new String[]{"class"});

     // company names
     assertAnalyzesTo(a, "AT&T", new String[]{"at&t"});

