DO NOT REPLY TO THIS EMAIL, BUT PLEASE POST YOUR BUG RELATED COMMENTS THROUGH THE WEB INTERFACE AVAILABLE AT <http://nagoya.apache.org/bugzilla/show_bug.cgi?id=12569>. ANY REPLY MADE TO THIS MESSAGE WILL NOT BE COLLECTED AND INSERTED IN THE BUG DATABASE.
http://nagoya.apache.org/bugzilla/show_bug.cgi?id=12569 Uppercase/lowercase distinction in GermanStemmer not sustainable Summary: Uppercase/lowercase distinction in GermanStemmer not sustainable Product: Lucene Version: CVS Nightly - Specify date in submission Platform: All OS/Version: Other Status: NEW Severity: Normal Priority: Other Component: QueryParser AssignedTo: [EMAIL PROTECTED] ReportedBy: [EMAIL PROTECTED] I had a problem with the German stemmer, since it tries to detect nouns by looking for an uppercase first letter. This information is only used when a word ends with "t" in which case it is not stemmed. However, it's very naive to think words are nouns if and only if they begin with a capital letter. They may also be at the beginning with a sentence or within a quote, in which cases they may be set uppercase. Much worse, however, is the fact that most people write their queries in lower case. That means words can be stemmed differently in the query than in the index, leading to different results if someone entered the query in upper- or lowercase. Example: The word "Fakultäten" is stemmed to "fakultat", while "fakultäten" becomes "fakulta". I commented the lines out in GermanStemmer (see below, the diff is from CVS version 2002-09-05). However, I'm not enough a linguist to tell whether it is too much to stem a trailing "t" from a noun. Regards --Clemens --- GermanStemmer.java~1~ 2002-08-19 07:13:42.000000000 +0000 +++ GermanStemmer.java 2002-09-11 15:59:53.000000000 +0000 @@ -72,7 +72,7 @@ /** * Indicates if a term is handled as a noun. */ - private boolean uppercase = false; +// private boolean uppercase = false; /** * Amount of characters that are removed with <tt>substitute()</tt> while stemming. @@ -88,7 +88,9 @@ protected String stem( String term ) { // Mark a possible noun. - uppercase = Character.isUpperCase( term.charAt( 0 ) ); + /* uppercase = Character.isUpperCase( term.charAt( 0 ) ); + Can't use this - People don't use uppercase words in queries. --Clemens + */ // Use lowercase for medium stemming. term = term.toLowerCase(); if ( !isStemmable( term ) ) @@ -153,7 +155,7 @@ buffer.deleteCharAt( buffer.length() - 1 ); } // "t" occurs only as suffix of verbs. - else if ( buffer.charAt( buffer.length() - 1 ) == 't' && ! uppercase ) { + else if ( buffer.charAt( buffer.length() - 1 ) == 't' /*&& ! uppercase*/ ) { buffer.deleteCharAt( buffer.length() - 1 ); } else { -- To unsubscribe, e-mail: <mailto:[EMAIL PROTECTED]> For additional commands, e-mail: <mailto:[EMAIL PROTECTED]>