Dear Wiki user, You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.
The "LanguageAnalysis" page has been changed by RobertMuir.
The comment on this change is: docs for new stem factories.
http://wiki.apache.org/solr/LanguageAnalysis?action=diff&rev1=3&rev2=4

--------------------------------------------------

  <!> Note: See also {{{Decompounding}}} below.

  === English ===
- Solr includes two stemmers for English, the original Porter stemmer via {{{solr.PorterStemFilterFactory}}}, and the Porter2 stemmer via {{{solr.SnowballPorterFilterFactory}}}, as well as an example stopword list.
+ Solr includes three stemmers for English: the original Porter stemmer via {{{solr.PorterStemFilterFactory}}}, the Porter2 stemmer via {{{solr.SnowballPorterFilterFactory}}}, and a plural-only stemmer <!> [[Solr3.1]] via {{{solr.EnglishMinimalStemFilterFactory}}}. Lucene includes an example stopword list from the snowball project.

  {{{
  ...
@@ -120, +120 @@

  [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/snowball/english_stop.txt|stopwords]]

  === Finnish ===
- Solr includes support for stemming Finnish via {{{solr.SnowballPorterFilterFactory}}}, and Lucene includes an example stopword list.
+ Solr includes two stemmers for Finnish: one via {{{solr.SnowballPorterFilterFactory}}}, and an alternative stemmer <!> [[Solr3.1]] via {{{solr.FinnishLightStemFilterFactory}}}. Lucene includes an example stopword list.

  {{{
  ...
@@ -130, +130 @@

  }}}
  Example set of Finnish [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/snowball/finnish_stop.txt|stopwords]] (Be sure to switch your browser encoding to UTF-8)

+ <!> Note: See also {{{Decompounding}}} below.
+
+ <!> Note: The Snowball stemmer for Finnish has known bugs, due to a bug in [[http://article.gmane.org/gmane.comp.search.snowball/1139|snowball itself]]

  === French ===
- Solr includes support for stemming French via {{{solr.SnowballPorterFilterFactory}}}, removing elisions via ElisionFilterFactory, and Lucene includes an example stopword list.
+ Solr includes three stemmers for French: one via {{{solr.SnowballPorterFilterFactory}}}, an alternative stemmer <!> [[Solr3.1]] via {{{solr.FrenchLightStemFilterFactory}}}, and an even less aggressive approach <!> [[Solr3.1]] via {{{solr.FrenchMinimalStemFilterFactory}}}. Solr can also remove elisions via {{{solr.ElisionFilterFactory}}}, and Lucene includes an example stopword list.

  {{{
  ...
@@ -149, +152 @@

  <!> Note: It's probably best to use the ElisionFilter before WordDelimiterFilter. This will prevent very slow phrase queries.

  === German ===
- Solr includes support for stemming German with three different algorithms: two via {{{solr.SnowballPorterFilterFactory}}}, and one via {{{solr.GermanStemFilterFactory}}}, and Lucene includes an example stopword list.
+ Solr includes support for stemming German with five different algorithms: two via {{{solr.SnowballPorterFilterFactory}}}, one via {{{solr.GermanStemFilterFactory}}}, a lightweight stemmer <!> [[Solr3.1]] via {{{solr.GermanLightStemFilterFactory}}}, and an even less aggressive approach <!> [[Solr3.1]] via {{{solr.GermanMinimalStemFilterFactory}}}. Lucene includes an example stopword list.

  With the {{{solr.SnowballPorterFilterFactory}}} you can supply two different language attributes: "German" and "German2". German2 is just a modified version of German that handles the umlaut characters differently: for example it treats "ü" as "ue" in most contexts.

@@ -197, +200 @@

  === Hungarian ===
- Solr includes support for stemming Hungarian via {{{solr.SnowballPorterFilterFactory}}}, and Lucene includes an example stopword list.
+ Solr includes two stemmers for Hungarian: one via {{{solr.SnowballPorterFilterFactory}}}, and an alternative stemmer <!> [[Solr3.1]] via {{{solr.HungarianLightStemFilterFactory}}}. Lucene includes an example stopword list.

  {{{
  ...
@@ -227, +230 @@

  Example set of Indonesian [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/id/stopwords.txt|stopwords]]

  === Italian ===
- Solr includes support for stemming Italian via {{{solr.SnowballPorterFilterFactory}}}, and Lucene includes an example stopword list.
+ Solr includes two stemmers for Italian: one via {{{solr.SnowballPorterFilterFactory}}}, and an alternative stemmer <!> [[Solr3.1]] via {{{solr.ItalianLightStemFilterFactory}}}. Lucene includes an example stopword list.

  {{{
  ...
@@ -267, +270 @@

  <!> Note: WordDelimiterFilter does not split on joiners by default. You can solve this by using ArabicLetterTokenizerFactory, which does, or by using a custom WordDelimiterFilterFactory which supplies a customized charTypeTable to WordDelimiterFilter. In either case, consider using PositionFilter at query-time (only), as the QueryParser does not consider joiners and could create unwanted phrase queries.

  === Portuguese ===
- Solr includes support for stemming Portuguese via {{{solr.SnowballPorterFilterFactory}}}, and Lucene includes an example stopword list.
+ Solr includes three stemmers for Portuguese: one via {{{solr.SnowballPorterFilterFactory}}}, an alternative stemmer <!> [[Solr3.1]] via {{{solr.PortugueseLightStemFilterFactory}}}, and an even less aggressive approach <!> [[Solr3.1]] via {{{solr.PortugueseMinimalStemFilterFactory}}}. Lucene includes an example stopword list.

  {{{
  ...
@@ -291, +294 @@

  Example set of Romanian [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/ro/stopwords.txt|stopwords]] (Be sure to switch your browser encoding to UTF-8)

  === Russian ===
- Solr includes support for stemming Russian via {{{solr.SnowballPorterFilterFactory}}}, and Lucene includes an example stopword list.
+ Solr includes two stemmers for Russian: one via {{{solr.SnowballPorterFilterFactory}}}, and an alternative stemmer <!> [[Solr3.1]] via {{{solr.RussianLightStemFilterFactory}}}. Lucene includes an example stopword list.

  {{{
  ...
@@ -303, +306 @@

  Example set of Russian [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/snowball/russian_stop.txt|stopwords]] (Be sure to switch your browser encoding to UTF-8)

  === Spanish ===
- Solr includes support for stemming Spanish via {{{solr.SnowballPorterFilterFactory}}}, and Lucene includes an example stopword list.
+ Solr includes two stemmers for Spanish: one via {{{solr.SnowballPorterFilterFactory}}}, and an alternative stemmer <!> [[Solr3.1]] via {{{solr.SpanishLightStemFilterFactory}}}. Lucene includes an example stopword list.

  {{{
  ...
@@ -315, +318 @@

  Example set of Spanish [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/snowball/spanish_stop.txt|stopwords]] (Be sure to switch your browser encoding to UTF-8)

  === Swedish ===
- Solr includes support for stemming Swedish via {{{solr.SnowballPorterFilterFactory}}}, and Lucene includes an example stopword list.
+ Solr includes two stemmers for Swedish: one via {{{solr.SnowballPorterFilterFactory}}}, and an alternative stemmer <!> [[Solr3.1]] via {{{solr.SwedishLightStemFilterFactory}}}. Lucene includes an example stopword list.

  {{{
  ...
@@ -428, +431 @@

  There is no general rule for whether or not to stem: It depends not only on the language, but also on the properties of your documents and queries.

+ The snowball stemmers are considered fairly aggressive, but for many languages (see above) Solr provides alternatives that are less aggressive. In many situations a lighter approach yields better relevance: often "less is more". The light stemmers typically target the most common noun/adjective inflections, and perhaps a few derivational suffixes. The minimal stemmers are even more conservative and may only remove plural endings.
+
- In general, if the language is highly inflectional, it's worth evaluating as it might bring a significant improvement. Some annoyances caused by stemming can then be handled with tuning: See {{{CustomizingStemming}}} below.
+ In general, if the language is highly inflectional, it's worth evaluating at least a light stemmer as it might bring a significant improvement. Some annoyances caused by stemming can then be handled with tuning: See {{{CustomizingStemming}}} below.

  ==== Notes about solr.PorterStemFilterFactory ====
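
As a rough illustration of the choice between solr.PorterStemFilterFactory and the other English stemmers named in this change, the three factories could be wired into schema.xml along the following lines. This is only a minimal sketch and not text from the wiki page: the field type names (text_en_porter, text_en_porter2, text_en_minimal) are hypothetical, and the tokenizer and lowercase filter are assumptions added to make each analyzer chain complete. The language="English" attribute is assumed to select the Porter2 algorithm, and solr.EnglishMinimalStemFilterFactory is marked [[Solr3.1]] in the change above.

{{{
<!-- Hypothetical field type: original Porter stemmer -->
<fieldType name="text_en_porter" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Hypothetical field type: Porter2 via the Snowball factory -->
<fieldType name="text_en_porter2" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
  </analyzer>
</fieldType>

<!-- Hypothetical field type: plural-only (minimal) stemmer -->
<fieldType name="text_en_minimal" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishMinimalStemFilterFactory"/>
  </analyzer>
</fieldType>
}}}

Running the same sample text through each field type in Solr's analysis tool is one way to judge how aggressively each stemmer conflates terms before committing to one for a given collection.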