[Solr Wiki] Update of "LanguageAnalysis" by RobertMuir

Apache Wiki Wed, 14 Jul 2010 06:49:43 -0700

The "LanguageAnalysis" page has been changed by RobertMuir.
The comment on this change is: docs for new stem factories.
http://wiki.apache.org/solr/LanguageAnalysis?action=diff&rev1=3&rev2=4

--------------------------------------------------

  <!> Note: See also {{{Decompounding}}} below.
  
  === English ===
- Solr includes two stemmers for English, the original Porter stemmer via 
{{{solr.PorterStemFilterFactory}}}, and the Porter2 stemmer via 
{{{solr.SnowballPorterFilterFactory}}}, as well as an example stopword list.
+ Solr includes three stemmers for English: the original Porter stemmer via 
{{{solr.PorterStemFilterFactory}}}, the Porter2 stemmer via 
{{{solr.SnowballPorterFilterFactory}}}, and a plural-only stemmer <!> 
[[Solr3.1]] via {{{solr.EnglishMinimalStemFilterFactory}}}. Lucene includes an 
example stopword list from the snowball project.
  
  {{{
  ...
@@ -120, +120 @@

  
[[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/snowball/english_stop.txt|stopwords]]
  
  === Finnish ===
- Solr includes support for stemming Finnish via 
{{{solr.SnowballPorterFilterFactory}}}, and Lucene includes an example stopword 
list.
+ Solr includes two stemmers for Finnish: one via 
{{{solr.SnowballPorterFilterFactory}}}, and an alternative stemmer <!> 
[[Solr3.1]] via {{{solr.FinnishLightStemFilterFactory}}}. Lucene includes an 
example stopword list.
  
  {{{
  ...
@@ -130, +130 @@

  }}}
  
  Example set of Finnish 
[[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/snowball/finnish_stop.txt|stopwords]]
 (Be sure to switch your browser encoding to UTF-8)
+ 
  <!> Note: See also {{{Decompounding}}} below.
+ 
+ <!> Note: The Snowball stemmer for Finnish has known problems caused by a bug in 
[[http://article.gmane.org/gmane.comp.search.snowball/1139|snowball itself]].
  
  === French ===
- Solr includes support for stemming French via 
{{{solr.SnowballPorterFilterFactory}}}, removing elisions via 
ElisionFilterFactory, and Lucene includes an example stopword list.
+ Solr includes three stemmers for French: one via 
{{{solr.SnowballPorterFilterFactory}}}, an alternative stemmer <!> [[Solr3.1]] 
via {{{solr.FrenchLightStemFilterFactory}}}, and an even less aggressive 
approach <!> [[Solr3.1]] via {{{solr.FrenchMinimalStemFilterFactory}}}. Solr 
can also remove elisions via {{{solr.ElisionFilterFactory}}}, and Lucene 
includes an example stopword list.
  
  {{{
  ...
@@ -149, +152 @@

  <!> Note: It's probably best to use the ElisionFilter before 
WordDelimiterFilter. This will prevent very slow phrase queries.
  
  === German ===
- Solr includes support for stemming German with three different algorithms: 
two via {{{solr.SnowballPorterFilterFactory}}}, and one via 
{{{solr.GermanStemFilterFactory}}}, and Lucene includes an example stopword 
list.
+ Solr includes support for stemming German with five different algorithms: two 
via {{{solr.SnowballPorterFilterFactory}}}, one via 
{{{solr.GermanStemFilterFactory}}}, a lightweight stemmer <!> [[Solr3.1]] via 
{{{solr.GermanLightStemFilterFactory}}}, and an even less aggressive approach 
<!> [[Solr3.1]] via {{{solr.GermanMinimalStemFilterFactory}}}. Lucene includes 
an example stopword list.
  
  With the {{{solr.SnowballPorterFilterFactory}}} you can supply two different 
language attributes: "German" and "German2". German2 is just a modified version 
of German that handles the umlaut characters differently: for example it treats 
"ü" as "ue" in most contexts.
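
As a rough illustration of the language attribute described above, an analyzer chain in schema.xml that selects the German2 variant might look like the sketch below. The field type name {{{text_de}}} and the choice of tokenizer and lowercase filter are assumptions made for this example, not something the page prescribes; only the stem filter class and the two language values come from the text.

{{{
<!-- Hypothetical field type; the name "text_de" is illustrative only -->
<fieldType name="text_de" class="solr.TextField">
  <analyzer>
    <!-- Tokenizer and lowercasing are assumed here, not taken from the page -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- language may be "German" or "German2" -->
    <filter class="solr.SnowballPorterFilterFactory" language="German2"/>
  </analyzer>
</fieldType>
}}}

Switching the attribute back to "German" selects the unmodified Snowball algorithm, so the two variants can be compared on the same data by changing a single value and reindexing.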
  
@@ -197, +200 @@

  
  === Hungarian ===
  
- Solr includes support for stemming Hungarian via 
{{{solr.SnowballPorterFilterFactory}}}, and Lucene includes an example stopword 
list.
+ Solr includes two stemmers for Hungarian: one via 
{{{solr.SnowballPorterFilterFactory}}}, and an alternative stemmer <!> 
[[Solr3.1]] via {{{solr.HungarianLightStemFilterFactory}}}. Lucene includes an 
example stopword list.
  
  {{{
  ...
@@ -227, +230 @@

  Example set of Indonesian 
[[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/id/stopwords.txt|stopwords]]
  
  === Italian ===
- Solr includes support for stemming Italian via 
{{{solr.SnowballPorterFilterFactory}}}, and Lucene includes an example stopword 
list.
+ Solr includes two stemmers for Italian: one via 
{{{solr.SnowballPorterFilterFactory}}}, and an alternative stemmer <!> 
[[Solr3.1]] via {{{solr.ItalianLightStemFilterFactory}}}. Lucene includes an 
example stopword list.
  
  {{{
  ...
@@ -267, +270 @@

  <!> Note: WordDelimiterFilter does not split on joiners by default. You can 
solve this by using ArabicLetterTokenizerFactory, which does, or by using a 
custom WordDelimiterFilterFactory which supplies a customized charTypeTable to 
WordDelimiterFilter. In either case, consider using PositionFilter at 
query-time (only), as the QueryParser does not consider joiners and could 
create unwanted phrase queries.
  
  === Portuguese ===
- Solr includes support for stemming Portuguese via 
{{{solr.SnowballPorterFilterFactory}}}, and Lucene includes an example stopword 
list.
+ Solr includes three stemmers for Portuguese: one via 
{{{solr.SnowballPorterFilterFactory}}}, an alternative stemmer <!> [[Solr3.1]] 
via {{{solr.PortugueseLightStemFilterFactory}}}, and an even less aggressive 
approach <!> [[Solr3.1]] via {{{solr.PortugueseMinimalStemFilterFactory}}}. 
Lucene includes an example stopword list.
  
  {{{
  ...
@@ -291, +294 @@

  Example set of Romanian 
[[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/ro/stopwords.txt|stopwords]]
 (Be sure to switch your browser encoding to UTF-8)
  
  === Russian ===
- Solr includes support for stemming Russian via 
{{{solr.SnowballPorterFilterFactory}}}, and Lucene includes an example stopword 
list.
+ Solr includes two stemmers for Russian: one via 
{{{solr.SnowballPorterFilterFactory}}}, and an alternative stemmer <!> 
[[Solr3.1]] via {{{solr.RussianLightStemFilterFactory}}}. Lucene includes an 
example stopword list.
  
  {{{
  ...
@@ -303, +306 @@

  Example set of Russian 
[[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/snowball/russian_stop.txt|stopwords]]
 (Be sure to switch your browser encoding to UTF-8)
  
  === Spanish ===
- Solr includes support for stemming Spanish via 
{{{solr.SnowballPorterFilterFactory}}}, and Lucene includes an example stopword 
list.
+ Solr includes two stemmers for Spanish: one via 
{{{solr.SnowballPorterFilterFactory}}}, and an alternative stemmer <!> 
[[Solr3.1]] via {{{solr.SpanishLightStemFilterFactory}}}. Lucene includes an 
example stopword list.
  
  {{{
  ...
@@ -315, +318 @@

  Example set of Spanish 
[[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/snowball/spanish_stop.txt|stopwords]]
 (Be sure to switch your browser encoding to UTF-8)
  
  === Swedish ===
- Solr includes support for stemming Swedish via 
{{{solr.SnowballPorterFilterFactory}}}, and Lucene includes an example stopword 
list.
+ Solr includes two stemmers for Swedish: one via 
{{{solr.SnowballPorterFilterFactory}}}, and an alternative stemmer <!> 
[[Solr3.1]] via {{{solr.SwedishLightStemFilterFactory}}}. Lucene includes an 
example stopword list.
  
  {{{
  ...
@@ -428, +431 @@

  
  There is no general rule for whether or not to stem: It depends not only on 
the language, but also on the properties of your documents and queries.
  
+ The snowball stemmers are considered fairly aggressive, but for many 
languages (see above) Solr provides alternatives that are less aggressive. In 
many situations a lighter approach yields better relevance: often "less is 
more". The light stemmers typically target the most common noun/adjective 
inflections, and perhaps a few derivational suffixes. The minimal stemmers are 
even more conservative and may only remove plural endings.
+ 
- In general, if the language is highly inflectional, its worth evaluating as 
it might bring a significant improvement. Some annoyances caused by stemming 
can then be handled with tuning: See {{{CustomizingStemming}}} below.
+ In general, if the language is highly inflectional, it's worth evaluating at 
least a light stemmer, as it might bring a significant improvement. Some 
annoyances caused by stemming can then be handled with tuning: see 
{{{CustomizingStemming}}} below.
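
One way to evaluate a lighter stemmer, sketched here under some assumptions, is to define two field types that differ only in the stem filter and compare results on your own documents and queries. The field type names {{{text_fr_snowball}}} and {{{text_fr_light}}} are hypothetical, French is used only as an example language, and the tokenizer and lowercase filter are assumed. The light stem factory requires Solr 3.1 or later, as noted in the language sections.

{{{
<!-- Hypothetical field types for a side-by-side comparison -->
<fieldType name="text_fr_snowball" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- More aggressive: Snowball stemmer -->
    <filter class="solr.SnowballPorterFilterFactory" language="French"/>
  </analyzer>
</fieldType>

<fieldType name="text_fr_light" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Less aggressive: light stemmer (Solr 3.1 or later) -->
    <filter class="solr.FrenchLightStemFilterFactory"/>
  </analyzer>
</fieldType>
}}}

Indexing the same content into a field of each type keeps everything except the stemmer constant, which makes relevance differences easier to attribute to the stemming choice.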
  
  ==== Notes about solr.PorterStemFilterFactory ====
