It's easy to write analyzers: you basically chain together a few
TokenFilters and call it a day. To back up that statement, here is
an example Spanish analyzer written by someone who basically threw his
complete Spanish vocabulary into the stop word list. DictionaryLoader is
a class which loads your hunspell dictionaries (.aff and .dic files)
from your storage (filesystem, embedded resources, etc). There is some
further development that can be done, like overriding
ReusableTokenStream and verifying that the filters are in the correct order.

using System;
using System.Collections;
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Hunspell;
using Lucene.Net.Analysis.Standard;
using Version = Lucene.Net.Util.Version;

public class SpanishHunspellAnalyzer : Analyzer {
    private static readonly HunspellDictionary Dictionary =
        DictionaryLoader.Load(@"es_ES");

    private static readonly Hashtable Stopwords = new Hashtable {
        { "Me", null }, { "no", null }, { "habla", null }, { "español", null }
    };

    public override TokenStream TokenStream(String fieldName,
                                            TextReader reader) {
        var stream = new StandardTokenizer(Version.LUCENE_29, reader);
        TokenFilter filter = new LowerCaseFilter(stream);
        filter = new HunspellStemFilter(filter, Dictionary);
        filter = new StopFilter(true, filter, Stopwords, true);
        return filter;
    }
}
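
The ReusableTokenStream override mentioned above could look roughly like
the following. This is an untested sketch, not a definitive implementation.
It assumes the Lucene.Net 2.9.x API (GetPreviousTokenStream,
SetPreviousTokenStream and Tokenizer.Reset(TextReader)), and it assumes that
HunspellStemFilter can safely be reused once the tokenizer has been reset,
which should be verified. The names ReusableSpanishHunspellAnalyzer,
SavedStreams and BuildChain are made up for illustration; DictionaryLoader is
the same loader class as in the example above.

```csharp
using System;
using System.Collections;
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Hunspell;
using Lucene.Net.Analysis.Standard;
using Version = Lucene.Net.Util.Version;

// Hypothetical variant of SpanishHunspellAnalyzer that reuses its filter chain.
public class ReusableSpanishHunspellAnalyzer : Analyzer {
    private static readonly HunspellDictionary Dictionary =
        DictionaryLoader.Load(@"es_ES");

    private static readonly Hashtable Stopwords = new Hashtable {
        { "Me", null }, { "no", null }, { "habla", null }, { "español", null }
    };

    // Keeps the tokenizer and the end of the filter chain between calls.
    private sealed class SavedStreams {
        public Tokenizer Source;
        public TokenStream Result;
    }

    // Same filter order as the example above; the order still needs verifying.
    private static TokenStream BuildChain(Tokenizer source) {
        TokenStream result = new LowerCaseFilter(source);
        result = new HunspellStemFilter(result, Dictionary);
        result = new StopFilter(true, result, Stopwords, true);
        return result;
    }

    public override TokenStream TokenStream(String fieldName,
                                            TextReader reader) {
        return BuildChain(new StandardTokenizer(Version.LUCENE_29, reader));
    }

    public override TokenStream ReusableTokenStream(String fieldName,
                                                    TextReader reader) {
        // Assumption: Lucene.Net 2.9.x stores the previous stream per thread.
        var streams = (SavedStreams) GetPreviousTokenStream();
        if (streams == null) {
            // First call on this thread: build the chain once and remember it.
            streams = new SavedStreams();
            streams.Source = new StandardTokenizer(Version.LUCENE_29, reader);
            streams.Result = BuildChain(streams.Source);
            SetPreviousTokenStream(streams);
        } else {
            // Later calls: point the existing tokenizer at the new reader.
            streams.Source.Reset(reader);
        }
        return streams.Result;
    }
}
```

The point of the override is to avoid allocating a new tokenizer and three
filters for every field of every document; the plain TokenStream method is
kept for callers that do not use the reusable path.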
// Simon
On 2012-06-14 18:44, vicente garcia wrote:
Thank you Simon. You can specify a
"Raven.Database.Indexing.Collation.Cultures.EsCollationAnalyzer,
Raven.Database", but you can't perform full-text search queries because
this index doesn't tokenize the content.
http://ravendb.net/docs/client-api/querying/static-indexes/customizing-results-order
I saw that there is no SpanishAnalyzer; we only have a
SpanishStemmer, but I don't need a stemmer, I need a Spanish analyzer
with its stop words, etc.
Does anyone have another idea on how to index Spanish content?
Thank you very much :)
On Thu, Jun 14, 2012 at 4:59 PM, Simon Svensson <si...@devhost.se> wrote:
Welcome,
See Configuring index options[1] to specify a custom analyzer that can
handle Spanish content.
A quick check shows that Contrib.Analyzers does not contain a Spanish
analyzer. There is a SpanishStemmer available in the Snowball contrib. You
could also use a Spanish hunspell dictionary for stemming[2].
// Simon
[1]
http://ravendb.net/docs/client-api/querying/static-indexes/configuring-index-options
[2] https://github.com/sisve/Lucene.Net.Analysis.Hunspell
On 2012-06-14 16:49, vicente garcia wrote:
Hi all, this is my first mail to this list :)
I'd like to index Spanish content in RavenDB. I have been searching a
lot, but I don't know how to do it.
Could someone help me, please?
Thanks :)