It's easy to write an analyzer: you basically chain together a few TokenFilters and call it a day. To back up that statement, here is an example Spanish analyzer, written by someone who basically threw his complete Spanish vocabulary into the stop word list. DictionaryLoader is a class that loads your Hunspell dictionaries (.aff and .dic files) from your storage (filesystem, embedded resources, etc.). There is some further development that can be done, such as overriding ReusableTokenStream and verifying that the filters are in the correct order.

using System;
using System.Collections;
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Hunspell;
using Lucene.Net.Analysis.Standard;
using Version = Lucene.Net.Util.Version;

public class SpanishHunspellAnalyzer : Analyzer {
    // Loaded once and shared by all instances of the analyzer.
    private static readonly HunspellDictionary Dictionary = DictionaryLoader.Load(@"es_ES");

    // Stop words are stored as keys; the values are unused.
    private static readonly Hashtable Stopwords = new Hashtable {
        { "Me", null }, { "no", null }, { "habla", null }, { "español", null }
    };

    public override TokenStream TokenStream(String fieldName, TextReader reader) {
        var stream = new StandardTokenizer(Version.LUCENE_29, reader);

        // Lowercase, then stem, then drop stop words (compared case-insensitively).
        TokenFilter filter = new LowerCaseFilter(stream);
        filter = new HunspellStemFilter(filter, Dictionary);
        filter = new StopFilter(true, filter, Stopwords, true);
        return filter;
    }
}

// Simon

On 2012-06-14 18:44, vicente garcia wrote:
Thank you Simon. You can specify a
"Raven.Database.Indexing.Collation.Cultures.EsCollationAnalyzer,
Raven.Database", but you can't perform full-text search queries because
this index doesn't tokenize the content.
http://ravendb.net/docs/client-api/querying/static-indexes/customizing-results-order

I saw that there is no SpanishAnalyzer; we only have a
SpanishStemmer, but I don't need a stemmer, I need a Spanish analyzer
with its stop words, etc.

Does anyone have another idea on how to index Spanish content?

Thank you very much :)

On Thu, Jun 14, 2012 at 4:59 PM, Simon Svensson <si...@devhost.se> wrote:
Welcome,

See Configuring index options[1] to specify a custom analyzer that can
handle Spanish content.

A quick check shows that Contrib.Analyzers does not contain a Spanish
analyzer. There is a SpanishStemmer available in the Snowball contrib. You
could also use a Spanish Hunspell dictionary for stemming[2].

// Simon

[1] http://ravendb.net/docs/client-api/querying/static-indexes/configuring-index-options
[2] https://github.com/sisve/Lucene.Net.Analysis.Hunspell


On 2012-06-14 16:49, vicente garcia wrote:
Hi all, this is my first mail to this list :)

I'd like to index Spanish content in RavenDB. I have been searching a
lot, but I don't know how to do it.

Could someone help me please?

Thanks :)


