Re: [Dspace-tech] searching issues for Spanish

Vezina Marie-Helene Wed, 29 Oct 2008 14:26:31 -0700

Alvaro,

We are running a bilingual (French/English) instance of DSpace (1.5.0). If I
remember correctly, back when we were using version 1.3.2, we modified
org.dspace.search.DSAnalyzer.java and added the ISOLatin1AccentFilter class, as
described in the attached message, so that a user can search for either
"Montréal" or "Montreal" and always get the same number of results. See for
yourself: http://papyrus.bib.umontreal.ca


Regards,

Marie-Hélène Vézina
Librarian · Digital library projects
Direction des bibliothèques, Université de Montréal


 

> -----Original Message-----
> From: Alvaro Sandoval [mailto:[EMAIL PROTECTED] 
> Sent: 29 October 2008 15:16
> To: [email protected]
> Subject: [Dspace-tech] searching issues for Spanish
> 
> Hi all:
> 
> I just found out that the default search analyzer that comes 
> with out-of-the-box DSpace is English-based (which is obvious).
> 
> In the dspace.cfg file there is an example of changing it to a 
> Chinese analyzer 
> (org.apache.lucene.analysis.cn.ChineseAnalyzer). Is there any 
> option for the Spanish language? I couldn't find a Spanish 
> analyzer on the Lucene page:
> http://lucene.apache.org/java/2_2_0/api/org/apache/lucene/analysis/Analyzer.html
> 
> We have at least 2 problems:
> - Accents. We get different hits searching for "Energía" vs. 
> "Energia", and we would expect the same results.
> - Stop words. We have irrelevant words like "a", "al", "de", 
> "del", and we would like to exclude them from the search 
> unless the user wants an exact search. How can I 
> define those stop words?
> 
> Regards,
> 
> --
> Álvaro Sandoval
> BCN, Biblioteca del Congreso Nacional
> Servicios Digitales. Ingeniería y Desarrollo
> Fono: (56-32) 226 3981. Fax: (56-32) 226 3973 www.bcn.cl
> 
> 
> _______________________________________________
> DSpace-tech mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/dspace-tech
>

--- Begin Message ---

Hi everybody,

  I work in computing at the UPC (Technical University of Catalonia)
  Libraries in Barcelona (Spain). Since we installed DSpace and began
  using it, we have had problems searching for words with diacritics:
  the DSpace search does not find words with accented characters when
  the query omits the accents.
  To solve this problem we added a new filter to the DSpace search
  that transforms accented characters in the ISO Latin 1 character set to
  their unaccented counterparts.

  The changes that we have made are:

  1) Modify org.dspace.search.DSAnalyzer.java
      /*
       * Create a token stream for this analyzer.
       */
      public final TokenStream tokenStream(String fieldName, final Reader reader)
      {
          TokenStream result = new DSTokenizer(reader);

          result = new StandardFilter  (result);
          result = new LowerCaseFilter (result);
          result = new StopFilter      (result, stopTable);
          result = new PorterStemFilter(result);
  +      result = new ISOLatin1AccentFilter(result);

          return result;
      }

  2) Add the class ISOLatin1AccentFilter to the package org.dspace.search.

  This class is based on another class, ISOLatin1Filter (package
  fr.gouv.culture.sdx.search.lucene.analysis.filter), from a French
  government project that can be downloaded at http://adnx.org/sdx/
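
  The order of the filters in step 1 matters if the same chain is reused
  for Spanish: the stop-word check sees lower-cased tokens whose accents
  have not yet been removed. What follows is a minimal, Lucene-free sketch
  of that order, not DSpace code. The class name SpanishChainSketch, the
  stop list and the abbreviated accent table are assumptions made for
  illustration, and the English-only Porter stemming step is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Illustration only: a plain-Java model of the filter order, not DSpace or Lucene code.
final class SpanishChainSketch {
    // Hypothetical stop list, taken from the words named in the original question.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "al", "de", "del"));

    // Abbreviated stand-in for ISOLatin1AccentFilter: lower-case Spanish letters only.
    static String foldAccents(String term) {
        StringBuilder out = new StringBuilder(term.length());
        for (int i = 0; i < term.length(); i++) {
            char c = term.charAt(i);
            switch (c) {
                case '\u00e1': out.append('a'); break; // a with acute
                case '\u00e9': out.append('e'); break; // e with acute
                case '\u00ed': out.append('i'); break; // i with acute
                case '\u00f3': out.append('o'); break; // o with acute
                case '\u00fa':                         // u with acute
                case '\u00fc': out.append('u'); break; // u with diaeresis
                case '\u00f1': out.append('n'); break; // n with tilde
                default:       out.append(c);   break;
            }
        }
        return out.toString();
    }

    // Same order as the modified tokenStream(): lower-case, stop words, accent folding.
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String raw : text.split("\\s+")) {
            String term = raw.toLowerCase(Locale.ROOT);
            if (term.isEmpty() || STOP_WORDS.contains(term)) {
                continue; // dropped before any accent folding happens
            }
            terms.add(foldAccents(term));
        }
        return terms;
    }

    public static void main(String[] args) {
        // "Energ\u00eda" is "Energia" written with an accented i.
        System.out.println(analyze("Energ\u00eda del Sol")); // prints [energia, sol]
        System.out.println(analyze("energia sol"));          // prints [energia, sol]
    }
}
```

  Because the stop-word check runs before the accent folding in this
  order, a stop list would have to contain the accented spellings of any
  accented stop words.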

  If somebody has the same problem, I hope this message could help you.


  Best,


/** ISOLatin1AccentFilter.java **/

package org.dspace.search;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import java.io.IOException;

/**
 * A filter that transforms accented characters in the ISO Latin 1
 * character set to their unaccented counterparts.
 *
 * <p>
 * For example, the letter 'é' will be converted to 'e'.
 * <p>
 * This filter doesn't change the character case. To lowercase letters,
 * use another filter as well.
 */
public final class ISOLatin1AccentFilter extends TokenFilter {
  public ISOLatin1AccentFilter(TokenStream in) {
    super(in);
  }

  public final Token next() throws IOException {
        Token t = input.next();

        // End of stream: nothing left to filter.
        if (t == null) {
            return null;
        }

        String tokenText = t.termText();
        StringBuffer chars = new StringBuffer();

        // Loop over the characters, replace those that need to be.
        for (int i = 0; i < tokenText.length(); i++) {
            switch (tokenText.charAt(i)) {
                case 192: //À
                case 193: //Á
                case 194: //Â
                case 195: //Ã
                case 196: //Ä
                case 197: //Å
                    chars.append("A");
                    break;
                case 198: //Æ
                    chars.append("AE");
                    break;
                case 199: //Ç
                    chars.append("C");
                    break;
                case 200: //È
                case 201: //É
                case 202: //Ê
                case 203: //Ë
                    chars.append("E");
                    break;
                case 204: //Ì
                case 205: //Í
                case 206: //Î
                case 207: //Ï
                    chars.append("I");
                    break;
                case 208: //Ð
                    chars.append("D");
                    break;
                case 209: //Ñ
                    chars.append("N");
                    break;
                case 210: //Ò
                case 211: //Ó
                case 212: //Ô
                case 213: //Õ
                case 214: //Ö
                case 216: //Ø
                    chars.append("O");
                    break;
                case 140: //Œ (Windows-1252 code)
                case 338: //Œ (Unicode U+0152)
                    chars.append("OE");
                    break;
                case 222: //Þ
                    chars.append("TH");
                    break;
                case 217: //Ù
                case 218: //Ú
                case 219: //Û
                case 220: //Ü
                    chars.append("U");
                    break;
                case 221: //Ý
                case 159: //Ÿ (Windows-1252 code)
                case 376: //Ÿ (Unicode U+0178)
                    chars.append("Y");
                    break;
                case 224: //à
                case 225: //á
                case 226: //â
                case 227: //ã
                case 228: //ä
                case 229: //å
                    chars.append("a");
                    break;
                case 230: //æ
                    chars.append("ae");
                    break;
                case 231: //ç
                    chars.append("c");
                    break;
                case 232: //è
                case 233: //é
                case 234: //ê
                case 235: //ë
                    chars.append("e");
                    break;
                case 236: //ì
                case 237: //í
                case 238: //î
                case 239: //ï
                    chars.append("i");
                    break;
                case 240: //ð
                    chars.append("d");
                    break;
                case 241: //ñ
                    chars.append("n");
                    break;
                case 242: //ò
                case 243: //ó
                case 244: //ô
                case 245: //õ
                case 246: //ö
                case 248: //ø
                    chars.append("o");
                    break;
                case 156: //œ (Windows-1252 code)
                case 339: //œ (Unicode U+0153)
                    chars.append("oe");
                    break;
                case 223: //ß
                    chars.append("ss");
                    break;
                case 254: //þ
                    chars.append("th");
                    break;
                case 249: //ù
                case 250: //ú
                case 251: //û
                case 252: //ü
                    chars.append("u");
                    break;
                case 253: //ý
                case 255: //ÿ
                    chars.append("y");
                    break;
                default:
                    chars.append(tokenText.charAt(i));
                    break;
            }
        }
        // Finally we return a new token with transformed characters.
        return new Token(chars.toString(), t.startOffset(), t.endOffset(),
                         t.type());
  }
}
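
On Java 6 or later, much of the table above can probably be approximated
with the JDK's java.text.Normalizer. The sketch below is an alternative
technique, not the filter posted above: it decomposes each character (form
NFD) and then drops the combining marks. The class name
NormalizerFoldSketch is hypothetical. Letters that have no canonical
decomposition (for example æ, ø, ß, ð, þ and œ) pass through unchanged, so
they would still need explicit cases like the ones in the switch.

```java
import java.text.Normalizer;

// Illustration only: accent stripping via Unicode decomposition, without Lucene.
final class NormalizerFoldSketch {
    static String fold(String text) {
        // NFD splits an accented letter into a base letter plus combining marks.
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        // \p{M} matches the combining marks, which are then removed.
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        System.out.println(fold("Montr\u00e9al")); // prints Montreal
        System.out.println(fold("Energ\u00eda"));  // prints Energia
        // Sharp s (U+00DF) has no decomposition and is left as it is.
        System.out.println(fold("Stra\u00dfe").equals("Stra\u00dfe")); // prints true
    }
}
```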





--- End Message ---
