The Transducens Group (http://transducens.dlsi.ua.es) at the University of Alicante (http://www.ua.es) has developed a tool that allows the Lucene search engine to use morphological information while indexing, and then to process smarter queries in which morphological attributes can be used to specify query terms.
To that end, the tool makes use of the morphological analyzers and dictionaries developed for the open-source machine translation platform Apertium (http://apertium.org) and, optionally, the part-of-speech taggers developed for it. Morphological dictionaries are currently available for Spanish, Catalan, Galician, Portuguese, Aranese, Romanian, French and English. In addition, new dictionaries are being developed for Esperanto, Occitan, Basque, Swedish, Danish, Welsh, Polish and Italian, among others; we hope that more language pairs will be added to the Apertium machine translation platform in the near future.

We are interested in releasing this tool as open source, and we think the best way to do that would be to integrate it into Lucene's contrib folder, like other third-party tools. Who is responsible for that? To whom should we address this request?

Thank you very much.

==========================
How it works
==========================

Indexing documents through this new framework involves the following steps:

1. The texts to index are analyzed using the morphological analyzer and (optionally) the part-of-speech tagger of the Apertium machine translation platform. Apertium supports files in plain text, rtf, odt, sxw, html and doc formats.

2. The analyzed documents are then indexed as usual, using a Lucene analyzer written specifically to interpret the output of the previous step.

During indexing, the following morphological information is obtained for each word: its superficial form (the word as it appears in the non-analyzed text), its lemma, and relevant morphological information such as part of speech and verb tense (where appropriate).
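
As an illustration only (this is not the tool's code), the per-word information described above could be modelled along the following lines in Python. The names AnalyzedWord and build_index are hypothetical, the toy in-memory dictionary merely stands in for a Lucene index, and the input is the sample phrase "Blair does not resign" analysed in this message, assumed here to be already analyzed.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class AnalyzedWord:
    """One word after morphological analysis (hypothetical record layout)."""
    surface: str  # superficial form, lowercased
    lemma: str
    tags: str     # morphological information, e.g. "vblex.inf"


def build_index(words):
    """Map (kind, value) to the list of word positions where it occurs.

    The three views of a word share one position, so a search can mix
    superficial forms, lemmas and tags for neighbouring words.
    """
    index = defaultdict(list)
    for position, word in enumerate(words):
        for kind in ("surface", "lemma", "tags"):
            index[(kind, getattr(word, kind))].append(position)
    return dict(index)


# Assumed pre-analyzed input for the phrase "Blair does not resign".
phrase = [
    AnalyzedWord("blair", "blair", "np.ant"),
    AnalyzedWord("does", "do", "vbdo.pri"),
    AnalyzedWord("not", "not", "adv"),
    AnalyzedWord("resign", "resign", "vblex.inf"),
]

index = build_index(phrase)
print(index[("lemma", "do")])        # [1]
print(index[("tags", "vblex.inf")])  # [3]
```

Storing all three views at the same position is a design choice of this sketch; it is what would let a single query refer to one word by its lemma and to the next by its tags.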

The following example illustrates the information stored in the index for the English phrase "Blair does not resign":

* "Blair"
  - Superficial form: blair
  - Lemma: blair
  - Morphological information: np.ant (proper noun, person name)

* "does"
  - Superficial form: does
  - Lemma: do
  - Morphological information: vbdo.pri (auxiliary verb, present tense)

* "not"
  - Superficial form: not
  - Lemma: not
  - Morphological information: adv (adverb)

* "resign"
  - Superficial form: resign
  - Lemma: resign
  - Morphological information: vblex.inf (verb, infinitive)

To search, the language accepted by the Lucene query parser can be used, provided that a WhitespaceAnalyzer is used. A query can specify different kinds of information; to that end, the following prefixes are used:

- "sf:" for the superficial form (e.g. "sf:resign")
- "lem:" for the lemma (e.g. "lem:resign")
- "tags:" for the morphological information (e.g. "tags:vblex.inf")

The following example illustrates the type of query that can be used to search for a specific document:

- Query: "tags:np.loc lem:airline sf:with lem:destination tags:np.loc"

This query searches for documents that mention one or more airlines flying from one place to another, for example "Argentine airlines with destination Madrid" or "British airlines with destination New York".

-- 
Felipe Sánchez Martínez <[EMAIL PROTECTED]>
Departamento de Lenguajes y Sistemas Informáticos
Universidad de Alicante, E-03071 Alicante (Spain)
Tel.: +34 965 903 400, ext: 2038 Fax: +34 965 909 326
http://www.dlsi.ua.es/~fsanchez