Linguistically-enhanced indexing for Lucene

fsanchez Tue, 12 Feb 2008 03:26:18 -0800

The Transducens Group (http://transducens.dlsi.ua.es) at the University
of Alicante (http://www.ua.es) has developed a tool that
allows the Lucene search engine to use morphological information
while indexing and then process smarter queries in which
morphological attributes can be used to specify query terms.


To that end, the tool makes use of morphological analyzers and
dictionaries developed for the open-source machine translation platform
Apertium (http://apertium.org) and, optionally, the part-of-speech
taggers developed for it. Currently there are morphological 
dictionaries available for Spanish, Catalan, Galician, Portuguese, 
Aranese, Romanian, French and English. In addition, new dictionaries
are being developed for Esperanto, Occitan, Basque, Swedish, Danish,
Welsh, Polish and Italian, among others; we hope more language pairs
will be added to the Apertium machine translation platform in the
near future.

We are interested in releasing this tool as open source, and we think
that the best way to do that would be to integrate it into Lucene's
contrib folder, like other third-party tools. Who is responsible
for that? To whom should we address this request?

Thank you very much.

========================== How it works ==========================

Indexing documents through this new framework involves the following
steps: 

1. The texts to index must be analyzed using the morphological analyzer
and (optionally) the part-of-speech taggers of the Apertium machine
translation platform. Apertium supports files in plain text, rtf, odt,
sxw, html and doc.

2. The documents are then indexed as usual, using a Lucene analyzer
developed ad hoc to correctly interpret the previously analyzed
documents.

During indexing, the following information is obtained for each word:
its superficial form (the word as it appears in the non-analyzed
text), its lemma, and relevant morphological information such as
part of speech and verb tense (where appropriate). The following example
shows what information is stored in the index for the English phrase
"Blair does not resign":

* "Blair" 
   - Superficial form: blair
   - Lemma: blair
   - Morphological information: np.ant (proper noun, person name)

* "does" 
   - Superficial form: does
   - Lemma: do
   - Morphological information: vbdo.pri (auxiliary verb, present tense)

* "not" 
   - Superficial form: not
   - Lemma: not
   - Morphological information: adv (adverb) 

* "resign" 
   - Superficial form: resign 
   - Lemma: resign 
   - Morphological information: vblex.inf (verb, infinitive)
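
As a rough illustration of the per-word record described above, the
following Python sketch turns analyzer output into (sf, lem, tags)
triples. It assumes the input is in the Apertium stream format
(^surface/lemma<tag1><tag2>$); the post does not say exactly what the
tool's Lucene analyzer consumes, so the Token type, the parse_analysis
function and the hand-written sample string are all hypothetical. When a
word has several analyses, this sketch keeps only the first one.

```python
import re
from typing import List, NamedTuple


class Token(NamedTuple):
    sf: str    # superficial form, lowercased
    lem: str   # lemma, lowercased
    tags: str  # dot-joined morphological tags, e.g. "vblex.inf"


# One lexical unit: ^surface/lemma<tag1><tag2>$ (further analyses are skipped).
UNIT = re.compile(r"\^([^/$]+)/([^<$/]+)((?:<[^>]+>)*)[^$]*\$")


def parse_analysis(text: str) -> List[Token]:
    """Extract one Token per lexical unit, keeping only the first analysis."""
    tokens = []
    for surface, lemma, raw_tags in UNIT.findall(text):
        tags = ".".join(re.findall(r"<([^>]+)>", raw_tags))
        tokens.append(Token(surface.lower(), lemma.lower(), tags))
    return tokens


# Hand-written sample covering three of the words above (an assumption,
# not actual output captured from the tool).
sample = "^Blair/Blair<np><ant>$ ^does/do<vbdo><pri>$ ^resign/resign<vblex><inf>$"

for token in parse_analysis(sample):
    print(token)
```

Each Token field corresponds to one of the three pieces of information
listed for every word in the example.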

To search, the query language accepted by the Lucene query parser can be
used, provided that a WhitespaceAnalyzer is used. A query can specify
different kinds of information by means of the following prefixes:
- "sf:" for the superficial form (e.g. "sf:resign")
- "lem:" for the lemma (e.g. "lem:resign")
- "tags:" for the morphological information (e.g. "tags:vblex.inf")
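
To make the prefix convention concrete, here is a small Python sketch
that splits a query on whitespace and separates each term into a field
and a value. In Lucene this job is done by the query parser's own
field:term syntax; the parse_query function and its default_field
parameter are hypothetical and only mimic that behaviour.

```python
from typing import List, Tuple

# The three prefixes named in the post.
FIELDS = {"sf", "lem", "tags"}


def parse_query(query: str, default_field: str = "sf") -> List[Tuple[str, str]]:
    """Split a query on whitespace and return (field, value) pairs.

    Terms without a known prefix fall back to default_field, which is an
    assumption of this sketch rather than documented behaviour of the tool.
    """
    terms = []
    for chunk in query.split():
        field, sep, value = chunk.partition(":")
        if sep and field in FIELDS:
            terms.append((field, value))
        else:
            terms.append((default_field, chunk))
    return terms


print(parse_query("lem:resign tags:vblex.inf"))
```

Splitting only on whitespace keeps tag values such as "vblex.inf" in one
piece, which is presumably why a WhitespaceAnalyzer is required.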

The following example illustrates the type of query that can be used
to search for a specific document:

- Query: "tags:np.loc lem:airline sf:with lem:destination tags:np.loc"

This query searches for documents that mention one or more airlines
flying from one place to another, for example "Argentine airlines with
destination Madrid" or "British airlines with destination New York".
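
One way to read the example query is as a pattern in which each term
constrains one word of a consecutive run. The Python sketch below
follows that reading; it is an assumption made for illustration, not a
description of how Lucene or the tool evaluates the query. The helper
names and the tags assigned to the sample words are hand-written and
hypothetical.

```python
from typing import Dict, List, Tuple

Word = Dict[str, str]   # one indexed word: keys "sf", "lem", "tags"
Term = Tuple[str, str]  # one query term: (field, value)


def parse_terms(query: str) -> List[Term]:
    """Split 'field:value' chunks on whitespace."""
    terms = []
    for chunk in query.split():
        field, _, value = chunk.partition(":")
        terms.append((field, value))
    return terms


def find_matches(words: List[Word], terms: List[Term]) -> List[int]:
    """Return start positions where every term matches consecutive words."""
    if not terms:
        return []
    hits = []
    for start in range(len(words) - len(terms) + 1):
        window = words[start:start + len(terms)]
        if all(w.get(field) == value for w, (field, value) in zip(window, terms)):
            hits.append(start)
    return hits


def word(sf: str, lem: str, tags: str) -> Word:
    return {"sf": sf, "lem": lem, "tags": tags}


# "Argentine airlines with destination Madrid", with illustrative tags.
doc = [
    word("argentine", "argentine", "np.loc"),
    word("airlines", "airline", "n.pl"),
    word("with", "with", "pr"),
    word("destination", "destination", "n.sg"),
    word("madrid", "madrid", "np.loc"),
]

query = "tags:np.loc lem:airline sf:with lem:destination tags:np.loc"
print(find_matches(doc, parse_terms(query)))
```

Because "airlines" is matched through its lemma and the place names
through their tags, the same query matches any pair of locations
without listing them explicitly.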


-- 
Felipe Sánchez Martínez <[EMAIL PROTECTED]>
Departamento de Lenguajes y Sistemas Informáticos
Universidad de Alicante, E-03071 Alicante (Spain)
Tel.: +34 965 903 400, ext: 2038 Fax: +34 965 909 326
http://www.dlsi.ua.es/~fsanchez


