Hi,

I am currently working on a Lucene module that makes use of controlled SKOS 
vocabularies (http://www.w3.org/TR/skos-primer/) during index and search time. 
It should work similarly to Lucene's Wordnet contrib module, just with some 
extended SKOS-specific functionality (e.g., support for broader & narrower 
relationships). Work is still very much in progress; first results are 
available here: https://code.google.com/p/lucene-skos/

My custom SKOSAnalyzer already performs synonym expansion based on the labels 
defined in a given SKOS model. But now I have the problem that real-world 
thesauri often define (multi-term) synonyms for multi-term words. Here is an 
example that defines the abbreviation "UN" as a synonym for "United Nations":

<skos:Concept rdf:about="http://www.cs.univie.ac.at/thesaurus/concept/6">
      <skos:prefLabel>United Nations</skos:prefLabel>
      <skos:altLabel>UN</skos:altLabel>
</skos:Concept>

In the end, the analyzer should add the term UN at the right position in the 
index. Taking the example above, the sentence "I work for the United Nations" 
should appear in the index as:

2: [work: 2-> 6]
5: [united nations: 15->29] [un: 15->29]

...so that a query "I work for the UN" also matches the document.

What is the best way to implement that? With a TokenFilter I can work 
through the sentence token by token (using incrementToken()) and check if there 
is a synonym available. How can I analyze token sequences in a given text? Do I 
need to implement a custom tokenizer that recognizes entities based on a given 
dictionary?
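
To make the question concrete, the behaviour I am after can be sketched in 
plain Java, without any Lucene classes. This is only an illustration under 
assumptions: PhraseSynonymSketch, Tok, and expand are hypothetical names, the 
dictionary is a simple map from a lowercased phrase to a single synonym, and 
lowercasing and stop-word removal are assumed to have happened already. It 
looks ahead over a window of tokens, prefers the longest phrase found in the 
dictionary, and emits the synonym with a position increment of 0 so that it 
shares the position of the phrase.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class PhraseSynonymSketch {

    /** Hypothetical token holder: text, position increment, character offsets. */
    static final class Tok {
        final String text;
        final int posInc;
        final int start;
        final int end;

        Tok(String text, int posInc, int start, int end) {
            this.text = text;
            this.posInc = posInc;
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return "[" + text + ": +" + posInc + ", " + start + "->" + end + "]";
        }
    }

    /**
     * Greedy longest-match phrase lookup over a token list.
     * A matched phrase is merged into one token, followed by its synonym
     * with position increment 0 and the same offsets.
     */
    static List<Tok> expand(List<Tok> in, Map<String, String> phraseToSynonym, int maxPhraseLen) {
        List<Tok> out = new ArrayList<>();
        int i = 0;
        while (i < in.size()) {
            int matched = 0;
            String key = null;
            // Try the longest window first, down to two-token phrases.
            for (int len = Math.min(maxPhraseLen, in.size() - i); len >= 2; len--) {
                StringBuilder sb = new StringBuilder();
                for (int k = i; k < i + len; k++) {
                    if (k > i) sb.append(' ');
                    sb.append(in.get(k).text);
                }
                if (phraseToSynonym.containsKey(sb.toString())) {
                    matched = len;
                    key = sb.toString();
                    break;
                }
            }
            if (matched == 0) {
                out.add(in.get(i)); // no phrase starts here, pass the token through
                i++;
                continue;
            }
            Tok first = in.get(i);
            Tok last = in.get(i + matched - 1);
            out.add(new Tok(key, first.posInc, first.start, last.end));
            out.add(new Tok(phraseToSynonym.get(key), 0, first.start, last.end));
            i += matched;
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> dict = new HashMap<>();
        dict.put("united nations", "un");
        // "I work for the United Nations" with "I", "for", "the" already dropped;
        // the increments account for the removed stop words.
        List<Tok> tokens = Arrays.asList(
                new Tok("work", 2, 2, 6),
                new Tok("united", 3, 15, 21),
                new Tok("nations", 1, 22, 29));
        for (Tok t : expand(tokens, dict, 2)) {
            System.out.println(t);
        }
    }
}
```

Summing the increments in this example gives position 2 for "work" and 
position 5 for both "united nations" and "un", which is the index layout 
shown above. Note that merging the phrase into one token shifts the positions 
of any tokens that follow it; whether that is acceptable is part of what I am 
unsure about.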

I am grateful for any suggestions or advice.

Thank you,

Bernhard




______________________________________________________
Research Group Multimedia Information Systems
Department of Distributed and Multimedia Systems
Faculty of Computer Science
University of Vienna

Postal Address: Liebiggasse 4/3-4, 1010 Vienna, Austria
Phone: +43 1 42 77 39635 Fax: +43 1 4277 39649
E-Mail: bernhard.haslho...@univie.ac.at
WWW: http://www.cs.univie.ac.at/bernhard.haslhofer

