Hi, I am currently working on a Lucene module that makes use of controlled SKOS vocabularies (http://www.w3.org/TR/skos-primer/) at index and search time. It should work similarly to Lucene's Wordnet contrib module, but with some additional SKOS-specific functionality (e.g., support for broader and narrower relationships). Work is still very much in progress; first results are available here: https://code.google.com/p/lucene-skos/
My custom SKOSAnalyzer already performs synonym expansion based on the labels defined in a given SKOS model. My problem now is that real-world thesauri often define synonyms for multi-term words. Here is an example that defines the abbreviation "UN" as a synonym for "United Nations":

    <skos:Concept rdf:about="http://www.cs.univie.ac.at/thesaurus/concept/6">
      <skos:prefLabel>United Nations</skos:prefLabel>
      <skos:altLabel>UN</skos:altLabel>
    </skos:Concept>

In the end, the analyzer should add the term "UN" at the right position in the index. Taking the example above, the sentence "I work for the United Nations" should appear in the index as

    2: [work: 2->6]
    5: [united nations: 15->29] [un: 15->29]

...so that the query "I work for the UN" also matches the document.

What is the best way to implement this? With a TokenFilter I can work through the sentence token by token (using incrementToken()) and check whether a synonym is available for each single token. But how can I analyze sequences of tokens in a given text? Do I need to implement a custom tokenizer that recognizes entities based on a given dictionary?

I am grateful for any suggestions or advice.

Thank you,
Bernhard

______________________________________________________
Research Group Multimedia Information Systems
Department of Distributed and Multimedia Systems
Faculty of Computer Science
University of Vienna

Postal Address: Liebiggasse 4/3-4, 1010 Vienna, Austria
Phone: +43 1 42 77 39635
Fax: +43 1 4277 39649
E-Mail: bernhard.haslho...@univie.ac.at
WWW: http://www.cs.univie.ac.at/bernhard.haslhofer
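
P.S. To make the question more concrete, here is a rough sketch of the behaviour I am after, written in plain Java without any Lucene classes. It looks ahead over a window of tokens, prefers the longest phrase found in the dictionary, and emits the phrase and its alternative labels at the position of the first matched token. All names (MultiTermSynonymSketch, Token, add, expand) are made up for illustration, and I assume the input has already been lower-cased and stop-word filtered. Inside a real TokenFilter the same look-ahead would presumably require buffering tokens between incrementToken() calls, which is the part I am unsure about.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class MultiTermSynonymSketch {

    /** Term plus token position and character offsets (hypothetical holder, not a Lucene class). */
    public static final class Token {
        public final String term;
        public final int position;
        public final int start;
        public final int end;

        public Token(String term, int position, int start, int end) {
            this.term = term;
            this.position = position;
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return position + ": [" + term + ": " + start + "->" + end + "]";
        }
    }

    // Lower-cased phrase (words joined by one space) -> lower-cased alternative labels.
    private final Map<String, List<String>> synonyms = new HashMap<>();
    private int maxPhraseLength = 1;

    /** Registers a preferred label (possibly multi-term) and its alternative labels. */
    public void add(String prefLabel, String... altLabels) {
        String[] words = prefLabel.trim().toLowerCase().split("\\s+");
        List<String> alts = new ArrayList<>();
        for (String alt : altLabels) {
            alts.add(alt.trim().toLowerCase());
        }
        synonyms.put(String.join(" ", words), alts);
        maxPhraseLength = Math.max(maxPhraseLength, words.length);
    }

    /** Replaces each longest dictionary match by the phrase and its synonyms, all at the same position. */
    public List<Token> expand(List<Token> input) {
        List<Token> out = new ArrayList<>();
        int i = 0;
        while (i < input.size()) {
            String phrase = null;
            int matched = 0;
            // Try the longest window first so "united nations" wins over "united".
            for (int len = Math.min(maxPhraseLength, input.size() - i); len >= 1; len--) {
                StringBuilder sb = new StringBuilder();
                for (int k = 0; k < len; k++) {
                    if (k > 0) {
                        sb.append(' ');
                    }
                    sb.append(input.get(i + k).term);
                }
                if (synonyms.containsKey(sb.toString())) {
                    phrase = sb.toString();
                    matched = len;
                    break;
                }
            }
            if (matched == 0) {
                out.add(input.get(i)); // no dictionary entry starts here; pass the token through
                i++;
                continue;
            }
            Token first = input.get(i);
            Token last = input.get(i + matched - 1);
            // The phrase spans from the first token's start offset to the last token's end offset.
            out.add(new Token(phrase, first.position, first.start, last.end));
            for (String alt : synonyms.get(phrase)) {
                out.add(new Token(alt, first.position, first.start, last.end));
            }
            i += matched;
        }
        return out;
    }

    public static void main(String[] args) {
        MultiTermSynonymSketch sketch = new MultiTermSynonymSketch();
        sketch.add("United Nations", "UN");
        // "I work for the United Nations" after lower-casing and stop-word removal.
        List<Token> tokens = Arrays.asList(
                new Token("work", 2, 2, 6),
                new Token("united", 5, 15, 21),
                new Token("nations", 6, 22, 29));
        for (Token t : sketch.expand(tokens)) {
            System.out.println(t);
        }
    }
}
```

For the example sentence this prints "2: [work: 2->6]", "5: [united nations: 15->29]" and "5: [un: 15->29]", which is the index layout described above. Note that the sketch drops the single tokens "united" and "nations" once the phrase matches; whether they should be kept as well is an open design choice.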