Lexicon access questions

eks dev Thu, 01 Jun 2006 03:10:52 -0700

We have faced the following use case:

To improve performance and, more importantly, the quality of search results, 
we are forced to attach more attributes to particular words (Terms). Generic 
attributes like TF and IDF are useful for modelling our "similarity" only up to 
a point.


Examples:
1. Whether a Term is a first or last name (we have a comprehensive list of such 
words). This lets us build smarter (faster and better) queries when someone 
has multiple first names, and it influences ranking.
2. The agreement weight and disagreement weight of some words are modelled 
differently. 
3. Semantic classes of words influence ranking (whether something is a verb or 
a noun changes search strategy and ranking radically).

On top of that, we can afford to load all terms in memory, in order to allow 
fast string distance calculations and some limited pattern matching using some 
strange tries. 
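
As a rough illustration of the in-memory idea, here is a minimal sketch that 
assumes the terms are held as a plain list of strings and compared with a 
standard Levenshtein distance. The class InMemoryTerms and its methods are 
made-up names for this example, not Lucene API, and a linear scan stands in 
for the tries mentioned above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of Lucene): holds all term strings in memory
// and finds the ones within a given edit distance of a query string.
final class InMemoryTerms {
    private final List<String> terms;

    InMemoryTerms(List<String> terms) {
        this.terms = new ArrayList<>(terms);
    }

    // Classic Levenshtein distance using two rolling rows.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Linear scan over all terms; returns matches in insertion order.
    List<String> within(String query, int maxDistance) {
        List<String> hits = new ArrayList<>();
        for (String term : terms) {
            if (editDistance(query, term) <= maxDistance) {
                hits.add(term);
            }
        }
        return hits;
    }
}
```

With the terms "john", "jon", "joan" and "mary" loaded, within("jon", 1) 
would return the first three and skip "mary".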

Today we solve these things by implementing separate data structures that keep 
some kind of map Term->ValuesObject, which is redundant to Lucene's Lexicon 
storage. Instead of "one access gets all", we access each term twice over two 
different access paths: once through our dictionary and a second time 
implicitly via Query or similar. So we introduce performance and memory 
penalties. (Please do not forget that we need to access a copy of the analyzed 
document in order to attach the "additional info" to Terms.)
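
To make the workaround concrete, here is a minimal sketch of such a side map, 
assuming terms are keyed by their text. TermInfo, SideLexicon and all field 
names are hypothetical and only mirror the attributes from the examples above; 
this is not our actual code and not Lucene API.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical value object with the extra per-term attributes.
final class TermInfo {
    final boolean firstName;
    final boolean lastName;
    final double agreementWeight;
    final double disagreementWeight;

    TermInfo(boolean firstName, boolean lastName,
             double agreementWeight, double disagreementWeight) {
        this.firstName = firstName;
        this.lastName = lastName;
        this.agreementWeight = agreementWeight;
        this.disagreementWeight = disagreementWeight;
    }
}

// Hypothetical side dictionary kept next to the index. Every term text
// stored here is also stored in the index lexicon, hence the redundancy.
final class SideLexicon {
    private final Map<String, TermInfo> byTermText = new HashMap<>();

    void put(String termText, TermInfo info) {
        byTermText.put(termText, info);
    }

    // Returns null when no extra attributes are known for the term.
    TermInfo get(String termText) {
        return byTermText.get(termText);
    }

    int size() {
        return byTermText.size();
    }
}
```

Each lookup in this map is the first access path; the index then resolves the 
same term text again when the query runs, which is the second.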

I guess we are not the only ones to face such a case, as an increase in 
precision beyond TF/IDF can only be achieved by introducing some "domain 
semantics" where available. For this, "attaching" domain-specific info to a 
Term would be the perfect solution. Also, allowing pluggable implementations 
of Lexicon access would give us some flexibility (e.g. the implementation in 
mg4j goes in that direction).

Could somebody imagine a 2.x version of Lucene having an interface with a 
clear contract that we could implement in order to plug in our own lexicon 
access? 
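
Just to show the kind of contract I have in mind, here is a purely 
hypothetical sketch. No such interface exists in Lucene; the names Lexicon, 
SortedMapLexicon, attributes and termsWithPrefix are invented for this 
example, and the TreeMap-backed class is only one possible implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Purely hypothetical contract: one access path that returns both the term
// and whatever domain-specific attributes (type A) are attached to it.
interface Lexicon<A> {
    // Attributes attached to the term, or null if the term is unknown.
    A attributes(String termText);

    // All terms starting with the given prefix, in sorted order.
    List<String> termsWithPrefix(String prefix);

    int size();
}

// One possible in-memory implementation backed by a sorted map.
final class SortedMapLexicon<A> implements Lexicon<A> {
    private final TreeMap<String, A> terms = new TreeMap<>();

    void add(String termText, A attrs) {
        terms.put(termText, attrs);
    }

    @Override
    public A attributes(String termText) {
        return terms.get(termText);
    }

    @Override
    public List<String> termsWithPrefix(String prefix) {
        List<String> out = new ArrayList<>();
        // Terms sharing a prefix are contiguous in sorted order, so start
        // at the prefix and stop at the first term that no longer matches.
        for (String term : terms.tailMap(prefix, true).keySet()) {
            if (!term.startsWith(prefix)) {
                break;
            }
            out.add(term);
        }
        return out;
    }

    @Override
    public int size() {
        return terms.size();
    }
}
```

The point of the type parameter is that the attribute object stays entirely 
application-defined, while the index only needs to know how to look it up.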

Or even better, some hints on how I can do it today :)



