Thanks Chuck, I have to try it with example(s).

Use case one:

Documents:
D1 == "John Doe"
D2 == "sky scraper"
D3 == "blue sky LTD"

Imagine the name "John" is ultra frequent => low IDF weight, and "sky" is super low frequency => very high weight.

So the query
Q: "sky john"
will give the order: D2, D3, D1

Also imagine I know (external knowledge) that "John" is a personal name and that its "importance" in the Similarity calculation should be corrected by some boost because of this fact. So what I do today is look it up in a Dictionary Map where I attach a boost to this token (reformulating the query to "sky john^250").

What I was proposing is to be able to attach this boost (practically an IDF correction for some tokens) to tokens during indexing. With this, I could spare one lookup in a memory-hungry Dictionary and the reformulation of the Query.

This example case is just an introduction to the idea. It is over-simplified and could be solved by indexing the same token many times at the same position. If attaching boosts at indexing time were possible, things like SweetSpotSimilarity could be done as an optional offline task (adjusting the IDF curve).

The second "problem", storing semantic TAGS per token, looks definitely doable with your proposal, but I am having problems comprehending all the nitty-gritty details (performance impact and expressive power), as I have never tried those parts of Lucene. The question: since we are accessing the Term from the Lexicon anyhow for searching purposes (postings offset, freq), would it not be faster to attach this TAG info to the Term?

The third issue I mentioned only briefly. The use case where the Lexicon can be loaded completely in memory (not an unusual case these days) gives us some space to play with FuzzyQuery and make it really useful in terms of speed. I guess there could also be other implementations that work on disk as well. We currently deal with a collection of ca. 50 million docs (short documents), and all terms fit nicely in memory in a TernarySearchTree that allows us to issue Term lookups like "give me all Terms that have at most N edits"; then we run our hand-tuned Needleman-Wunsch (different costs for substitutions, as in "hitec" vs "hitek"...)... I would say it is a nice feature for people with reasonably sized collections. A better way of doing it would be to have the possibility for our implementation of the Dictionary to implement a Lucene interface "Lexicon", which would provide Lucene with the postings offset or whatever else Lucene needs when you search for a Term.

Lucene today is great; this here is just "could we do better", not "can someone scratch my itch".

----- Original Message ----
From: Chuck Williams <[EMAIL PROTECTED]>
To: java-dev@lucene.apache.org
Sent: Thursday, 1 June, 2006 7:05:27 PM
Subject: Re: Lexicon access questions

This approach comes to mind. You could model your semantic tags as tokens and index them at the same positions as the words or phrases to which they apply. This is particularly easy if you can integrate your taggers with your Analyzer. You would probably want to create one or more new Query subclasses to facilitate certain types of matching, making it easy to associate terms/phrases with different tags (e.g., OverlappingQuery).

This approach would support generation of queries that are tag-dependent, but would not directly help using tags in a ranking algorithm for tag-independent queries. As an off-hand thought, you might be able to extend the idea to support this by naming your tags something like TERM_TAG, where TERM is the term they apply to (best if the character used for '_' cannot occur in any term). Then something like a TaggedTermQuery could easily find the tags relevant to a term in the query and iterate their docs/positions in parallel with those of the term (roughly equivalent to OverlappingQuery(term, PrefixQuery(term_*))).
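
To make the TERM_TAG idea a bit more concrete, here is a rough sketch in plain Python. It is not Lucene code and does not use any Lucene API; it only mimics a positional inverted index. The function names (index_documents, tags_for_term, tagged_positions), the separator character, and the tag dictionary are all made up for illustration, and the three documents are borrowed from the example in the reply above.

```python
from collections import defaultdict

# Assumed separator: a control character that should never occur inside a term.
TAG_SEP = "\x1f"


def index_documents(docs, tagger):
    """Build a toy positional index: token -> {doc_id: [positions]}.

    For every word the (hypothetical) tagger knows, an extra token
    TERM<sep>TAG is indexed at the same position as the word itself.
    """
    postings = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            postings[word][doc_id].append(pos)
            for tag in tagger.get(word, ()):
                postings[word + TAG_SEP + tag][doc_id].append(pos)
    return postings


def tags_for_term(postings, term):
    """Prefix scan over the lexicon, standing in for PrefixQuery(term_*)."""
    prefix = term + TAG_SEP
    return sorted(t[len(prefix):] for t in postings if t.startswith(prefix))


def tagged_positions(postings, term, tag):
    """Return {doc_id: positions} where `term` and its `tag` token overlap."""
    term_hits = postings.get(term, {})
    tag_hits = postings.get(term + TAG_SEP + tag, {})
    result = {}
    for doc_id, positions in term_hits.items():
        overlap = sorted(set(positions) & set(tag_hits.get(doc_id, ())))
        if overlap:
            result[doc_id] = overlap
    return result


# Hypothetical data: documents from the example above plus a tiny tag dictionary.
docs = {"D1": "John Doe", "D2": "sky scraper", "D3": "blue sky LTD"}
tagger = {"john": ["FIRSTNAME"], "doe": ["LASTNAME"]}
postings = index_documents(docs, tagger)

print(tags_for_term(postings, "john"))
print(tagged_positions(postings, "john", "FIRSTNAME"))
```

A ranking function could then call something like tags_for_term at query time to decide on a boost, without a separate Term->tag dictionary. Whether a prefix scan over the real term dictionary is cheap enough for that would have to be measured.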

Top-of-mind thoughts,

Chuck

eks dev wrote on 06/01/2006 12:10 AM:
> We have faced the following use case:
>
> In order to optimize performance and more importantly quality of search
> results we are forced to attach more attributes to particular words (Terms).
> Generic attributes like TF, IDF are useful to model our "similarity" only up
> to some level.
>
> Examples:
> 1. Is a Term a first or last name (e.g. we have a comprehensive list of such
> words). This enables us to make smarter (faster and better) queries in case
> someone has multiple first names, it influences ranking...
> 2. Agreement weight and Disagreement weight of some words are modelled
> differently.
> 3. Semantic classes of words influence ranking (whether something is a verb
> or a noun changes search strategy and ranking radically)
>
> On top of that, we can afford to load all terms in memory, in order to allow
> fast string distance calculations and some limited pattern matching using
> some strange Trie-s.
>
> Today, we solve these things by implementing totally redundant data
> structures that keep some kind of map Term->ValuesObject, which is redundant
> to Lucene Lexicon storage. Instead of "one access gets all" we have two
> accesses to terms using two different access paths, once using our dictionary
> and a second time implicitly via Query or so... So we introduce
> performance/memory penalties. (Pls. do not forget, we need to access a copy of
> the analyzed document in order to attach "additional info" to Terms)
>
> I guess we are not the only ones to face such a case, as an increase in
> precision above TF/IDF can only be achieved by introducing some "domain
> semantics" where available. For this, "attaching" domain-specific info to
> a Term would be a perfect solution. Also, enabling flexible implementations
> for Lexicon access could give us some flexibility (e.g. the implementation in
> mg4j goes in that direction)
>
> Could somebody imagine a 2.x version of Lucene having some Interface, to be
> implemented against a clear contract, that would enable us to attach
> our own implementation for accessing the lexicon?
>
> Or even better, some hints how I can do it today :)
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]