Since you're configuring/writing your own analyzer, why not generate a token stream that emits bi-grams? Sure, you're expanding the number of terms in the index, so there's some overhead there. On the plus side, however, your bi-grams, as you've described them, are ordered--which reduces the potential # of bi-grams in your data set by a factor of 1/2.
-Babak Tangent: Liat's example brings up an interesting issue about n-grams, namely that indexing only internally sorted n-grams is a good strategy for economizing on the number of terms in an index of n-grams--by a factor of 1/n!, I think. No? On Mon, Mar 16, 2009 at 4:55 AM, liat oren <oren.l...@gmail.com> wrote: > Hi, > Is there any idea of how to make it work? > Many thanks, > Liat > > 2009/3/9 liat oren <oren.l...@gmail.com> > >> I have an index that has for every two words a score. >> I would like my analyzer - that is a combination of whitespace tokenizer, a >> stop words analyzer and stemming. >> >> The regular score of Lucene takes into account the position of the words. >> >> I would like to add another factor to that score which is these score >> between words. >> Instead of having score 0 to words that are not equal, I would like to use >> this index in the calculation. >> >> Is it better explained? >> >> Thanks a lot, >> Liat >> >> 2009/3/9 Grant Ingersoll <gsing...@apache.org> >> >> Hmmm, I have some inklings of an idea, but can we take a step back? Can >>> you explain the problem you are trying to solve at a higher level (instead >>> of the current solution)? I imagine it is something related to >>> co-occurrence analysis. >>> >>> >>> >>> On Mar 8, 2009, at 8:05 AM, liat oren wrote: >>> >>> Hi Grant, >>>> >>>> No, you can only have two words - the score is between two words. >>>> >>>> "cat dog" and "dog cat" is equivalent, it will actually always be "cat >>>> dog" >>>> - going by alphabetic order. >>>> >>>> About the boosting, I read a bit about it - but couldn't find how it can >>>> help me, unless I change every appearance of the word dog to have also >>>> cat >>>> and animal using the weight of the score. >>>> So, for example, every word will appear 10 times from what it is - if >>>> apple >>>> appears 1, I will do the boosting so it appears 10 times. >>>> If dog appears, then it will also have cat twice (0.2*10) and animal 5 >>>> times(0.5*10). >>>> >>>> But I hope to have another better solution. >>>> >>>> >>>> Thanks >>>> 2009/3/8 Grant Ingersoll <gsing...@apache.org> >>>> >>>> Hi Liat, >>>>> >>>>> Some questions inline below. >>>>> >>>>> On Mar 8, 2009, at 5:49 AM, liat oren wrote: >>>>> >>>>> Hi, >>>>> >>>>>> >>>>>> I have scores between words, for example - dog and animal have a score >>>>>> of >>>>>> 0.5 (and not 0), dog and cat have a score of 0.2, etc. >>>>>> These scores are stored in an index: >>>>>> Doc1: field words: dog animal >>>>>> field score: 0.5 >>>>>> Doc2: field words: dog cat >>>>>> field score: 0.2 >>>>>> >>>>>> If the user searches for the word dog - I would like that documents >>>>>> that >>>>>> contain the word animal or cat will also get a good score (that will >>>>>> take >>>>>> into account the 0.5 and 0.2). >>>>>> >>>>>> >>>>> Is it always the case that these come in pairs? In other words, would >>>>> you >>>>> ever have: >>>>> field words: dog cat animal >>>>> score: 0.9 >>>>> >>>>> Also, is the following equivalent, or would it have a different score: >>>>> field words: cat dog >>>>> score: 0.2 >>>>> >>>>> >>>>> >>>>> >>>>>> Basically what I do is: for every document in the database, I loop over >>>>>> the >>>>>> words that appear in the query (the query is long in a size of an >>>>>> article) >>>>>> and for every word that appears in each document I take the score from >>>>>> the >>>>>> index mentioned above and calculating a score between the query and >>>>>> each >>>>>> document. >>>>>> >>>>>> Any suggestion how to do it using Lucene search? How to add these >>>>>> values >>>>>> to >>>>>> the searcher? >>>>>> >>>>>> >>>>> Thinking... >>>>> >>>>> >>>>> >>>>>> I looked at the boosting option, but couldn't really see how it helps >>>>>> me >>>>>> to >>>>>> that matter. >>>>>> >>>>>> >>>>> What "boosting option" did you look at? Can you explain a bit more? >>>>> >>>>> >>>>> -------------------------- >>>>> Grant Ingersoll >>>>> http://www.lucidimagination.com/ >>>>> >>>>> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using >>>>> Solr/Lucene: >>>>> http://www.lucidimagination.com/search >>>>> >>>>> >>>>> --------------------------------------------------------------------- >>>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org >>>>> For additional commands, e-mail: java-user-h...@lucene.apache.org >>>>> >>>>> >>>>> >>> >>> --------------------------------------------------------------------- >>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org >>> For additional commands, e-mail: java-user-h...@lucene.apache.org >>> >>> >> > --------------------------------------------------------------------- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org