Ahh! I forgot about the "synonym" (for lack of a better word) part of the problem. Take 2: how about indexing unigram and bigram tokens in the same field? E.g. new NGramTokenizer(Reader, 1, 2)
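
To make the take-2 idea concrete, here is a minimal plain-Java sketch of which terms would end up in that one field: every word, plus every adjacent word pair written in alphabetical order (the "cat dog" convention Liat described). It uses no Lucene classes; PairTerms and unigramsAndOrderedBigrams are made-up names, and whitespace-separated, lowercased words are an assumption. One thing worth checking before relying on the tokenizer above: as far as I know the contrib NGramTokenizer builds character n-grams, so word-level pairs may need something like ShingleFilter instead.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Lucene: lists the terms one field would
// hold if single words and alphabetically ordered word pairs were indexed together.
final class PairTerms {
    static List<String> unigramsAndOrderedBigrams(String text) {
        String[] words = text.trim().toLowerCase(Locale.ROOT).split("\\s+");
        List<String> terms = new ArrayList<>();

        // Unigrams: a query on just "cat" can still match these.
        for (String w : words) {
            if (!w.isEmpty() && !terms.contains(w)) {
                terms.add(w);
            }
        }

        // Adjacent pairs, sorted internally so "dog cat" and "cat dog"
        // both produce the single term "cat dog".
        for (int i = 0; i + 1 < words.length; i++) {
            String a = words[i];
            String b = words[i + 1];
            if (a.equals(b)) {
                continue; // a word paired with itself carries no pair score
            }
            String pair = a.compareTo(b) < 0 ? a + " " + b : b + " " + a;
            if (!terms.contains(pair)) {
                terms.add(pair);
            }
        }
        return terms;
    }
}
```

With both kinds of term in the same field, a "cat dog" query can look up the pair term directly and fall back to the single-word terms for documents that only contain cat.
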
The PrefixQuery strategy should be slower, I think, because the "cat" -->
"cat dog" relationship is one-to-many, so there will be a lot of [bigram]
terms to iterate over (and a lot of redundant hits).

On Mon, Mar 16, 2009 at 3:36 PM, Grant Ingersoll <gsing...@apache.org> wrote:
> Yeah, I was going to suggest a combination of bi-grams and payloads and the
> BoostingTermQuery. There is an NGram TokenFilter in contrib/analysis that
> can do the bi-gram part, but the payloads would be extra.
>
> the piece I'm not sure about is how to handle the "synonyms" (they aren't
> really, but for lack of a better word), i.e. get when the query is "cat dog"
> also get those docs w/ just cat. You might be able to do something with a
> PrefixQuery on the n-grams or a separate field that doesn't do bigrams.
>
> Still, that feels like a stretch for some reason.
>
> -Grant
>
>
> On Mar 16, 2009, at 3:39 PM, Babak Farhang wrote:
>
>> Since you're configuring/writing your own analyzer, why not generate a
>> token stream that emits bi-grams? Sure, you're expanding the number of
>> terms in the index, so there's some overhead there. On the plus side,
>> however, your bi-grams, as you've described them, are ordered--which
>> reduces the potential # of bi-grams in your data set by a factor of
>> 1/2.
>>
>> -Babak
>>
>> Tangent: Liat's example brings up an interesting issue about n-grams,
>> namely that indexing only internally sorted n-grams is a good strategy
>> for economizing on the number of terms in an index of n-grams--by a
>> factor of 1/n!, I think. No?
>>
>> On Mon, Mar 16, 2009 at 4:55 AM, liat oren <oren.l...@gmail.com> wrote:
>>>
>>> Hi,
>>> Is there any idea of how to make it work?
>>> Many thanks,
>>> Liat
>>>
>>> 2009/3/9 liat oren <oren.l...@gmail.com>
>>>
>>>> I have an index that has for every two words a score.
>>>> I would like my analyzer - that is a combination of whitespace
>>>> tokenizer, a stop words analyzer and stemming.
>>>>
>>>> The regular score of Lucene takes into account the position of the
>>>> words.
>>>>
>>>> I would like to add another factor to that score which is these score
>>>> between words.
>>>> Instead of having score 0 to words that are not equal, I would like to
>>>> use this index in the calculation.
>>>>
>>>> Is it better explained?
>>>>
>>>> Thanks a lot,
>>>> Liat
>>>>
>>>> 2009/3/9 Grant Ingersoll <gsing...@apache.org>
>>>>
>>>>> Hmmm, I have some inklings of an idea, but can we take a step back?
>>>>> Can you explain the problem you are trying to solve at a higher level
>>>>> (instead of the current solution)? I imagine it is something related
>>>>> to co-occurrence analysis.
>>>>>
>>>>>
>>>>> On Mar 8, 2009, at 8:05 AM, liat oren wrote:
>>>>>
>>>>>> Hi Grant,
>>>>>>
>>>>>> No, you can only have two words - the score is between two words.
>>>>>>
>>>>>> "cat dog" and "dog cat" is equivalent, it will actually always be
>>>>>> "cat dog" - going by alphabetic order.
>>>>>>
>>>>>> About the boosting, I read a bit about it - but couldn't find how it
>>>>>> can help me, unless I change every appearance of the word dog to have
>>>>>> also cat and animal using the weight of the score.
>>>>>> So, for example, every word will appear 10 times from what it is - if
>>>>>> apple appears 1, I will do the boosting so it appears 10 times.
>>>>>> If dog appears, then it will also have cat twice (0.2*10) and animal
>>>>>> 5 times(0.5*10).
>>>>>>
>>>>>> But I hope to have another better solution.
>>>>>>
>>>>>>
>>>>>> Thanks
>>>>>> 2009/3/8 Grant Ingersoll <gsing...@apache.org>
>>>>>>
>>>>>>> Hi Liat,
>>>>>>>
>>>>>>> Some questions inline below.
>>>>>>>
>>>>>>> On Mar 8, 2009, at 5:49 AM, liat oren wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I have scores between words, for example - dog and animal have a
>>>>>>>> score of 0.5 (and not 0), dog and cat have a score of 0.2, etc.
>>>>>>>> These scores are stored in an index:
>>>>>>>> Doc1: field words: dog animal
>>>>>>>>       field score: 0.5
>>>>>>>> Doc2: field words: dog cat
>>>>>>>>       field score: 0.2
>>>>>>>>
>>>>>>>> If the user searches for the word dog - I would like that documents
>>>>>>>> that contain the word animal or cat will also get a good score
>>>>>>>> (that will take into account the 0.5 and 0.2).
>>>>>>>>
>>>>>>>
>>>>>>> Is it always the case that these come in pairs? In other words,
>>>>>>> would you ever have:
>>>>>>> field words: dog cat animal
>>>>>>> score: 0.9
>>>>>>>
>>>>>>> Also, is the following equivalent, or would it have a different
>>>>>>> score:
>>>>>>> field words: cat dog
>>>>>>> score: 0.2
>>>>>>>
>>>>>>>
>>>>>>>> Basically what I do is: for every document in the database, I loop
>>>>>>>> over the words that appear in the query (the query is long in a
>>>>>>>> size of an article) and for every word that appears in each
>>>>>>>> document I take the score from the index mentioned above and
>>>>>>>> calculating a score between the query and each document.
>>>>>>>>
>>>>>>>> Any suggestion how to do it using Lucene search? How to add these
>>>>>>>> values to the searcher?
>>>>>>>>
>>>>>>>
>>>>>>> Thinking...
>>>>>>>
>>>>>>>
>>>>>>>> I looked at the boosting option, but couldn't really see how it
>>>>>>>> helps me to that matter.
>>>>>>>>
>>>>>>>
>>>>>>> What "boosting option" did you look at? Can you explain a bit more?
>>>>>>>
>>>>>>> --------------------------
>>>>>>> Grant Ingersoll
>>>>>>> http://www.lucidimagination.com/
>>>>>>>
>>>>>>> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
>>>>>>> using Solr/Lucene:
>>>>>>> http://www.lucidimagination.com/search
>>>>>>>
>>>>>>> ---------------------------------------------------------------------
>>>>>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>>>>>>> For additional commands, e-mail: java-user-h...@lucene.apache.org