Re: Scores between words. Boosting?

Grant Ingersoll Mon, 16 Mar 2009 14:37:06 -0700

Yeah, I was going to suggest a combination of bi-grams and payloads and the BoostingTermQuery. There is an NGram TokenFilter in contrib/analysis that can do the bi-gram part, but the payloads would be extra.
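
As a rough illustration of the bi-gram plus payload idea, here is a minimal, library-independent Python sketch. It does not use any Lucene API: the `PAIR_SCORES` table, the function names and the 4-byte float encoding are all assumptions made for the example. Each adjacent word pair is sorted alphabetically (so "dog cat" and "cat dog" give the same term) and carries its score as a small byte payload.

```python
import struct

# Hypothetical pair scores, using the example values from this thread.
PAIR_SCORES = {("animal", "dog"): 0.5, ("cat", "dog"): 0.2}


def bigrams_with_payloads(tokens, pair_scores, default=0.0):
    """Yield (term, payload) for each pair of adjacent words.

    The two words are sorted so that word order does not matter, and the
    pair score is packed into 4 bytes, roughly as a payload would carry it.
    Pairs missing from the table get the (assumed) default score.
    """
    for left, right in zip(tokens, tokens[1:]):
        pair = tuple(sorted((left, right)))
        score = pair_scores.get(pair, default)
        yield " ".join(pair), struct.pack(">f", score)


def decode_payload(payload):
    """Unpack the 4-byte big-endian float written by bigrams_with_payloads."""
    return struct.unpack(">f", payload)[0]


terms = list(bigrams_with_payloads(["dog", "cat", "animal"], PAIR_SCORES))
```

In an actual index, the payload would be attached by a custom TokenFilter and read back at scoring time; that wiring is not shown here.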

The piece I'm not sure about is how to handle the "synonyms" (they aren't really, but for lack of a better word), i.e. when the query is "cat dog", also get those docs w/ just cat. You might be able to do something with a PrefixQuery on the n-grams or a separate field that doesn't do bigrams.


Still, that feels like a stretch for some reason.

-Grant


On Mar 16, 2009, at 3:39 PM, Babak Farhang wrote:

Since you're configuring/writing your own analyzer, why not generate a
token stream that emits bi-grams? Sure, you're expanding the number of
terms in the index, so there's some overhead there.  On the plus side,
however, your bi-grams, as you've described them, are ordered--which
reduces the potential # of bi-grams in your data set by a factor of
1/2.

-Babak

Tangent: Liat's example brings up an interesting issue about n-grams,
namely that indexing only internally sorted n-grams is a good strategy
for economizing on the number of terms in an index of n-grams--by a
factor of 1/n!, I think.  No?
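
A quick way to check the 1/n! estimate is to count both kinds of n-grams over a toy vocabulary. This is only a sketch: the vocabulary is made up, and it assumes every n-gram consists of n distinct words (with repeated words such as "dog dog" the ratio is no longer exactly 1/n!).

```python
from itertools import combinations, permutations
from math import factorial

# Toy vocabulary, made up for the example.
vocab = ["animal", "apple", "cat", "dog"]


def count_ngrams(vocab, n):
    """Return (ordered, sorted_only) counts of n-grams of distinct words."""
    ordered = sum(1 for _ in permutations(vocab, n))      # word order matters
    sorted_only = sum(1 for _ in combinations(vocab, n))  # one canonical order
    return ordered, sorted_only


bigram_counts = count_ngrams(vocab, 2)
trigram_counts = count_ngrams(vocab, 3)

for n, (ordered, sorted_only) in ((2, bigram_counts), (3, trigram_counts)):
    # Every sorted n-gram stands in for n! orderings of the same words.
    assert ordered == sorted_only * factorial(n)
```

For n = 2 this gives the factor of 1/2 mentioned above for bi-grams.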

On Mon, Mar 16, 2009 at 4:55 AM, liat oren <oren.l...@gmail.com> wrote:
Hi,
Is there any idea of how to make it work?
Many thanks,
Liat

2009/3/9 liat oren <oren.l...@gmail.com>
I have an index that stores a score for every pair of words.
My analyzer is a combination of a whitespace tokenizer, a stop words
analyzer and stemming.
The regular Lucene score takes into account the position of the words.
I would like to add another factor to that score: these scores between
words.
Instead of giving a score of 0 to words that are not equal, I would like
to use this index in the calculation.

Is it better explained?

Thanks a lot,
Liat

2009/3/9 Grant Ingersoll <gsing...@apache.org>
Hmmm, I have some inklings of an idea, but can we take a step back? Can
you explain the problem you are trying to solve at a higher level (instead
of the current solution)?  I imagine it is something related to
co-occurrence analysis.



On Mar 8, 2009, at 8:05 AM, liat oren wrote:

Hi Grant,
No, you can only have two words - the score is between two words.
"cat dog" and "dog cat" are equivalent; it will actually always be
"cat dog", going by alphabetic order.
About the boosting, I read a bit about it, but couldn't find how it can
help me, unless I change every appearance of the word dog to also have
cat and animal, using the weight of the score.
So, for example, every word will appear 10 times as often as it really
does: if apple appears once, I will do the boosting so it appears 10
times.
If dog appears, then it will also have cat twice (0.2*10) and animal 5
times (0.5*10).
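
The repetition scheme just described can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not Lucene code; the `RELATED` table and the function name are assumptions, and only the dog entries from the example are filled in (a symmetric table would also map cat and animal back to dog).

```python
from collections import Counter

# Hypothetical related-word scores, using the example values from this thread.
RELATED = {"dog": {"cat": 0.2, "animal": 0.5}}
SCALE = 10  # each real occurrence of a word is repeated this many times


def expand_terms(tokens, related, scale=SCALE):
    """Return term counts after repeating words and adding related words."""
    counts = Counter()
    for token in tokens:
        counts[token] += scale
        # A related word is added score * scale times, e.g. 0.2 * 10 = 2.
        for other, score in related.get(token, {}).items():
            counts[other] += round(score * scale)
    return counts


expanded = expand_terms(["apple", "dog"], RELATED)
```

For a document containing "apple dog" this yields apple and dog 10 times each, animal 5 times and cat twice, at the cost of a much larger index.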

But I hope to have another better solution.


Thanks

2009/3/8 Grant Ingersoll <gsing...@apache.org>

Hi Liat,
Some questions inline below.

On Mar 8, 2009, at 5:49 AM, liat oren wrote:

Hi,
I have scores between words, for example - dog and animal have a score
of 0.5 (and not 0), dog and cat have a score of 0.2, etc.
These scores are stored in an index:
Doc1: field words: dog animal
    field score: 0.5
Doc2: field words: dog cat
    field score: 0.2
If the user searches for the word dog, I would like documents that
contain the word animal or cat to also get a good score (one that takes
into account the 0.5 and 0.2).
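
One possible reading of this requirement is query-time expansion: the searched word is kept as is, and each related word is added with its pair score as a boost, using the `term^boost` form of the Lucene query parser syntax. The sketch below only builds the query string; the `related` table and the function name are assumptions, and whether the resulting scores behave as wanted is not something this thread establishes.

```python
def boosted_query(term, related):
    """Build a query string: the term itself plus related words with boosts."""
    clauses = [term]
    for other, score in sorted(related.get(term, {}).items()):
        clauses.append(f"{other}^{score}")
    return " ".join(clauses)


# Hypothetical related-word table, using the example values above.
query = boosted_query("dog", {"dog": {"animal": 0.5, "cat": 0.2}})
```

With the example values, `query` is `dog animal^0.5 cat^0.2`.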

Is it always the case that these come in pairs? In other words, would
you ever have:
field words: dog cat animal
score: 0.9
Also, is the following equivalent, or would it have a different score:
field words: cat dog
score: 0.2

Basically what I do is: for every document in the database, I loop over
the words that appear in the query (the query is long, about the size of
an article), and for every word that appears in each document I take the
score from the index mentioned above and calculate a score between the
query and each document.
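
The loop described above might look roughly like the following Python sketch. The message does not say how the individual pair scores are combined, so summing them is an assumption, as is counting identical words as 1.0; the table, documents and function names are made up for the example.

```python
# Hypothetical pair scores, keyed by alphabetically sorted word pairs.
SCORES = {("animal", "dog"): 0.5, ("cat", "dog"): 0.2}


def pair_score(a, b, scores):
    """Look up the score between two words, regardless of their order."""
    if a == b:
        return 1.0  # assumption: identical words count fully
    return scores.get(tuple(sorted((a, b))), 0.0)


def score_document(query_words, doc_words, scores):
    """Sum the pair scores over every (query word, document word) pair."""
    return sum(pair_score(q, d, scores) for q in query_words for d in doc_words)


docs = {
    "d1": ["animal", "apple"],
    "d2": ["cat"],
    "d3": ["dog"],
    "d4": ["apple"],
}
ranking = sorted(
    docs, key=lambda name: score_document(["dog"], docs[name], SCORES), reverse=True
)
```

This visits every document for every query word, which is the cost a Lucene-based solution is meant to avoid.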
Any suggestion how to do it using Lucene search? How to add these values
to the searcher?

Thinking...

I looked at the boosting option, but couldn't really see how it helps me
in that matter.

What "boosting option" did you look at? Can you explain a bit more?
--------------------------
Grant Ingersoll
http://www.lucidimagination.com/
Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using
Solr/Lucene:
http://www.lucidimagination.com/search


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org