Yeah, I was going to suggest a combination of bi-grams, payloads, and the BoostingTermQuery. There is an NGram TokenFilter in contrib/analysis that can do the bi-gram part, but the payloads would be extra.

The piece I'm not sure about is how to handle the "synonyms" (they aren't really, but for lack of a better word), i.e. when the query is "cat dog", also getting those docs w/ just cat. You might be able to do something with a PrefixQuery on the n-grams or a separate field that doesn't do bigrams.

Still, that feels like a stretch for some reason.

-Grant


On Mar 16, 2009, at 3:39 PM, Babak Farhang wrote:

Since you're configuring/writing your own analyzer, why not generate a
token stream that emits bi-grams? Sure, you're expanding the number of
terms in the index, so there's some overhead there.  On the plus side,
however, your bi-grams, as you've described them, are ordered, which
cuts the potential # of bi-grams in your data set roughly in half.
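
A minimal sketch of the ordered bi-gram idea, written in plain Python rather than as a Lucene TokenFilter. It pairs adjacent tokens and sorts each pair alphabetically, so "dog cat" and "cat dog" produce the same term. The function name `ordered_bigrams`, the underscore separator, and the choice to pair only adjacent tokens are assumptions for illustration, not something specified in this thread.

```python
def ordered_bigrams(tokens):
    """Yield one term per adjacent token pair, sorted within the pair."""
    for left, right in zip(tokens, tokens[1:]):
        if left == right:
            continue  # assumption: a word paired with itself has no pair score
        first, second = sorted((left, right))
        yield f"{first}_{second}"


terms = list(ordered_bigrams("dog cat animal dog".split()))
print(terms)
```

Because each pair is sorted before it is emitted, the index never has to hold both "cat_dog" and "dog_cat".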

-Babak

Tangent: Liat's example brings up an interesting issue about n-grams,
namely that indexing only internally sorted n-grams is a good strategy
for economizing on the number of terms in an index of n-grams--by a
factor of 1/n!, I think.  No?
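
A quick way to sanity-check the 1/n! estimate on a toy vocabulary (the four-word vocabulary and n = 3 are made up for illustration). The count below only covers n-grams whose words are all distinct; when words may repeat inside an n-gram, the saving is smaller than n!.

```python
from itertools import combinations, permutations
from math import factorial

vocabulary = ["animal", "cat", "dog", "fish"]  # hypothetical toy vocabulary
n = 3

# Every arrangement of n distinct words is its own term.
unsorted_terms = set(permutations(vocabulary, n))
# Internally sorted n-grams: one term per set of n distinct words.
sorted_terms = set(combinations(vocabulary, n))

ratio = len(unsorted_terms) // len(sorted_terms)
print(len(unsorted_terms), len(sorted_terms), ratio, factorial(n))
```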

On Mon, Mar 16, 2009 at 4:55 AM, liat oren <oren.l...@gmail.com> wrote:
Hi,
Is there any idea of how to make it work?
Many thanks,
Liat

2009/3/9 liat oren <oren.l...@gmail.com>

I have an index that holds a score for every pair of words.
I would like my analyzer - that is a combination of whitespace tokenizer, a
stop words analyzer and stemming.

The regular score of Lucene takes into account the position of the words.

I would like to add another factor to that score: these scores
between words.
Instead of giving a score of 0 to words that are not equal, I would like to
use this index in the calculation.

Is it better explained?

Thanks a lot,
Liat

2009/3/9 Grant Ingersoll <gsing...@apache.org>

Hmmm, I have some inklings of an idea, but can we take a step back? Can
you explain the problem you are trying to solve at a higher level (instead
of the current solution)?  I imagine it is something related to
co-occurrence analysis.



On Mar 8, 2009, at 8:05 AM, liat oren wrote:

Hi Grant,

No, you can only have two words - the score is between two words.

"cat dog" and "dog cat" are equivalent; it will actually always be "cat dog",
going by alphabetical order.

About the boosting, I read a bit about it - but couldn't find how it can
help me, unless I change every appearance of the word dog to also include
cat and animal, weighted by the score.
So, for example, every word will appear 10 times for each occurrence - if
apple appears once, I will boost it so that it appears 10 times.
If dog appears, then it will also get cat twice (0.2*10) and animal 5
times (0.5*10).
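
The repetition scheme described above could be sketched like this, outside of Lucene. The names `RELATED`, `BASE_REPEATS`, and `expand` are hypothetical; the scores and the factor of 10 are the ones given in the message.

```python
from collections import Counter

# Pair scores from the thread: dog/cat = 0.2, dog/animal = 0.5.
RELATED = {"dog": {"cat": 0.2, "animal": 0.5}}
BASE_REPEATS = 10  # every original word is repeated this many times


def expand(tokens):
    """Repeat each token, then add its related words scaled by their score."""
    expanded = []
    for token in tokens:
        expanded.extend([token] * BASE_REPEATS)
        for other, score in RELATED.get(token, {}).items():
            expanded.extend([other] * round(score * BASE_REPEATS))
    return expanded


counts = Counter(expand(["apple", "dog"]))
print(counts["apple"], counts["dog"], counts["cat"], counts["animal"])
```

This inflates every document by at least a factor of ten, which is presumably part of why a different approach is wanted.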

But I hope there is a better solution.


Thanks
2009/3/8 Grant Ingersoll <gsing...@apache.org>

Hi Liat,

Some questions inline below.

On Mar 8, 2009, at 5:49 AM, liat oren wrote:

Hi,


I have scores between words, for example - dog and animal have a score
of 0.5 (and not 0), dog and cat have a score of 0.2, etc.
These scores are stored in an index:
Doc1: field words: dog animal
    field score: 0.5
Doc2: field words: dog cat
    field score: 0.2

If the user searches for the word dog - I would like documents that
contain the word animal or cat to also get a good score (one that takes
into account the 0.5 and 0.2).
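
For illustration, here is one way the pair documents above could be turned into a lookup from a word to its related words and scores. This is a plain-Python sketch, not Lucene code; `pair_docs` mirrors Doc1 and Doc2, `build_lookup` is a hypothetical helper, and treating each pair as symmetric relies on the answer given elsewhere in this thread that "cat dog" and "dog cat" are equivalent.

```python
# Mirrors Doc1 and Doc2 from the message above.
pair_docs = [
    {"words": "dog animal", "score": 0.5},
    {"words": "dog cat", "score": 0.2},
]


def build_lookup(docs):
    """Map each word to {related word: score}, in both directions."""
    lookup = {}
    for doc in docs:
        a, b = doc["words"].split()
        lookup.setdefault(a, {})[b] = doc["score"]
        lookup.setdefault(b, {})[a] = doc["score"]
    return lookup


related = build_lookup(pair_docs)
print(related["dog"])
```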


Is it always the case that these come in pairs? In other words, would
you ever have:
field words: dog cat animal
score: 0.9

Also, is the following equivalent, or would it have a different score:
field words: cat dog
score: 0.2




Basically what I do is: for every document in the database, I loop over
the words that appear in the query (the query is long, about the size of
an article), and for every word that appears in each document I take the
score from the index mentioned above and calculate a score between the
query and each document.
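
A rough sketch of the loop described above, in plain Python. The message does not say how the per-word scores are combined or how identical words are scored, so summing them and counting an exact match as 1.0 are both assumptions; `word_score`, `score`, and the sample documents are hypothetical.

```python
# Pair scores keyed by unordered pair, so word order does not matter.
PAIR_SCORES = {
    frozenset(("dog", "animal")): 0.5,
    frozenset(("dog", "cat")): 0.2,
}


def word_score(a, b):
    if a == b:
        return 1.0  # assumption: identical words count as a full match
    return PAIR_SCORES.get(frozenset((a, b)), 0.0)


def score(query_words, doc_words):
    """Sum the pair scores between every query word and every distinct doc word."""
    return sum(word_score(q, d) for q in query_words for d in set(doc_words))


docs = {"d1": ["animal", "farm"], "d2": ["cat", "dog"], "d3": ["tree"]}
query = ["dog"]
ranked = sorted(docs, key=lambda name: score(query, docs[name]), reverse=True)
print(ranked)
```

This visits every document for every query, which is the cost that moving the pair scores into the index is meant to avoid.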

Any suggestions for how to do this using Lucene search? How do I add these
values to the searcher?


Thinking...



I looked at the boosting option, but couldn't really see how it helps
me in that matter.


What "boosting option" did you look at? Can you explain a bit more?


--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using
Solr/Lucene:
http://www.lucidimagination.com/search


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org




