Yeah, I was going to suggest a combination of bi-grams and payloads
and the BoostingTermQuery. There is an NGram TokenFilter in contrib/
analysis that can do the bi-gram part, but the payloads would be extra.
the piece I'm not sure about is how to handle the "synonyms" (they
aren't really, but for lack of a better word), i.e. get when the query
is "cat dog" also get those docs w/ just cat. You might be able to do
something with a PrefixQuery on the n-grams or a separate field that
doesn't do bigrams.
Still, that feels like a stretch for some reason.
-Grant
On Mar 16, 2009, at 3:39 PM, Babak Farhang wrote:
Since you're configuring/writing your own analyzer, why not generate a
token stream that emits bi-grams? Sure, you're expanding the number of
terms in the index, so there's some overhead there. On the plus side,
however, your bi-grams, as you've described them, are ordered--which
reduces the potential # of bi-grams in your data set by a factor of
1/2.
-Babak
Tangent: Liat's example brings up an interesting issue about n-grams,
namely that indexing only internally sorted n-grams is a good strategy
for economizing on the number of terms in an index of n-grams--by a
factor of 1/n!, I think. No?
On Mon, Mar 16, 2009 at 4:55 AM, liat oren <oren.l...@gmail.com>
wrote:
Hi,
Is there any idea of how to make it work?
Many thanks,
Liat
2009/3/9 liat oren <oren.l...@gmail.com>
I have an index that has for every two words a score.
I would like my analyzer - that is a combination of whitespace
tokenizer, a
stop words analyzer and stemming.
The regular score of Lucene takes into account the position of the
words.
I would like to add another factor to that score which is these
score
between words.
Instead of having score 0 to words that are not equal, I would
like to use
this index in the calculation.
Is it better explained?
Thanks a lot,
Liat
2009/3/9 Grant Ingersoll <gsing...@apache.org>
Hmmm, I have some inklings of an idea, but can we take a step
back? Can
you explain the problem you are trying to solve at a higher level
(instead
of the current solution)? I imagine it is something related to
co-occurrence analysis.
On Mar 8, 2009, at 8:05 AM, liat oren wrote:
Hi Grant,
No, you can only have two words - the score is between two words.
"cat dog" and "dog cat" is equivalent, it will actually always
be "cat
dog"
- going by alphabetic order.
About the boosting, I read a bit about it - but couldn't find
how it can
help me, unless I change every appearance of the word dog to
have also
cat
and animal using the weight of the score.
So, for example, every word will appear 10 times from what it is
- if
apple
appears 1, I will do the boosting so it appears 10 times.
If dog appears, then it will also have cat twice (0.2*10) and
animal 5
times(0.5*10).
But I hope to have another better solution.
Thanks
2009/3/8 Grant Ingersoll <gsing...@apache.org>
Hi Liat,
Some questions inline below.
On Mar 8, 2009, at 5:49 AM, liat oren wrote:
Hi,
I have scores between words, for example - dog and animal have
a score
of
0.5 (and not 0), dog and cat have a score of 0.2, etc.
These scores are stored in an index:
Doc1: field words: dog animal
field score: 0.5
Doc2: field words: dog cat
field score: 0.2
If the user searches for the word dog - I would like that
documents
that
contain the word animal or cat will also get a good score
(that will
take
into account the 0.5 and 0.2).
Is it always the case that these come in pairs? In other
words, would
you
ever have:
field words: dog cat animal
score: 0.9
Also, is the following equivalent, or would it have a different
score:
field words: cat dog
score: 0.2
Basically what I do is: for every document in the database, I
loop over
the
words that appear in the query (the query is long in a size of
an
article)
and for every word that appears in each document I take the
score from
the
index mentioned above and calculating a score between the
query and
each
document.
Any suggestion how to do it using Lucene search? How to add
these
values
to
the searcher?
Thinking...
I looked at the boosting option, but couldn't really see how
it helps
me
to
that matter.
What "boosting option" did you look at? Can you explain a bit
more?
--------------------------
Grant Ingersoll
http://www.lucidimagination.com/
Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/
Droids) using
Solr/Lucene:
http://www.lucidimagination.com/search
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org