Doug Cutting wrote:
Wolf Siberski wrote:
Now I found another solution which requires more changes, but IMHO is
much cleaner:
- when a query computes its Weight, it caches it in an attribute
- a query can be 'frozen'. A frozen query always returns the cached
Weight when calling Query.weight().
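The caching-plus-freezing idea Wolf describes could be sketched roughly as follows. This is a minimal illustration with stand-in classes, not the real Lucene Query/Weight API; the names `freeze()` and `createWeight()` are assumptions for the sketch:

```java
// Stand-in for Lucene's Weight; real Weights hold query normalization state.
interface Weight {}

class Query {
    private Weight cachedWeight;   // cached on first computation
    private boolean frozen;        // frozen: always reuse the cached Weight

    // Stand-in for the real weight computation, which consults a Searcher.
    protected Weight createWeight() {
        return new Weight() {};
    }

    // Returns the cached Weight when frozen; otherwise recomputes and caches.
    Weight weight() {
        if (frozen && cachedWeight != null) {
            return cachedWeight;
        }
        cachedWeight = createWeight();
        return cachedWeight;
    }

    void freeze() { frozen = true; }
}
```

A frozen query handed to a remote searcher would then keep returning the Weight computed by the top-level searcher instead of recomputing it locally.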
Wolf Siberski wrote:
The price is an extension (or modification) of the
Searchable interface. I've added corresponding search(Weight...) methods
to the existing search(Query...) methods and deprecated the latter.
I think this is the right solution.
If Searchable is meant to be Lucene internal,
Wolf Siberski wrote:
Now I found another solution which requires more changes, but IMHO is
much cleaner:
- when a query computes its Weight, it caches it in an attribute
- a query can be 'frozen'. A frozen query always returns the cached
Weight when calling Query.weight().
Originally there was no
Doug Cutting wrote:
Christoph Goller wrote:
The similarity specified for the search has to be modified so that both
idf(...) AND queryNorm(...) always return 1 and as you say everything
except for tf(term,doc)*docNorm(doc) could be precompiled into the boosts
of the rewritten query.
On Tuesday 08 February 2005 00:06, David Spencer wrote:
So, does this make sense and is it a useful way of trying to evaluate the
Similarities?
Is this the MultiFieldQueryParser from Lucene 1.4? Then it's buggy
anyway, so it probably doesn't make sense to test it. But even with the
current SVN
Dave has done great work pulling this together. However, the same
comment is true for DistributingMultiFieldQueryParser. There is only 1
field, so both multi-field query parsers are equivalent to QueryParser.
Chuck
-----Original Message-----
From: Daniel Naber [mailto:[EMAIL PROTECTED]
Paul Elschot wrote:
On Wednesday 02 February 2005 03:38, Chuck Williams wrote:
I was hoping to do this by simple thresholding, e.g. achieve a property
like: results with all terms matched are always in [0.8, 1.0], and results
missing a term always have a score less than 0.8. I'm
I agree with all this if one is using Default-AND.
I prefer Default-OR, combined with scores that support the ability of
the application to know the quality of results returned (e.g., which
results match all terms, if any, and which don't, at the very least).
What you describe below as
I worked w/ Chuck to get up a test page that shows search results with 2
versions of Similarity side by side.
URL here:
http://www.searchmorph.com/kat/wikipedia-similarity.jsp
Weblog entry here w/ some more details:
http://www.searchmorph.com/weblog/index.php?id=46
But briefly
Doug Cutting wrote:
It would translate a query t1 t2 given fields f1 and f2 into
something like:
+(f1:t1^b1 f2:t1^b2)
+(f2:t1^b1 f2:t2^b2)
Oops. The first term on that line should be f1:t2, not f2:t1:
+(f1:t2^b1 f2:t2^b2)
f1:t1 t2~s1^b3
f2:t1 t2~s2^b4
Doug
Doug Cutting wrote:
Here's a three term query:
+(f1:t1^b1 f2:t1^b2)
+(f1:t2^b1 f2:t2^b2)
+(f1:t3^b1 f2:t3^b2)
f1:t1 t2 t3~s1^b3
f2:t1 t2 t3~s2^b4
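Doug's expansion pattern generalizes to any number of fields and terms: one required per-term disjunction across the fields, plus one sloppy-phrase clause per field over all the terms. A small generator sketch (the boost and slop labels `b1..bn`, `s1..sn` are placeholders, and phrases are quoted here for clarity, which Doug's shorthand omits):

```java
import java.util.StringJoiner;

class MultiFieldExpansion {
    // Builds the query string for Doug's expansion given fields and terms.
    static String expand(String[] fields, String[] terms) {
        StringJoiner out = new StringJoiner(" ");
        // +(f1:t^b1 f2:t^b2 ...) -- each term must match in some field
        for (String t : terms) {
            StringJoiner clause = new StringJoiner(" ", "+(", ")");
            for (int i = 0; i < fields.length; i++) {
                clause.add(fields[i] + ":" + t + "^b" + (i + 1));
            }
            out.add(clause.toString());
        }
        // f:"t1 t2 ..."~s^b -- optional proximity clause per field
        for (int i = 0; i < fields.length; i++) {
            out.add(fields[i] + ":\"" + String.join(" ", terms)
                    + "\"~s" + (i + 1) + "^b" + (fields.length + i + 1));
        }
        return out.toString();
    }
}
```

For two fields and two terms this reproduces Doug's corrected two-term expansion above.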
That expansion is scalable, but it only accounts for proximity of all
query terms together. E.g., it does not favor a match
Folks,
In the light of this discussion, I'm working slowly on a new release of
Luke, which will include a BeanShell-driven Similarity designer.
However, this particular module is not finished yet... given my current
workload, this will take a week or two more...
--
Best regards,
Andrzej
Doug Cutting wrote:
David Spencer wrote:
But what is right if there are 2 terms in terms of the phrases -
does it have a phrase for every pair of terms like this (ignore fields
and boosts and proximity for a sec):
search for t1 t2 t3 gives you these phrases in addition to the
direct field
Chuck Williams wrote:
That expansion is scalable, but it only accounts for proximity of all
query terms together. E.g., it does not favor a match where t1 and t2
are close together while t3 is distant over a match where all 3 terms
are distant. Worse, it would not favor a match with t1 and t2 in
Doug Cutting wrote:
What did you think of my DensityPhraseQuery proposal?
It is a step in the direction of what I have in mind, but I'd like to go
further. How about a query class with these properties:
1. Inputs are:
a. F = list of fields
b. B = list of field boosts (1:1
On Saturday 29 January 2005 00:37, David Spencer wrote:
Hmmm, is it safe to assume I can build the index w/ lucene-1.4.3.jar but
deploy the webapp for searching w/ lucene-1.5-rc1-dev.jar?
Yes, everything else would be a bug.
And is the current code supposed to build with so many
Christoph Goller wrote:
The similarity specified for the search has to be modified so that both
idf(...) AND queryNorm(...) always return 1 and as you say everything
except for tf(term,doc)*docNorm(doc) could be precompiled into the boosts
of the rewritten query. coord/tf/sloppyFreq computation
David,
I just posted WikipediaSimilarity to Bug 32674. I've also reviewed and
tested the port to Java 1.4 -- it's fine (although all the casts remind
me why I like 1.5 so much). Thanks to Miles Barr for this port!
You don't want any of the test classes. You just need these 4 classes:
On Friday 28 January 2005 17:53, Chuck Williams wrote:
I think the baseline should use Lucene's MultiFieldQueryParser to expand
the query to search both title and body fields, as this is presumably
the current out-of-the-box solution.
Please remember that this is kind of buggy in Lucene 1.4:
Yes, but this is part of the point. Lucene is a field-based search engine
and its built-in support for taking simple queries and searching across
relevant fields is poor. The fact that it requires all terms in all
fields is part of the problem. Once that is addressed, another problem
is that
Sorry for the mispost -- fingers slipped...
Yes, but this is part of the point. Lucene is a field-based search engine
and its built-in support for taking simple queries and searching across
relevant fields is poor. The fact that it requires all terms in all
fields is part of the problem. Once
OK with me, assuming everything will run in the CVS version and there aren't
changes that affect the semantics of any of my code. I've never tried it, and
don't know whether or not Dave has.
Chuck
-----Original Message-----
From: Daniel Naber [mailto:[EMAIL PROTECTED]
Sent: Friday,
Daniel Naber wrote:
On Friday 28 January 2005 22:45, Chuck Williams wrote:
The fact that it requires all terms in all
fields is part of the problem. Once that is addressed, another problem
is that Lucene does not provide a good mechanis
That's fixed in CVS, so maybe the CVS version should be
Dave, are you using MultiFieldQueryParser and DefaultSimilarity for the
vanilla implementation?
It's important to know what we are comparing...
Chuck
-----Original Message-----
From: David Spencer [mailto:[EMAIL PROTECTED]
Sent: Friday, January 28, 2005 3:38 PM
To: Lucene Developers
Chuck Williams wrote:
Dave, are you using MultiFieldQueryParser and DefaultSimilarity for the
vanilla implementation?
Yes that's the plan. I'll try to have links to source etc too.
It's important to know what we are comparing...
I agree, that's why I'm trying to make sure everything is spelled
Doug Cutting wrote:
Like anything else in an all-volunteer operation, it will only happen if
folks volunteer to do it. Someone needs to take the lead and index a
reference collection with a couple of different Similarity
implementations and post the code and the results of various
David,
Thanks for taking the lead on this!
You have two fields for this collection, title and body, right?
I'd like to configure this to use my DistributingMultiFieldQueryParser,
MaxDisjunctionQuery (and MaxDisjunctionScorer), and Similarity.
DistributingMultiFieldQueryParser has a simple API
Wolf Siberski wrote:
Doug Cutting wrote:
So, when a query is executed on a MultiSearcher of RemoteSearchables,
the following remote calls are made:
1. RemoteSearchable.rewrite(Query) is called
After that step, are wildcards replaced by term lists?
Yes.
I haven't taken a look at the rewrite()
Chuck Williams wrote:
Doug Cutting wrote:
It would indeed be nice to be able to short-circuit rewriting for
queries where it is a no-op. Do you have a proposal for how this could
be done?
First, this gets into the other part of Bug 31841. I don't believe
MultiSearcher.rewrite() is ever
On Thursday 13 January 2005 01:19, Chuck Williams wrote:
I think there is another problem here. It is currently the Weight
implementations that do rewrite(), which requires access to the index,
not just to the idf's. E.g., RangeQuery.rewrite() must find the terms
in the index within the
It's a good point that the aggregate idf table holds enough information
to do the rewrite()'s. So MultiSearcher can compute the Weights, which
avoids the need to distribute the aggregate tables to the remote nodes.
It is still necessary to compute them and keep them current under index
updates on
It just seems like a lot of IPC activity for each query. As things
stand now, I think you are proposing this?
1. MultiSearcher calls the remote node to rewrite the query,
requiring serialization of the query.
2. The remote node returns the rewritten query to the dispatcher
node, which
Chuck Williams wrote:
I was thinking of the aggressive version with an index-time solution,
although I don't know the Lucene architecture for distributed indexing
and searching well enough to formulate the idea precisely.
Conceptually, I'd like each server that owns a slice of the index in a
Ahhh, I didn't understand the part about caching the results in the
central dispatch node. I thought you were accessing the remote nodes on
every query to sum the docFreq's in each remote index for each query
term. I was trying to avoid a large number of round-trips to the remote
nodes by
Chuck Williams wrote:
There needs to be a way to create the aggregate docFreq table and keep
it current under incremental changes to the indices on the various
remote nodes.
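The aggregate docFreq table Chuck describes amounts to summing the per-node document frequencies so that idfs are computed against the whole collection, not one slice of it. A minimal sketch with stand-in interfaces (not the real Searchable API):

```java
// Stand-in for a remote sub-index's statistics interface.
interface SubSearcher {
    int docFreq(String term);   // per-node document frequency
}

class AggregateSearcher {
    private final SubSearcher[] nodes;

    AggregateSearcher(SubSearcher[] nodes) { this.nodes = nodes; }

    // Collection-wide docFreq = sum of the per-node docFreqs. A real
    // implementation would cache these values, since they are stable for
    // the lifetime of the underlying IndexReaders.
    int docFreq(String term) {
        int sum = 0;
        for (SubSearcher n : nodes) {
            sum += n.docFreq(term);
        }
        return sum;
    }
}
```

Keeping this table current is exactly the incremental-update problem raised above; as long as the searchers are rebuilt together, a simple cache suffices.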
I think you're getting ahead of yourself. Searchers are based on
IndexReaders, and hence docFreqs don't change until a new
Wolf Siberski wrote:
Chuck Williams wrote:
This is a nice solution! By having MultiSearcher create the Weight, it
can pass itself in as the searcher, thereby allowing the correct
docFreq() method to be called. This is similar to what I tried to do
with topmostSearcher, but a much better way to
Doug Cutting wrote:
Searchers are based on IndexReaders, and hence docFreqs don't change
until a new Searcher is created. So long as this is true, and the
central dispatch node uses a searcher, then a simple cache, perhaps one
that is pre-fetched, is all that's feasible. It shouldn't
I think there is another problem here. It is currently the Weight
implementations that do rewrite(), which requires access to the index,
not just to the idf's. E.g., RangeQuery.rewrite() must find the terms
in the index within the range. So, the Weight cannot be computed in the
MultiSearcher,
Chuck Williams wrote:
As Wolf does, I hope a committer with deep knowledge of Lucene's design
in this area will weigh in on the issue and help to resolve it.
The root of the bug is in MultiSearcher.search(). This should construct
a Weight, weight the query, then score the now-weighted query.
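Chuck's proposed fix can be sketched concretely: the MultiSearcher computes the Weight once from its own collection-wide statistics, then every sub-searcher scores with that same weight, so idf is consistent across nodes. All names here are illustrative stand-ins, with tf*idf standing in for the full Similarity:

```java
class WeightedMultiSearch {
    // Stand-in for a remote sub-searcher holding one document.
    interface Node {
        int docFreq(String term);   // per-node document frequency
        int tf(String term);        // term frequency in the node's document
    }

    // The "Weight": a query-level idf factor computed once from global stats.
    static double idf(String term, Node[] nodes, int numDocs) {
        int df = 0;
        for (Node n : nodes) {
            df += n.docFreq(term);  // global docFreq, summed across nodes
        }
        return Math.log((double) numDocs / (df + 1)) + 1.0;
    }

    // 1. construct the Weight, 2. weight the query, 3. score the weighted query.
    static double[] search(String term, Node[] nodes, int numDocs) {
        double w = idf(term, nodes, numDocs);   // built by the top-level searcher
        double[] scores = new double[nodes.length];
        for (int i = 0; i < nodes.length; i++) {
            scores[i] = nodes[i].tf(term) * w;  // same Weight on every node
        }
        return scores;
    }
}
```

The key point is that `idf` is computed exactly once, at the dispatch node, instead of independently (and inconsistently) on each remote searcher.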
This is a nice solution! By having MultiSearcher create the Weight, it
can pass itself in as the searcher, thereby allowing the correct
docFreq() method to be called. This is similar to what I tried to do
with topmostSearcher, but a much better way to do it.
I'm still left wondering if having
Chuck Williams wrote:
This is a nice solution! By having MultiSearcher create the Weight, it
can pass itself in as the searcher, thereby allowing the correct
docFreq() method to be called.
Glad to hear it at least makes sense... Now I hope it works!
I'm still left wondering if having
Doug Cutting wrote:
I'm not sure exactly what you mean by distribute the idf information
out to the RemoteSearchable. I think one might profitably implement a
docFreq() cache in RemoteSearchable. This could be a simple cache, or
it could be fairly aggressive, pre-fetching all the