Re: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-02-22 Thread Wolf Siberski
Doug Cutting wrote: Wolf Siberski wrote: Now I found another solution which requires more changes, but IMHO is much cleaner: - when a query computes its Weight, it caches it in an attribute - a query can be 'frozen'. A frozen query always returns the cached Weight when calling Query.weight().
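
The 'frozen query' idea quoted above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the patch attached to the bug: SketchWeight, FreezableQuery, and the weight factory are hypothetical names, and nothing here is Lucene's actual Query or Weight API.

```java
import java.util.function.Supplier;

/** Hypothetical stand-in for a computed weight; not Lucene's Weight interface. */
final class SketchWeight {
    final float value;
    SketchWeight(float value) { this.value = value; }
}

/** Hypothetical query that caches its weight in an attribute and can be frozen. */
final class FreezableQuery {
    // Assumption: some callback that computes a fresh weight for this query.
    private final Supplier<SketchWeight> weightFactory;
    private SketchWeight cached;
    private boolean frozen;

    FreezableQuery(Supplier<SketchWeight> weightFactory) {
        this.weightFactory = weightFactory;
    }

    /** After this call, weight() keeps returning the cached weight. */
    void freeze() {
        frozen = true;
    }

    SketchWeight weight() {
        if (frozen && cached != null) {
            return cached;            // frozen: never recompute
        }
        cached = weightFactory.get(); // not frozen (or nothing cached yet): compute and cache
        return cached;
    }
}
```

In this sketch a dispatcher would compute the weight once, freeze the query, and every later weight() call would return the same object.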

Re: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-02-22 Thread Doug Cutting
Wolf Siberski wrote: The price is an extension (or modification) of the Searchable interface. I've added corresponding search(Weight...) methods to the existing search(Query...) methods and deprecated the latter. I think this is the right solution. If Searchable is meant to be Lucene internal,

Re: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-02-21 Thread Doug Cutting
Wolf Siberski wrote: Now I found another solution which requires more changes, but IMHO is much cleaner: - when a query computes its Weight, it caches it in an attribute - a query can be 'frozen'. A frozen query always returns the cached Weight when calling Query.weight(). Originally there was no

Re: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-02-18 Thread Wolf Siberski
Doug Cutting wrote: Christoph Goller wrote: The similarity specified for the search has to be modified so that both idf(...) AND queryNorm(...) always return 1 and as you say everything except for tf(term,doc)*docNorm(doc) could be precompiled into the boosts of the rewritten query.

Re: single field code ready - Re: URL to compare 2 Similarity's ready-- Re: Scoring benchmark evaluation. Was RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-02-07 Thread Daniel Naber
On Tuesday 08 February 2005 00:06, David Spencer wrote: So, does this make sense and is it a useful way of trying to evaluate the Similarities? Is this the MultiFieldQueryParser from Lucene 1.4? Then it's buggy anyway, so it probably doesn't make sense to test it. But even with the current SVN

RE: single field code ready - Re: URL to compare 2 Similarity's ready-- Re: Scoring benchmark evaluation. Was RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-02-07 Thread Chuck Williams
Dave has done great work pulling this together. However, the same comment is true for DistributingMultiFieldQueryParser. There is only 1 field, so both multi-field query parsers are equivalent to QueryParser. Chuck -Original Message- From: Daniel Naber [mailto:[EMAIL PROTECTED]

RE: URL to compare 2 Similarity's ready-- Re: Scoring benchmark evaluation. Was RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-02-02 Thread Chuck Williams
Paul Elschot wrote: On Wednesday 02 February 2005 03:38, Chuck Williams wrote: I was hoping to do this by simple thresholding, e.g. achieve a property like results with all terms matched are always in [0.8, 1.0], and results missing a term always have a score less than 0.8. I'm

RE: URL to compare 2 Similarity's ready-- Re: Scoring benchmark evaluation. Was RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-02-01 Thread Chuck Williams
I agree with all this if one is using Default-AND. I prefer Default-OR, combined with scores that support the ability of the application to know the quality of results returned (e.g., which results match all terms, if any, and which don't, at the very least). What you describe below as

URL to compare 2 Similarity's ready-- Re: Scoring benchmark evaluation. Was RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-31 Thread David Spencer
I worked w/ Chuck to get up a test page that shows search results with 2 versions of Similarity side by side. URL here: http://www.searchmorph.com/kat/wikipedia-similarity.jsp Weblog entry here w/ some more details: http://www.searchmorph.com/weblog/index.php?id=46 But briefly

Re: URL to compare 2 Similarity's ready-- Re: Scoring benchmark evaluation. Was RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-31 Thread Doug Cutting
Doug Cutting wrote: It would translate a query "t1 t2" given fields f1 and f2 into something like: +(f1:t1^b1 f2:t1^b2) +(f2:t1^b1 f2:t2^b2) Oops. The first term on that line should be f1:t2, not f2:t1: +(f1:t2^b1 f2:t2^b2) f1:"t1 t2"~s1^b3 f2:"t1 t2"~s2^b4 Doug

RE: URL to compare 2 Similarity's ready-- Re: Scoring benchmark evaluation. Was RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-31 Thread Chuck Williams
Doug Cutting wrote: Here's a three term query: +(f1:t1^b1 f2:t1^b2) +(f1:t2^b1 f2:t2^b2) +(f1:t3^b1 f2:t3^b2) f1:"t1 t2 t3"~s1^b3 f2:"t1 t2 t3"~s2^b4 That expansion is scalable, but it only accounts for proximity of all query terms together. E.g., it does not favor a match
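
The expansion pattern in the quoted query can be generated mechanically. The following is a minimal sketch that only builds the query string; the class and method names are hypothetical, boosts and slops are passed as opaque strings, and the double quotes around the sloppy phrase are an assumption (the archive snippet appears to have dropped them).

```java
import java.util.List;
import java.util.StringJoiner;

/** Hypothetical helper that builds the expanded multi-field query string. */
final class MultiFieldExpansionSketch {
    static String expand(List<String> terms, List<String> fields,
                         List<String> termBoosts, List<String> slops,
                         List<String> phraseBoosts) {
        StringJoiner out = new StringJoiner(" ");
        // One required clause per term, matching the term in any field.
        for (String term : terms) {
            StringJoiner clause = new StringJoiner(" ", "+(", ")");
            for (int i = 0; i < fields.size(); i++) {
                clause.add(fields.get(i) + ":" + term + "^" + termBoosts.get(i));
            }
            out.add(clause.toString());
        }
        // One optional sloppy phrase per field over all terms together.
        String phrase = String.join(" ", terms);
        for (int i = 0; i < fields.size(); i++) {
            out.add(fields.get(i) + ":\"" + phrase + "\"~" + slops.get(i)
                    + "^" + phraseBoosts.get(i));
        }
        return out.toString();
    }
}
```

The output grows linearly in the number of terms and fields, which is the sense in which the expansion is called scalable; it adds only one phrase per field, which is why it cannot reward proximity of a subset of the terms.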

Re: URL to compare 2 Similarity's ready-- Re: Scoring benchmark evaluation. Was RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-31 Thread Andrzej Bialecki
Folks, In the light of this discussion, I'm working slowly on a new release of Luke, which will include a BeanShell-driven Similarity designer. However, this particular module is not finished yet... given my current workload, this will take a week or two more... -- Best regards, Andrzej

Re: URL to compare 2 Similarity's ready-- Re: Scoring benchmark evaluation. Was RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-31 Thread David Spencer
Doug Cutting wrote: David Spencer wrote: But what is right if there are 2 terms in terms of the phrases - does it have a phrase for every pair of terms like this (ignore fields and boosts and proximity for a sec): search for t1 t2 t3 gives you these phrases in addition to the direct field

Re: URL to compare 2 Similarity's ready-- Re: Scoring benchmark evaluation. Was RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-31 Thread Doug Cutting
Chuck Williams wrote: That expansion is scalable, but it only accounts for proximity of all query terms together. E.g., it does not favor a match where t1 and t2 are close together while t3 is distant over a match where all 3 terms are distant. Worse, it would not favor a match with t1 and t2 in

RE: URL to compare 2 Similarity's ready-- Re: Scoring benchmark evaluation. Was RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-31 Thread Chuck Williams
Doug Cutting wrote: What did you think of my DensityPhraseQuery proposal? It is a step in the direction of what I have in mind, but I'd like to go further. How about a query class with these properties: 1. Inputs are: a. F = list of fields b. B = list of field boosts (1:1

Re: Scoring benchmark evaluation. Was RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-29 Thread Daniel Naber
On Saturday 29 January 2005 00:37, David Spencer wrote: Hmmm, is it safe to assume I can build the index w/ lucene-1.4.3.jar but deploy the webapp for searching w/ lucene-1.5-rc1-dev.jar? Yes, everything else would be a bug. And is the current code supposed to build with so many

Re: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-28 Thread Doug Cutting
Christoph Goller wrote: The similarity specified for the search has to be modified so that both idf(...) AND queryNorm(...) always return 1 and as you say everything except for tf(term,doc)*docNorm(doc) could be precompiled into the boosts of the rewritten query. coord/tf/sloppyFreq computation
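
The precompilation described in the quote can be illustrated with plain arithmetic. This sketch assumes the classic per-term score has the shape tf * idf^2 * boost * queryNorm * docNorm; that formula, the class name, and the method names are assumptions made for illustration and are not taken from the thread.

```java
/** Hypothetical arithmetic sketch; not Lucene's Similarity or Weight classes. */
final class PrecompiledBoostSketch {
    /** Assumed per-term score shape: tf * docNorm * idf^2 * boost * queryNorm. */
    static double score(double tf, double docNorm, double idf,
                        double queryNorm, double boost) {
        return tf * docNorm * idf * idf * boost * queryNorm;
    }

    /** Folds idf and queryNorm into the boost of the rewritten query. */
    static double precompiledBoost(double idf, double queryNorm, double boost) {
        return idf * idf * queryNorm * boost;
    }
}
```

Scoring with idf and queryNorm both fixed at 1 and the precompiled boost then gives the same value as the full formula, leaving only tf and docNorm to be evaluated per document.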

RE: Scoring benchmark evaluation. Was RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-28 Thread Chuck Williams
David, I just posted WikipediaSimilarity to Bug 32674. I've also reviewed and tested the port to Java 1.4 -- it's fine (although all the casts remind me why I like 1.5 so much). Thanks to Miles Barr for this port! You don't want any of the test classes. You just need these 4 classes:

Re: Scoring benchmark evaluation. Was RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-28 Thread Daniel Naber
On Friday 28 January 2005 17:53, Chuck Williams wrote: I think the baseline should use Lucene's MultiFieldQueryParser to expand the query to search both title and body fields, as this is presumably the current out-of-the-box solution. Please remember that this is kind of buggy in Lucene 1.4:

RE: Scoring benchmark evaluation. Was RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-28 Thread Chuck Williams
Yes, but this is part of the point. Lucene is a field-based search engine and its built-in support for taking simple queries and searching across relevant fields is poor. The fact that it requires all terms in all fields is part of the problem. Once that is addressed, another problem is that

RE: Scoring benchmark evaluation. Was RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-28 Thread Chuck Williams
Sorry for the mispost -- fingers slipped... Yes, but this is part of the point. Lucene is a field-based search engine and its built-in support for taking simple queries and searching across relevant fields is poor. The fact that it requires all terms in all fields is part of the problem. Once

RE: Scoring benchmark evaluation. Was RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-28 Thread Chuck Williams
OK with me, assuming everything will run in the CVS version and there aren't changes that affect the semantics of any of my code. I've never tried it, and don't know whether or not Dave has. Chuck -Original Message- From: Daniel Naber [mailto:[EMAIL PROTECTED] Sent: Friday,

Re: Scoring benchmark evaluation. Was RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-28 Thread David Spencer
Daniel Naber wrote: On Friday 28 January 2005 22:45, Chuck Williams wrote: The fact that it requires all terms in all fields is part of the problem. Once that is addressed, another problem is that Lucene does not provide a good mechanis That's fixed in CVS, so maybe the CVS version should be

RE: Scoring benchmark evaluation. Was RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-28 Thread Chuck Williams
Dave, are you using MultiFieldQueryParser and DefaultSimilarity for the vanilla implementation? It's important to know what we are comparing... Chuck -Original Message- From: David Spencer [mailto:[EMAIL PROTECTED] Sent: Friday, January 28, 2005 3:38 PM To: Lucene Developers

Re: Scoring benchmark evaluation. Was RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-28 Thread David Spencer
Chuck Williams wrote: Dave, are you using MultiFieldQueryParser and DefaultSimilarity for the vanilla implementation? Yes that's the plan. I'll try to have links to source etc too. It's important to know what we are comparing... I agree, that's why I'm trying to make sure everything is spelled

RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-27 Thread Chuck Williams
Doug Cutting wrote: Like anything else in an all-volunteer operation, it will only happen if folks volunteer to do it. Someone needs to take the lead and index a reference collection with a couple of different Similarity implementations and post the code and the results of various

Scoring benchmark evaluation. Was RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-27 Thread Chuck Williams
David, Thanks for taking the lead on this! You have two fields for this collection, title and body, right? I'd like to configure this to use my DistributingMultiFieldQueryParser, MaxDisjunctionQuery (and MaxDisjunctionScorer), and Similarity. DistributingMultiFieldQueryParser has a simple API

Re: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-14 Thread Doug Cutting
Wolf Siberski wrote: Doug Cutting wrote: So, when a query is executed on a MultiSearcher of RemoteSearchables, the following remote calls are made: 1. RemoteSearchable.rewrite(Query) is called After that step, are wildcards replaced by term lists? Yes. I haven't taken a look at the rewrite()

Re: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-14 Thread Doug Cutting
Chuck Williams wrote: Doug Cutting wrote: It would indeed be nice to be able to short-circuit rewriting for queries where it is a no-op. Do you have a proposal for how this could be done? First, this gets into the other part of Bug 31841. I don't believe MultiSearcher.rewrite() is ever

Re: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-13 Thread Paul Elschot
On Thursday 13 January 2005 01:19, Chuck Williams wrote: I think there is another problem here. It is currently the Weight implementations that do rewrite(), which requires access to the index, not just to the idf's. E.g., RangeQuery.rewrite() must find the terms in the index within the

RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-13 Thread Chuck Williams
It's a good point that the aggregate idf table holds enough information to do the rewrite()'s. So MultiSearcher can compute the Weights, which avoids the need to distribute the aggregate tables to the remote nodes. It is still necessary to compute them and keep them current under index updates on

RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-13 Thread Chuck Williams
It just seems like a lot of IPC activity for each query. As things stand now, I think you are proposing this? 1. MultiSearcher calls the remote node to rewrite the query, requiring serialization of the query. 2. The remote node returns the rewritten query to the dispatcher node, which

Re: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-12 Thread Doug Cutting
Chuck Williams wrote: I was thinking of the aggressive version with an index-time solution, although I don't know the Lucene architecture for distributed indexing and searching well enough to formulate the idea precisely. Conceptually, I'd like each server that owns a slice of the index in a

RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-12 Thread Chuck Williams
Ahhh, I didn't understand the part about caching the results in the central dispatch node. I thought you were accessing the remote nodes on every query to sum the docFreq's in each remote index for each query term. I was trying to avoid a large number of round-trips to the remote nodes by

Re: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-12 Thread Doug Cutting
Chuck Williams wrote: There needs to be a way to create the aggregate docFreq table and keep it current under incremental changes to the indices on the various remote nodes. I think you're getting ahead of yourself. Searchers are based on IndexReaders, and hence docFreqs don't change until a new

Re: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-12 Thread Doug Cutting
Wolf Siberski wrote: Chuck Williams wrote: This is a nice solution! By having MultiSearcher create the Weight, it can pass itself in as the searcher, thereby allowing the correct docFreq() method to be called. This is similar to what I tried to do with topmostSearcher, but a much better way to

RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-12 Thread Chuck Williams
Doug Cutting wrote: Searchers are based on IndexReaders, and hence docFreqs don't change until a new Searcher is created. So long as this is true, and the central dispatch node uses a searcher, then a simple cache, perhaps that is pre-fetched, is all that's feasible. It shouldn't

RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-12 Thread Chuck Williams
I think there is another problem here. It is currently the Weight implementations that do rewrite(), which requires access to the index, not just to the idf's. E.g., RangeQuery.rewrite() must find the terms in the index within the range. So, the Weight cannot be computed in the MultiSearcher,

Re: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-11 Thread Doug Cutting
Chuck Williams wrote: As Wolf does, I hope a committer with deep knowledge of Lucene's design in this area will weigh in on the issue and help to resolve it. The root of the bug is in MultiSearcher.search(). This should construct a Weight, weight the query, then score the now-weighted query.

RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-11 Thread Chuck Williams
This is a nice solution! By having MultiSearcher create the Weight, it can pass itself in as the searcher, thereby allowing the correct docFreq() method to be called. This is similar to what I tried to do with topmostSearcher, but a much better way to do it. I'm still left wondering if having
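
Why it matters which searcher the Weight sees can be shown with a small self-contained sketch. All types here (SketchSearchable, MapSearchable, AggregatingSearcher, IdfSketch) are hypothetical stand-ins rather than Lucene classes, and the idf formula log(numDocs / (docFreq + 1)) + 1 is assumed to match the default similarity of that era.

```java
import java.util.List;
import java.util.Map;

/** Hypothetical minimal view of a searcher: only what idf needs. */
interface SketchSearchable {
    int docFreq(String term);
    int maxDoc();
}

/** Hypothetical sub-index backed by a term -> docFreq map. */
final class MapSearchable implements SketchSearchable {
    private final Map<String, Integer> freqs;
    private final int maxDoc;

    MapSearchable(Map<String, Integer> freqs, int maxDoc) {
        this.freqs = freqs;
        this.maxDoc = maxDoc;
    }

    public int docFreq(String term) { return freqs.getOrDefault(term, 0); }
    public int maxDoc() { return maxDoc; }
}

/** Plays the role of the multi-searcher: sums statistics over all sub-indexes. */
final class AggregatingSearcher implements SketchSearchable {
    private final List<SketchSearchable> subs;

    AggregatingSearcher(List<SketchSearchable> subs) { this.subs = subs; }

    public int docFreq(String term) {
        int sum = 0;
        for (SketchSearchable s : subs) sum += s.docFreq(term);
        return sum;
    }

    public int maxDoc() {
        int sum = 0;
        for (SketchSearchable s : subs) sum += s.maxDoc();
        return sum;
    }
}

final class IdfSketch {
    /** Assumed idf shape: log(numDocs / (docFreq + 1)) + 1. */
    static double idf(SketchSearchable searcher, String term) {
        return Math.log(searcher.maxDoc() / (double) (searcher.docFreq(term) + 1)) + 1.0;
    }
}
```

If each sub-index computes idf from its own statistics, the same term gets different weights on different nodes and the merged scores are not comparable; passing the aggregating searcher to the weight computation yields a single idf for all of them.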

Re: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-11 Thread Doug Cutting
Chuck Williams wrote: This is a nice solution! By having MultiSearcher create the Weight, it can pass itself in as the searcher, thereby allowing the correct docFreq() method to be called. Glad to hear it at least makes sense... Now I hope it works! I'm still left wondering if having

RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-11 Thread Chuck Williams
Doug Cutting wrote: I'm not sure exactly what you mean by distribute the idf information out to the RemoteSearchable. I think one might profitably implement a docFreq() cache in RemoteSearchable. This could be a simple cache, or it could be fairly aggressive, pre-fetching all the
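
The 'simple cache' variant of a docFreq() cache might look like the sketch below. DocFreqCacheSketch and its remote lookup callback are hypothetical; the real RemoteSearchable is not modelled, and the pre-fetching variant is not shown. The sketch assumes the cache lives exactly as long as one searcher, so cached values never go stale.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.ToIntFunction;

/** Hypothetical per-searcher docFreq cache in front of a remote lookup. */
final class DocFreqCacheSketch {
    // Assumption: stands in for the remote call that returns a term's docFreq.
    private final ToIntFunction<String> remoteDocFreq;
    private final Map<String, Integer> cache = new HashMap<>();

    DocFreqCacheSketch(ToIntFunction<String> remoteDocFreq) {
        this.remoteDocFreq = remoteDocFreq;
    }

    int docFreq(String term) {
        Integer hit = cache.get(term);
        if (hit != null) {
            return hit;                           // served locally, no round trip
        }
        int df = remoteDocFreq.applyAsInt(term);  // one round trip per distinct term
        cache.put(term, df);
        return df;
    }
}
```

This sketch is not thread-safe; a shared dispatcher would need synchronization or a concurrent map.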