Doug Cutting wrote:
David Spencer wrote:
I worked w/ Chuck to set up a test page that shows search results with 2 versions of Similarity side by side.
David,
This looks great! Thanks for doing this.
Thank you...it involved lots of back & forth w/ Chuck over the past few days to get it to the state it's at.
Is the default operator AND or OR? It appears to be OR, but it should probably be AND. That's become the industry standard since QueryParser was first written. Also, any chance we can get explanations for hits?
I agree w/ defaulting to AND.
It is difficult to decipher what's doing what. I think we should separately evaluate query formulation and boosting from changes to tf/idf.
I agree too in doing this stepwise.
And the "good news" is that the wikipedia titles are so short (usually 1 word) that they usually don't give that much info anyway - though I guess I don't know for sure if the avg # of words in a wikipedia title is less than in the avg www title.
We ought to first compare searching body only, ignoring titles, then subsequently try different query formulations over multiple fields with a fixed weighting algorithm. Yes, ignoring titles when searching wikipedia might not be the best approach, but the point is not to over-optimize for wikipedia but rather to find algorithms that work well with general text collections. Removing titles makes the problem harder, which should in turn make it easier to see deficiencies.
Yes
Simpler yet, we ought to first try body-only with no proximity, just AND, in order to select good tf/idf formulations. Then we should add auto-proximity searching into the mix, and finally add multiple fields. Does this make sense?
MultiFieldQueryParser is known to be deficient. A better general-purpose multi-field query formulator might be like that used by Nutch. It would translate a query "t1 t2" given fields f1 and f2 into something like:
+(f1:t1^b1 f2:t1^b2) +(f1:t2^b1 f2:t2^b2) f1:"t1 t2"~s1^b3 f2:"t1 t2"~s2^b4
But what is right for the phrases if there are > 2 terms? Does it have a phrase for every pair of terms, like this (ignore fields, boosts, and proximity for a sec):
search for "t1 t2 t3" gives you these phrases in addition to the direct field matches:
"t1 t2" "t2 t3" "t1 t3"
Where b1 and b2 are boosts for f1 and f2, and b3 and b4 are boosts for phrase matching in f1 and f2, and s1 and s2 are slop for f1 and f2. We'd really only need to vary b1 and b3, and could use 1.0 for b2 and b4 and infinity for s1 and s2.
Do folks agree that this is a good general formulation? If so, would someone like to contribute a version of MultiFieldQueryParser that implements this?
I might already have this done, just confirm the above question re > 2 terms.
The API should probably be something like:
static Query parse(String queryString, String[] fields, float[] boolBoosts, float[] phraseBoosts, int[] slops);
A simplified version might be:
static Query parse(String queryString, String[] fields, float[] boosts);
This could use infinity for slops and assume boolBoosts[i] == phraseBoosts[i].
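To make the proposed formulation concrete, here is a minimal sketch in plain Java. It is not the Nutch code and not a patch to MultiFieldQueryParser: the class name MultiFieldFormulator and the method name formulate are hypothetical, and it returns the query in Lucene query syntax as a String rather than building a Query object. It also assumes whitespace-separated terms with no escaping, stands in Integer.MAX_VALUE for "infinity" slop, and, for queries of more than 2 terms, uses the whole query as a single sloppy phrase per field. That last choice is only one possible answer to the > 2 terms question above, not a confirmed one.

```java
import java.util.Arrays;

// Hypothetical sketch: emits Lucene query syntax as a String, not a Query object.
class MultiFieldFormulator {

    // Full form: per-field boolean boosts, phrase boosts, and slops.
    static String formulate(String queryString, String[] fields,
                            float[] boolBoosts, float[] phraseBoosts, int[] slops) {
        String trimmed = queryString.trim();
        if (trimmed.isEmpty()) {
            return "";
        }
        // Assumption: terms are split on whitespace and need no escaping.
        String[] terms = trimmed.split("\\s+");
        StringBuilder sb = new StringBuilder();

        // One required clause per term: the term must match in at least one field.
        for (String term : terms) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append("+(");
            for (int i = 0; i < fields.length; i++) {
                if (i > 0) {
                    sb.append(' ');
                }
                sb.append(fields[i]).append(':').append(term)
                  .append('^').append(boolBoosts[i]);
            }
            sb.append(')');
        }

        // One optional sloppy phrase per field, rewarding terms that occur close together.
        // Assumption: the whole query forms a single phrase, even for more than 2 terms.
        if (terms.length > 1) {
            String phrase = String.join(" ", terms);
            for (int i = 0; i < fields.length; i++) {
                sb.append(' ').append(fields[i]).append(":\"").append(phrase)
                  .append("\"~").append(slops[i]).append('^').append(phraseBoosts[i]);
            }
        }
        return sb.toString();
    }

    // Simplified form: "infinite" slop, and boolBoosts[i] == phraseBoosts[i].
    static String formulate(String queryString, String[] fields, float[] boosts) {
        int[] slops = new int[fields.length];
        Arrays.fill(slops, Integer.MAX_VALUE);
        return formulate(queryString, fields, boosts, boosts, slops);
    }

    public static void main(String[] args) {
        System.out.println(formulate("t1 t2", new String[] {"f1", "f2"},
                new float[] {2.0f, 1.0f}, new float[] {3.0f, 1.0f}, new int[] {5, 5}));
    }
}
```

With fields f1 and f2, boolean boosts 2.0 and 1.0, phrase boosts 3.0 and 1.0, and slop 5, the query "t1 t2" comes out as +(f1:t1^2.0 f2:t1^1.0) +(f1:t2^2.0 f2:t2^1.0) f1:"t1 t2"~5^3.0 f2:"t1 t2"~5^1.0, which matches the shape of the formulation above.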
Doug
--------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]