Joel Barry created LUCENE-4559:
----------------------------------
Summary: PerFieldSimilarityWrapper
Key: LUCENE-4559
URL: https://issues.apache.org/jira/browse/LUCENE-4559
Project: Lucene - Core
Issue Type: Improvement
Affects Versions: 4.0
Reporter: Joel Barry
Priority: Minor
This issue requests that the documentation be clarified for the current
behavior of queryNorm() and coord() on PerFieldSimilarityWrapper, and
that support be added for the use case described below.
The documentation for PerFieldSimilarityWrapper (Lucene 4.0) says:
{noformat}
Subclasses should implement get(String) to return an appropriate
Similarity (for example, using field-specific parameter values) for
the field.
{noformat}
This is misleading because of the behavior of queryNorm() and
coord(). The Similarity returned from get() is not consulted for these
methods. Instead, the PerFieldSimilarityWrapper subclass's own methods are
called. I understand that this is because these methods apply to the
query as a whole rather than per field. However, consider the
following. A PerFieldSimilarityWrapper with no per-field behavior (it just
returns DefaultSimilarity from get()) scores differently from
DefaultSimilarity itself:
{noformat}
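import static org.junit.Assert.assertEquals;

import java.io.IOException;

import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.similarities.DefaultSimilarity;
import org.apache.lucene.search.similarities.PerFieldSimilarityWrapper;
import org.apache.lucene.search.similarities.Similarity;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;
import org.junit.Test;

// Wrapper with no per-field behavior: every field gets DefaultSimilarity.
class MyPerFieldSimilarity1 extends PerFieldSimilarityWrapper {
  @Override
  public Similarity get(String name) {
    return new DefaultSimilarity();
  }
}

public class PerFieldSimilarityWrapperTest {
  // Indexes one document, runs one TermQuery, and returns the top score.
  private float runQuery(Similarity similarity) throws IOException {
    IndexWriterConfig config = new IndexWriterConfig(
        Version.LUCENE_40, new WhitespaceAnalyzer(Version.LUCENE_40));
    config.setSimilarity(similarity);
    Directory dir = new RAMDirectory();
    IndexWriter writer = new IndexWriter(dir, config);
    Document doc = new Document();
    doc.add(new TextField("A-field", "first", Store.YES));
    writer.addDocument(doc);
    writer.commit();
    IndexReader reader = DirectoryReader.open(dir);
    IndexSearcher searcher = new IndexSearcher(reader);
    searcher.setSimilarity(similarity);
    TermQuery query = new TermQuery(new Term("A-field", "first"));
    TopDocs topDocs = searcher.search(query, 1);
    return topDocs.scoreDocs[0].score;
  }

  @Test
  public void testSimple() throws Exception {
    float score1 = runQuery(new DefaultSimilarity());
    float score2 = runQuery(new MyPerFieldSimilarity1());
    assertEquals(score1, score2, 0.0001);
    // java.lang.AssertionError:
    // expected:<0.3068528175354004> but was:<0.09415864944458008>
  }
}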
{noformat}
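The two scores in the assertion message are consistent with the wrapper's queryNorm() returning 1 instead of DefaultSimilarity's 1/sqrt(sumOfSquaredWeights). The following plain-Java sketch (no Lucene dependency) reproduces the arithmetic for this one-document, one-term case. The class and method names are made up for illustration, and the claim that the wrapper's inherited queryNorm() returns 1 is an assumption about the 4.0 base Similarity class, not something verified here against the source.

```java
// Illustrative arithmetic only; QueryNormArithmetic is a hypothetical name,
// not part of Lucene.
class QueryNormArithmetic {

  // idf in the DefaultSimilarity style: 1 + ln(numDocs / (docFreq + 1)).
  static double idf(long docFreq, long numDocs) {
    return 1.0 + Math.log(numDocs / (double) (docFreq + 1));
  }

  // Classic TF-IDF score for a single-term query (coord is 1 either way):
  // score = (idf * queryNorm) * (tf * idf * fieldNorm).
  static double score(double idf, double queryNorm, double tf, double fieldNorm) {
    double queryWeight = idf * queryNorm;
    double fieldWeight = tf * idf * fieldNorm;
    return queryWeight * fieldWeight;
  }

  public static void main(String[] args) {
    double idf = idf(1, 1);   // one document in the index, and it contains the term
    double tf = 1.0;          // sqrt(freq) with freq = 1
    double fieldNorm = 1.0;   // one-token field, no boost

    // DefaultSimilarity: queryNorm = 1 / sqrt(sumOfSquaredWeights) = 1 / idf.
    double withDefaultNorm = score(idf, 1.0 / Math.sqrt(idf * idf), tf, fieldNorm);

    // Assumption: the wrapper does not forward queryNorm(), so it is 1.
    double withUnitNorm = score(idf, 1.0, tf, fieldNorm);

    System.out.println(withDefaultNorm); // approximately 0.30685 (idf)
    System.out.println(withUnitNorm);    // approximately 0.09416 (idf squared)
  }
}
```

Under these assumptions the first value equals idf and the second equals idf squared, which matches the expected and actual scores reported by the failing test.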
One workaround is to override both methods and forward them to the
Similarity returned by get(), e.g.:
{noformat}
class MyPerFieldSimilarity1 extends PerFieldSimilarityWrapper {
  @Override
  public Similarity get(String name) {
    return new DefaultSimilarity();
  }

  @Override
  public float coord(int overlap, int maxOverlap) {
    return get("dummy").coord(overlap, maxOverlap);
  }

  @Override
  public float queryNorm(float valueForNormalization) {
    return get("dummy").queryNorm(valueForNormalization);
  }
}
{noformat}
However, coord() and queryNorm() receive no information about the
fields used in the query, hence the "dummy" argument.
Suppose an application arranges documents so that there are two
distinct field groupings:
{noformat}
Document:
  A-field1
  A-field2
  A-field3
  B-field1
  B-field2
  B-field3
{noformat}
The application creates queries that use the A fields or the B
fields, but never both A and B in the same query. It then seems
reasonable that PerFieldSimilarityWrapper should provide a way for
queryNorm() and coord() to operate on these sets of fields. This
cannot be done with the current implementation.
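To make the request concrete, here is a minimal plain-Java sketch of the lookup such a feature would need: given the field names used by a query, determine the single field group they belong to and select that group's query-level settings. Nothing here is Lucene API. FieldGroupResolver, groupOfField, groupOfQuery, and select are hypothetical names, and the "prefix before the first dash" convention is an assumption taken from the A-/B- field names above.

```java
import java.util.List;
import java.util.Map;

// Hypothetical helper, not a Lucene class: resolves which field group a
// query belongs to so that group-specific coord()/queryNorm() could be chosen.
class FieldGroupResolver {

  // "A-field1" -> "A". Assumes the group is the prefix before the first '-'.
  static String groupOfField(String field) {
    int dash = field.indexOf('-');
    if (dash <= 0) {
      throw new IllegalArgumentException("no group prefix in field: " + field);
    }
    return field.substring(0, dash);
  }

  // Returns the one group shared by all query fields; rejects mixed queries.
  static String groupOfQuery(List<String> queryFields) {
    if (queryFields.isEmpty()) {
      throw new IllegalArgumentException("query uses no fields");
    }
    String group = groupOfField(queryFields.get(0));
    for (String field : queryFields) {
      if (!groupOfField(field).equals(group)) {
        throw new IllegalArgumentException("query mixes field groups: " + queryFields);
      }
    }
    return group;
  }

  // Picks the per-group value (for example a Similarity) for this query.
  static <T> T select(Map<String, T> perGroup, List<String> queryFields) {
    String group = groupOfQuery(queryFields);
    T value = perGroup.get(group);
    if (value == null) {
      throw new IllegalArgumentException("no entry for group: " + group);
    }
    return value;
  }
}
```

With something like this available at query-normalization time, the forwarding overrides shown earlier could pass a real group rather than "dummy". The obstacle described in this issue is that coord() and queryNorm() are not given the query's fields at all.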