: A possible solution would be to initialize in turn each document as a
: query, do a search using an IndexSearcher and to take from the search
: result the similarity between the query (which is in fact a document)
: and all the other documents. This is highly redundant, because the
: similarity between a pair of documents is computed multiple times.

A simpler approach that I can think of would be to iterate over a complete
TermEnum of the index, and for each Term, get the corresponding TermDocs
enumerator to list every document that contains that term. Assuming that
every pair of docs initially has a similarity of 0, this would allow you
to increment the similarity of each pair every time you find a term that
multiple docs have in common. (The amount you increment the score for
each pair could be based on TermEnum.docFreq() and TermDocs.freq().)

A very simple approach might be something like...

  // needs: java.util.HashMap, java.util.Map,
  //        org.apache.lucene.index.{IndexReader, TermDocs, TermEnum}

  IndexReader r = ...;
  // scores[lo][hi] accumulates over every term that docs lo and hi share
  int[][] scores = new int[r.maxDoc()][r.maxDoc()];
  TermEnum enumerator = r.terms();
  TermDocs termDocs = r.termDocs();
  // terms() starts out positioned before the first term, so next() comes first
  while (enumerator.next()) {
    termDocs.seek(enumerator.term());
    // doc id -> frequency of the current term in that doc
    Map<Integer, Integer> docs = new HashMap<Integer, Integer>();
    while (termDocs.next()) {
      docs.put(termDocs.doc(), termDocs.freq());
    }
    for (int ii : docs.keySet()) {
      for (int jj : docs.keySet()) {
        if (ii < jj) {
          continue; // do each pair only once
        }
        // jj <= ii here, so only one triangle (plus the diagonal) is filled
        scores[jj][ii] += (docs.get(ii) + docs.get(jj)) / 2;
      }
    }
  }
  termDocs.close();
  enumerator.close();
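
If you want to play with the weighting idea without an index at hand, here
is a rough, self-contained sketch of the same term-at-a-time accumulation.
Everything in it is an assumption made for illustration: the class name
PairwiseOverlap, the in-memory postings map standing in for TermEnum/TermDocs,
and the weighting itself (the smaller of the two term frequencies, scaled by a
simple log-based idf computed from the doc frequency). It is not how Lucene
scores anything, just one way to base the increment on docFreq and freq.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the index: term -> (doc id -> term frequency).
class PairwiseOverlap {

    /**
     * Adds to scores[lo][hi] (lo < hi) for every term that docs lo and hi share.
     * The lower triangle and the diagonal are left at 0.
     */
    static double[][] score(Map<String, Map<Integer, Integer>> postings, int maxDoc) {
        double[][] scores = new double[maxDoc][maxDoc];
        for (Map<Integer, Integer> docs : postings.values()) {
            int docFreq = docs.size();
            // Assumed weighting: terms found in fewer docs count for more.
            double idf = Math.log((double) maxDoc / docFreq) + 1.0;
            List<Map.Entry<Integer, Integer>> entries =
                new ArrayList<Map.Entry<Integer, Integer>>(docs.entrySet());
            // b starts at a + 1, so each unordered pair is visited exactly once.
            for (int a = 0; a < entries.size(); a++) {
                for (int b = a + 1; b < entries.size(); b++) {
                    int da = entries.get(a).getKey();
                    int db = entries.get(b).getKey();
                    int shared = Math.min(entries.get(a).getValue(),
                                          entries.get(b).getValue());
                    scores[Math.min(da, db)][Math.max(da, db)] += shared * idf;
                }
            }
        }
        return scores;
    }
}
```

Calling PairwiseOverlap.score(postings, maxDoc) gives back the matrix; read
the score for docs i and j from scores[min(i,j)][max(i,j)]. Keep in mind the
matrix is maxDoc x maxDoc, so for a large index you would probably want a
sparse structure instead.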