Most efficient way to find related terms

Martin Bayly Fri, 29 Feb 2008 12:16:03 -0800

I'm wondering what the most efficient approach is to finding all terms
which are related to a specific term X.


 

By related I mean terms which appear in a specific document field that
also contains the target term X.

 

e.g. Document has a keyword field, field1 that can contain multiple
keywords.

 

document1 - field1 HAS key1, key2, key3

document2 - field1 HAS key2, key4

document3 - field1 HAS key5

 

If I want to find terms related to key2, I need to return key1, key3,
key4

 

Obviously I can do a search for key2, iterate all the docs and collect
there field1 terms manually.

 

But presumably a more efficient way is to use TermDocs:

 

1. TermDocs termdocs = IndexReader.termDocs(new Term("field1", "key2")

2. Iterate term docs to get documents containing that term 

3. Now this is the bit I'm not sure of:

a. I could call Document doc = IndexReader.document(n), but that will
load all fields and I only want the field1

b. Presumably better to call Document doc = IndexReader.document(n,
fieldSelector)

c. Or would I be better to turn on TermFrequencyVectors for this field
so I can call IndexReader.getTermFrequencyVector(n, "field1") - don't
particularly care about the frequencies as it will always be 1 for a
particular doc.

 

Other approaches?

 

I'm going to perf test to see how (b) and (c) compare but would be glad
if anyone has any insights.

 

Thanks

Martin

Most efficient way to find related terms

Reply via email to