Hi,
I am experimenting with the Lucene trunk (aka 4.0), especially with the new
IndexDocValues feature. I am trying to store some query-independent statistics
such as PageRank, etc. One stat that I am trying to store is the sum of all the
term frequencies in a document. This can be seen as the document length. Is
there a way to pre-compute this sum while performing the indexing?
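To make the quantity concrete, here is a minimal sketch in plain Java of the value I would like to store per document. It does not use any Lucene API: the class and method names are made up for illustration, and whitespace tokenization only stands in for a real analyzer.

```java
import java.util.HashMap;
import java.util.Map;

class DocLengthSketch {

    // Count how often each term occurs in the text.
    // Assumption: lowercased whitespace tokens stand in for analyzer output.
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> freqs = new HashMap<>();
        for (String token : text.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // splitting an empty string yields one empty token
            }
            freqs.merge(token.toLowerCase(), 1, Integer::sum);
        }
        return freqs;
    }

    // Document length, defined as the sum of all term frequencies.
    static long sumOfTermFrequencies(Map<String, Integer> freqs) {
        long sum = 0;
        for (int freq : freqs.values()) {
            sum += freq;
        }
        return sum;
    }

    public static void main(String[] args) {
        Map<String, Integer> freqs = termFrequencies("to be or not to be");
        // 4 unique terms (to, be, or, not), 6 tokens in total
        System.out.println(freqs.size() + " unique terms, length "
                + sumOfTermFrequencies(freqs));
    }
}
```

For "to be or not to be" this gives 4 unique terms and a length of 6, i.e. the sum is simply the number of tokens in the document.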
Thank you,
h.
> TermVectors are still available in Lucene trunk (aka 4.0); we only changed
> their implementation to fit the general Lucene Terms/Fields/… APIs.
> TermVectors (if enabled for the document during indexing) are essentially a
> small per-document index written to disk, much like a stored field. Since you
> mentioned DocValues: TermVectors have nothing to do with them. Theoretically,
> you can execute a query against this small “TermVectors index” and get exactly
> one hit if the query matches the document, or no hit otherwise. This is used,
> for example, for highlighting when TermVectors are enabled. To support this
> “TermVectors as a small index” model, the old API was removed, and the new
> TermVectors API returns the same Terms/TermsEnum/DocsEnum APIs that
> IndexReader provides for a complete index, except that all structures contain
> just one document (ID=0) with its corresponding term frequencies/doc
> frequencies.
>
> For example code showing how to use it, see the Lucene test cases. For
> example:
>
> // Term vector of one field of the document
> Terms result = reader.getTermVectors(docId).terms(DocHelper.TEXT_FIELD_2_KEY);
> assertNotNull(result);
> assertEquals(3, result.getUniqueTermCount());
> TermsEnum termsEnum = result.iterator(null);
> while (termsEnum.next() != null) {
>   String term = termsEnum.term().utf8ToString();
>   // In a term vector this is the term's frequency within this one document
>   int freq = (int) termsEnum.totalTermFreq();
>   assertTrue(freq > 0);
> }
>
> Fields results = reader.getTermVectors(docId);
> assertTrue(results != null);
> assertEquals("We do not have 3 term freq vectors", 3, results.getUniqueFieldCount());
>
> Uwe
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
>