Re: IndexDocValues and storing Stats

Simon Willnauer Wed, 04 Jan 2012 06:37:58 -0800

Hey,

On Wed, Jan 4, 2012 at 1:15 PM, Hany Azzam <h...@eecs.qmul.ac.uk> wrote:
> Hi,
>
> I am experimenting with the Lucene trunk (aka 4.0), especially with the new 
> IndexDocValues feature. I am trying to store some query-independent 
> statistics such as PageRank, etc. One stat that I am trying to store is the 
> sum of all the term frequencies in a document. This can be seen as the 
> document length. Is there a way to pre-compute this sum while performing the 
> indexing?


Lucene already computes the length of the document in its
FieldInvertState, which is passed to the similarity, e.g. look at
Similarity#computeNorm. Currently the norm value is a single byte,
but I am working on exposing this via DocValues so you can store
custom data in your similarity.
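
To illustrate why the length counted at indexing time is the number you
are after: the sum of all term frequencies in a document is the same as
the number of tokens seen while inverting it. Below is a minimal,
library-independent sketch in plain Java. It does not use any Lucene
class; the names DocLengthSketch, termFrequencies and documentLength are
made up for illustration, and details such as position increments or
overlapping tokens are ignored.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; not part of Lucene.
final class DocLengthSketch {

    // Count how often each term occurs in one already-tokenized document.
    static Map<String, Integer> termFrequencies(List<String> tokens) {
        Map<String, Integer> tf = new LinkedHashMap<>();
        for (String token : tokens) {
            tf.merge(token, 1, Integer::sum);
        }
        return tf;
    }

    // Document length defined as the sum of all term frequencies.
    static int documentLength(Map<String, Integer> tf) {
        int sum = 0;
        for (int freq : tf.values()) {
            sum += freq;
        }
        return sum;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("to", "be", "or", "not", "to", "be");
        Map<String, Integer> tf = termFrequencies(tokens);
        // The sum of term frequencies equals the number of tokens.
        System.out.println(documentLength(tf) + " == " + tokens.size());
    }
}
```

So a per-document token count taken once during indexing gives the same
value as summing the term frequencies afterwards.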

simon
>
> Thank you,
> h.
>
>
>
>> TermVectors are still available in Lucene trunk (aka 4.0); we just changed
>> their implementation to fit the general Lucene Terms/Fields/… APIs.
>> TermVectors (if enabled in the document during indexing) are simply
>> something like a small index per document, written to disk like a stored
>> field (they have nothing to do with DocValues, which you mentioned).
>> Theoretically, you can execute a query against the small “TermVectors Index”
>> and get exactly one hit if the query matches this document, or no hit
>> otherwise. This is used, e.g., for highlighting if TVs are enabled. To
>> support this “TV as a small index”, the old API was removed and the new
>> TermVectors API returns the same Terms/TermsEnum/DocsEnum APIs as
>> IndexReader does for a complete index, but all structures simply return
>> one document (ID=0) and the corresponding term frequencies/doc frequencies.
>>
>> For example code showing how to use it, see the Lucene test cases. Some
>> examples:
>>
>>     Terms result = reader.getTermVectors(docId).terms(DocHelper.TEXT_FIELD_2_KEY);
>>     assertNotNull(result);
>>     assertEquals(3, result.getUniqueTermCount());
>>     TermsEnum termsEnum = result.iterator(null);
>>     while(termsEnum.next() != null) {
>>       String term = termsEnum.term().utf8ToString();
>>       int freq = (int) termsEnum.totalTermFreq();
>>       assertTrue(freq > 0);
>>     }
>>
>>     Fields results = reader.getTermVectors(docId);
>>     assertTrue(results != null);
>>     assertEquals("We do not have 3 term freq vectors", 3, results.getUniqueFieldCount());
>>
>> Uwe
>>
>> -----
>> Uwe Schindler
>> H.-H.-Meier-Allee 63, D-28213 Bremen
>> http://www.thetaphi.de
>> eMail: u...@thetaphi.de
>>

