Re: IndexDocValues and storing Stats

Hany Azzam Wed, 04 Jan 2012 06:59:04 -0800

Hi Simon,

Thank you for your reply. The document length is just one example of what I need
to store. Another stat that I need is a *normalised* sum of the TFs. I can
compute this with my own cache during retrieval by extending SimilarityBase and
storing the values in a cache that is consulted whenever the score method is
invoked. However, I am trying to push this into the index to make it more
efficient and, as I said earlier, I haven't found a way to do this yet.


With regard to document length (DL), yes, you are right, but unfortunately
Lucene doesn't provide the raw (real) document length, as far as I know. It
only provides the encoded/decoded DL. From what I have read on the forum, and
from my own experiments, the difference in quality between a similarity
function implemented with the raw DL and the same function implemented with
Lucene's exposed (encoded/decoded) DL is not statistically significant.
However, I still prefer to use the raw DL, which is why I compute the sum of
the TFs in a document and cache it.
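
A minimal sketch of the caching idea above, in plain Java with no Lucene
dependency. The class RawDocLengthCache and all of its methods are hypothetical
names used only for illustration, and the sketch assumes the document has
already been tokenised. It treats the raw DL as the sum of the TFs; the
normalisation step is left out because it depends on the model being used.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Lucene: caches the raw document length,
// defined here as the sum of all term frequencies in the document.
final class RawDocLengthCache {
    private final Map<Integer, Long> lengthByDoc = new HashMap<>();

    // Counts how often each term occurs in an already-tokenised document.
    static Map<String, Integer> termFrequencies(List<String> tokens) {
        Map<String, Integer> tf = new HashMap<>();
        for (String token : tokens) {
            tf.merge(token, 1, Integer::sum);
        }
        return tf;
    }

    // Raw document length = sum of the term frequencies.
    static long sumOfTermFrequencies(Map<String, Integer> tf) {
        long sum = 0;
        for (int freq : tf.values()) {
            sum += freq;
        }
        return sum;
    }

    // Computes and stores the raw length for one document.
    void put(int docId, List<String> tokens) {
        lengthByDoc.put(docId, sumOfTermFrequencies(termFrequencies(tokens)));
    }

    // Returns the cached raw length, or -1 if the document is unknown.
    long rawLength(int docId) {
        return lengthByDoc.getOrDefault(docId, -1L);
    }
}
```

A scoring method could then look up rawLength(docId) instead of decoding the
norm, at the cost of keeping the map in memory.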

h.


On 4 Jan 2012, at 14:37, Simon Willnauer wrote:

> Hey,
> 
> On Wed, Jan 4, 2012 at 1:15 PM, Hany Azzam <[email protected]> wrote:
>> Hi,
>> 
>> I am experimenting with the Lucene trunk (aka 4.0), especially with the new 
>> IndexDocValues feature. I am trying to store some query-independent 
>> statistics such as PageRank, etc. One stat that I am trying to store is the 
>> sum of all the term frequencies in a document. This can be seen as the 
>> document length. Is there a way to pre-compute this sum while performing the 
>> indexing?
> 
> Lucene already computes the length of the document in its
> FieldInvertedState, which is passed to the similarity, i.e. look at
> Similarity#computeNorms. Currently the norm value is a single byte,
> but I am working on exposing this via DocValues so you can store
> custom data in your similarity.
> 
> simon
>> 
>> Thank you,
>> h.
>> 
>> 
>> 
>>> TermVectors are still available in Lucene trunk aka 4.0, we just changed 
>>> the implementation of them to fit the general Lucene Terms/Fields/… APIs. 
>>> TermVectors (if enabled in the document during indexing) are simply 
>>> something like a small index per document written to disk like a stored 
>>> field (it has nothing to do with DocValues, because you mentioned this). 
>>> Theoretically, you can execute a query against the small “TermVectors 
>>> Index” and get exactly one hit or no hit, if the query matches this 
>>> document. This is e.g. used for highlighting if TV are enabled. To support 
>>> this “TV as a small index”, the old API was removed and the new TermVectors 
>>> API returns the same Terms/TermsEnum/DocsEnum APIs like IndexReader for a 
>>> complete index, but all structures simply return one document (ID=0) and 
>>> corresponding term frequencies/doc frequencies.
>>> 
>>> For example code showing how to use it, review the Lucene test cases. Some
>>> examples:
>>> 
>>>     Terms result = reader.getTermVectors(docId).terms(DocHelper.TEXT_FIELD_2_KEY);
>>>     assertNotNull(result);
>>>     assertEquals(3, result.getUniqueTermCount());
>>>     TermsEnum termsEnum = result.iterator(null);
>>>     while(termsEnum.next() != null) {
>>>       String term = termsEnum.term().utf8ToString();
>>>       int freq = (int) termsEnum.totalTermFreq();
>>>       assertTrue(freq > 0);
>>>     }
>>> 
>>>     Fields results = reader.getTermVectors(docId);
>>>     assertTrue(results != null);
>>>     assertEquals("We do not have 3 term freq vectors", 3, results.getUniqueFieldCount());
>>> 
>>> Uwe
>>> 
>>> -----
>>> Uwe Schindler
>>> H.-H.-Meier-Allee 63, D-28213 Bremen
>>> http://www.thetaphi.de
>>> eMail: [email protected]
>>> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
> 


