Hey,

On Wed, Jan 4, 2012 at 1:15 PM, Hany Azzam <h...@eecs.qmul.ac.uk> wrote:
> Hi,
>
> I am experimenting with the Lucene trunk (aka 4.0), especially with the new
> IndexDocValues feature. I am trying to store some query-independent
> statistics such as PageRank, etc. One stat that I am trying to store is the
> sum of all the term frequencies in a document. This can be seen as the
> document length. Is there a way to pre-compute this sum while performing the
> indexing?

Lucene already computes the length of the document in its
FieldInvertedState, which is passed to the similarity; have a look at
Similarity#computeNorms. Currently the norm value is a single byte, but I am
working on exposing this via DocValues so you can store custom data in your
similarity.

simon

>
> Thank you,
> h.
>
>
>
>> TermVectors are still available in Lucene trunk aka 4.0, we just changed the
>> implementation of them to fit the general Lucene Terms/Fields/… APIs.
>> TermVectors (if enabled in the document during indexing) are simply
>> something like a small index per document written to disk like a stored
>> field (it has nothing to do with DocValues, because you mentioned this).
>> Theoretically, you can execute a query against the small “TermVectors Index”
>> and get exactly one hit or no hit, if the query matches this document. This
>> is e.g. used for highlighting if TV are enabled. To support this “TV as a
>> small index”, the old API was removed and the new TermVectors API returns
>> the same Terms/TermsEnum/DocsEnum APIs like IndexReader for a complete
>> index, but all structures simply return one document (ID=0) and
>> corresponding term frequencies/doc frequencies.
>>
>> To have some example code how to use it, review the Lucene testcases, some
>> example:
>>
>> Terms result =
>>     reader.getTermVectors(docId).terms(DocHelper.TEXT_FIELD_2_KEY);
>> assertNotNull(result);
>> assertEquals(3, result.getUniqueTermCount());
>> TermsEnum termsEnum = result.iterator(null);
>> while(termsEnum.next() != null) {
>>   String term = termsEnum.term().utf8ToString();
>>   int freq = (int) termsEnum.totalTermFreq();
>>   assertTrue(freq > 0);
>> }
>>
>> Fields results = reader.getTermVectors(docId);
>> assertTrue(results != null);
>> assertEquals("We do not have 3 term freq vectors", 3,
>>     results.getUniqueFieldCount());
>>
>> Uwe
>>
>> -----
>> Uwe Schindler
>> H.-H.-Meier-Allee 63, D-28213 Bremen
>> http://www.thetaphi.de
>> eMail: u...@thetaphi.de
>>
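
To make the statistic from the original question concrete, here is a minimal
sketch in plain Java of "document length as the sum of all term frequencies".
It deliberately does not use Lucene at all: the class and method names
(DocLengthSketch, termFrequencies, documentLength) are made up for
illustration, and splitting on spaces is only an assumed stand-in for a real
analyzer. The same summation could in principle be applied to the per-term
frequencies read from a term vector, but that part is not shown here.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class DocLengthSketch {

    // Count how often each term occurs in an already tokenized document.
    // (Hypothetical helper, not a Lucene API.)
    public static Map<String, Integer> termFrequencies(String[] tokens) {
        Map<String, Integer> freqs = new LinkedHashMap<>();
        for (String token : tokens) {
            freqs.merge(token, 1, Integer::sum);
        }
        return freqs;
    }

    // Document length, defined as the sum of all term frequencies.
    public static long documentLength(Map<String, Integer> freqs) {
        long sum = 0;
        for (int freq : freqs.values()) {
            sum += freq;
        }
        return sum;
    }

    public static void main(String[] args) {
        // Assumption: whitespace tokenization stands in for real analysis.
        String[] tokens = "to be or not to be".split(" ");
        Map<String, Integer> freqs = termFrequencies(tokens);
        System.out.println(freqs);                 // {to=2, be=2, or=1, not=1}
        System.out.println(documentLength(freqs)); // 6
    }
}
```

Note that the sum equals the number of tokens in the document, which is why
it can be read as the document length.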