It seems like there should be a formula for estimating the total number of unique terms, given the unique term count of each segment and certain assumptions such as random distribution of documents across segments.
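One rough way to get such an estimate is sketched below. It is not an established Lucene formula, and it relies on assumptions beyond what is stated above: that the size of each segment (document or token count) is known in addition to its unique term count, and that vocabulary growth follows Heaps' law, V(n) = K * n^beta. If documents are spread randomly across segments, each segment is a random sample of the same collection, so the (size, unique terms) pairs can be fit with a log-log least-squares line and extrapolated to the combined size. The class and method names are hypothetical.

```java
// Hypothetical helper, not part of Lucene. Estimates the index-wide unique
// term count by fitting Heaps' law, V(n) = K * n^beta, to per-segment stats.
final class UniqueTermEstimator {

    /**
     * @param segmentSizes       size of each segment (document or token count);
     *                           assumed to be known, which the thread does not state
     * @param segmentUniqueTerms unique term count of each segment
     * @return estimated unique term count across all segments
     */
    static double estimate(long[] segmentSizes, long[] segmentUniqueTerms) {
        int k = segmentSizes.length;
        if (k < 2 || k != segmentUniqueTerms.length) {
            throw new IllegalArgumentException("need matching stats for at least two segments");
        }

        double sumX = 0, sumY = 0, sumXX = 0, sumXY = 0;
        long totalSize = 0, sumUnique = 0, maxUnique = 0;
        for (int i = 0; i < k; i++) {
            if (segmentSizes[i] <= 0 || segmentUniqueTerms[i] <= 0) {
                throw new IllegalArgumentException("sizes and counts must be positive");
            }
            double x = Math.log(segmentSizes[i]);       // log n_i
            double y = Math.log(segmentUniqueTerms[i]); // log u_i
            sumX += x;
            sumY += y;
            sumXX += x * x;
            sumXY += x * y;
            totalSize += segmentSizes[i];
            sumUnique += segmentUniqueTerms[i];
            maxUnique = Math.max(maxUnique, segmentUniqueTerms[i]);
        }

        // Ordinary least squares on log u = log K + beta * log n.
        double denom = k * sumXX - sumX * sumX;
        if (Math.abs(denom) < 1e-12) {
            throw new IllegalArgumentException("need at least two distinct segment sizes");
        }
        double beta = (k * sumXY - sumX * sumY) / denom;
        double logK = (sumY - beta * sumX) / k;

        // Extrapolate the fitted curve to the combined size of all segments.
        double raw = Math.exp(logK + beta * Math.log(totalSize));

        // The true value always lies between the largest per-segment count
        // (complete overlap) and the sum of all counts (no overlap).
        return Math.max(maxUnique, Math.min(sumUnique, raw));
    }
}
```

For example, calling UniqueTermEstimator.estimate(new long[] {1000, 4000}, new long[] {100, 200}) fits beta = 0.5 and extrapolates to a combined size of 5000. The clamp at the end only enforces the bounds that hold regardless of the model; the quality of the estimate itself depends on how well Heaps' law and the random-distribution assumption fit the actual index.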

-Yonik
http://www.lucidimagination.com

On Thu, May 27, 2010 at 9:17 PM, kannan chandrasekaran <ckanna...@yahoo.com> wrote:
> I am just trying out a few experiments to calculate similarity between terms
> based on their co-occurrences in the dataset. Basically, I am trying to
> build contextual vectors and calculate similarity using a similarity measure
> (say, cosine similarity).
>
> I don't think this is an XY problem. The vectors I am trying to build are not
> the same as the TermVectors option ((term, freq) pairs per document) in
> Lucene (if that's what you meant).
>
> Thanks
> Kannan

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org