I think that a more reasonable approach for experiments like this is to compute the statistics you want as part of the indexing process and store them in a side file. That will give you complete flexibility to do what you need.
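To make that concrete, here is a minimal sketch of how such index-time bookkeeping might look. It is not a Lucene API: CollectionTermStats and its methods are hypothetical names, and it assumes you feed it the same analyzed tokens that you hand to the indexer, and that tokens contain no tabs or newlines. It keeps a running total per term across all documents and can write the totals to a plain-text side file.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper (not part of Lucene): collection-wide term counts. */
class CollectionTermStats {
    // term -> number of occurrences across every document seen so far
    private final Map<String, Long> totalFreq = new HashMap<>();

    /** Call once per document, with the tokens that are also being indexed. */
    void addDocument(Iterable<String> tokens) {
        for (String token : tokens) {
            totalFreq.merge(token, 1L, Long::sum);
        }
    }

    /** Total occurrences of the term over the whole collection (0 if unseen). */
    long totalFrequency(String term) {
        return totalFreq.getOrDefault(term, 0L);
    }

    /** Writes one "term<TAB>count" line per term, sorted by term. */
    void save(Path sideFile) throws IOException {
        try (BufferedWriter out = Files.newBufferedWriter(sideFile, StandardCharsets.UTF_8)) {
            for (Map.Entry<String, Long> e : new TreeMap<>(totalFreq).entrySet()) {
                out.write(e.getKey() + "\t" + e.getValue());
                out.newLine();
            }
        }
    }

    /** Reads a side file written by save(); lines without a tab are skipped. */
    static CollectionTermStats load(Path sideFile) throws IOException {
        CollectionTermStats stats = new CollectionTermStats();
        try (BufferedReader in = Files.newBufferedReader(sideFile, StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                int tab = line.lastIndexOf('\t');
                if (tab < 0) {
                    continue;
                }
                stats.totalFreq.put(line.substring(0, tab),
                        Long.parseLong(line.substring(tab + 1)));
            }
        }
        return stats;
    }
}
```

The counts cost one map update per token during indexing, so you never have to walk every posting list afterwards. At search time you would load the side file once and consult totalFrequency(term) from whatever scoring code you are experimenting with.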
Then at retrieval time you can look up term-level information and pass it into a custom similarity function. That leaves you with a clean barrier between index time and search time, but still gives you any information that you might like to use.

Having your data in a side file helps you avoid having to deal with those aspects of Lucene that are highly oriented around efficiency. Those are good in their place, but could make your research work much more difficult in the exploratory phase.

On Sun, Aug 9, 2009 at 8:45 PM, K. M. McCormick <[email protected]> wrote:

> Hello There:
>
> I am currently working on an INDEX STAT GENERATOR I'd like to use for some
> term-weight tests in a (rather large) Lucene Index. In general, the stats
> I'm hoping to work with are based on a term's frequency across the entire
> indexed document set.
>
> TFIDF easily works in Lucene's searcher - and you can get access to a Term's
> DF (across all documents, obviously) quite easily. However, TF in Lucene
> seems limited to a by-document basis. Meaning, to generate the number of
> times this term has appeared in the indexed document set, I would have to
> (hypothetically) do the following:
>
> - Given Term t, find TF(t)
> - Get the enumeration of t over the index - TermDocs (so I have doc, freq
>   pairings)
> - For each (doc, freq) pair, add freq to the total-index-frequency
>
> So if I have x terms, I would be iterating through x*TF(t) for the entire
> index to find out the index-frequency for all terms. Is this the only method
> of getting this information?
>
> Since my data set (and term set) are quite large, I was trying to find if
> there was another mechanism in place for Lucene, either at the indexing or
> the searching level. However, I've had little luck sifting through the
> information I've gotten (mostly points me to TFIDF) to find out if Lucene
> has something I can use to make this process faster.
>
> I have also read a bit about TermVectors, but those seem by-document as
> well.
>
> If there isn't a method at the search level (or,
> after-index-complete-level), I would be willing to accept the overhead of
> generating these stats at indexing time, if that would be more efficient...
>
> Thanks,
> drago

--
Ted Dunning, CTO
DeepDyve
