Hello there,

I am currently working on an index stat generator that I'd like to use for some term-weight tests on a (rather large) Lucene index. In general, the stats I'm hoping to work with are based on a term's frequency across the entire indexed document set.
TF-IDF works easily in Lucene's searcher, and you can get a term's DF (across all documents, obviously) quite easily. However, TF in Lucene seems limited to a per-document basis. Meaning, to get the number of times a term appears in the whole indexed document set, I would have to (hypothetically) do the following:

- Given a Term t, whose index-wide frequency TF(t) I want
- Get the enumeration of t over the index: TermDocs (so I have doc, freq pairings)
- For each (doc, freq) pair, add freq to the total-index-frequency

So if I have x terms, I would be walking every (doc, freq) pair of every term, i.e. roughly the sum of DF(t) over all x terms, to find the index-frequency for all terms.

Is this the only method of getting this information? Since my data set (and term set) are quite large, I was trying to find out whether Lucene has another mechanism in place, either at the indexing or the searching level. However, I've had little luck sifting through the information I've found (most of it points me to TF-IDF) to see if Lucene has something I can use to make this process faster. I have also read a bit about TermVectors, but those seem to be per-document as well.

If there isn't a method at the search level (or after the index is complete), I would be willing to accept the overhead of generating these stats at indexing time, if that would be more efficient...

Thanks,
drago
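
P.S. To make the loop I have in mind concrete, here is a minimal sketch in plain Java. It does not use any Lucene classes: the nested map is a hypothetical stand-in for the index (term -> docId -> freq), and the class and method names (CollectionFrequencySketch, sampleIndex, postings, totalFrequency, totalFrequencies) are made up for illustration. If I am reading the API right, the Lucene side would come from IndexReader.terms() and IndexReader.termDocs(Term), but I have not wired that in here.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java model of the summing loop; the maps stand in for Lucene's
// term enumeration and (doc, freq) postings. All names are hypothetical.
class CollectionFrequencySketch {

    // Builds one postings list from (docId, freq) pairs given in order.
    static Map<Integer, Integer> postings(int... docFreqPairs) {
        Map<Integer, Integer> p = new LinkedHashMap<>();
        for (int i = 0; i + 1 < docFreqPairs.length; i += 2) {
            p.put(docFreqPairs[i], docFreqPairs[i + 1]);
        }
        return p;
    }

    // Tiny made-up index: term -> (docId -> freq within that doc).
    static Map<String, Map<Integer, Integer>> sampleIndex() {
        Map<String, Map<Integer, Integer>> index = new TreeMap<>();
        index.put("lucene", postings(0, 3, 2, 2));
        index.put("index", postings(0, 1, 1, 1, 2, 4));
        index.put("stat", postings(1, 2));
        return index;
    }

    // Sums freq over every (doc, freq) pair of a single term.
    static long totalFrequency(Map<Integer, Integer> termPostings) {
        long total = 0;
        for (int freq : termPostings.values()) {
            total += freq;
        }
        return total;
    }

    // Repeats the sum for every term, so the work is one step per posting.
    static Map<String, Long> totalFrequencies(Map<String, Map<Integer, Integer>> index) {
        Map<String, Long> totals = new TreeMap<>();
        for (Map.Entry<String, Map<Integer, Integer>> e : index.entrySet()) {
            totals.put(e.getKey(), totalFrequency(e.getValue()));
        }
        return totals;
    }

    public static void main(String[] args) {
        System.out.println(totalFrequencies(sampleIndex()));
    }
}
```

The inner loop touches each posting exactly once, which is the cost I am hoping to avoid (or at least pay only once, at indexing time) on the real index.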
