To get stats from the whole index I think you need to come at this from a different direction. See the 4.0 migration guide for some details.
With a variation on your code and 2 docs

  doc1: foobar qux quote
  doc2: foobar qux qux quorum

this code snippet

  Fields fields = MultiFields.getFields(rdr);
  Terms terms = fields.terms("body");
  TermsEnum te = terms.iterator(null);
  while (te.next() != null) {
    String tt = te.term().utf8ToString();
    System.out.printf("%s totalFreq()=%s, docFreq=%s\n",
                      tt, te.totalTermFreq(), te.docFreq());
  }

displays

  foobar totalFreq()=2, docFreq=2
  quorum totalFreq()=1, docFreq=1
  quote totalFreq()=1, docFreq=1
  qux totalFreq()=3, docFreq=2

This is with a standard IndexReader as returned by
DirectoryReader.open(dir), on a RAMDirectory with 2 docs, so there
won't be many segments. But from my reading of the migration guide
you shouldn't need to use the composite reader.

Hope this helps - we are getting outside my area of expertise so
don't trust anything I say.

--
Ian.

On Thu, Jan 17, 2013 at 3:11 PM, Jon Stewart
<j...@lightboxtechnologies.com> wrote:
> D'oh!!!! Thanks!
>
> Does TermsEnum.totalTermFreq() return the per-doc frequencies? It
> looks like it empirically, but the documentation refers to corpus
> usage, not document.field usage.
>
> Jon
>
> On Thu, Jan 17, 2013 at 10:00 AM, Ian Lea <ian....@gmail.com> wrote:
>> Typo time. You need doc2.add(...), not two doc.add(...) statements.
>>
>>
>> --
>> Ian.
>>
>>
>> On Thu, Jan 17, 2013 at 2:49 PM, Jon Stewart
>> <j...@lightboxtechnologies.com> wrote:
>>> On Thu, Jan 17, 2013 at 9:08 AM, Robert Muir <rcm...@gmail.com> wrote:
>>>> Which statistics in particular (which methods)?
>>>
>>> I'd like to know the frequency of each term in each document. Those
>>> term counts for the most frequent terms in the corpus will make it
>>> into the document vectors for clustering.
>>>
>>> Looking at Terms and TermsEnum, I'm actually somewhat baffled about
>>> how to do this.
>>> Iterating over the TermsEnums in a Terms retrieved by
>>> IndexReader.getTermVector() will tell me about the presence of a term
>>> within a document, but I don't see a simple "count" or "freq" method
>>> in TermsEnum -- the methods there look like corpus statistics.
>>>
>>> Based on Ian's reply, I created the following one-file test program.
>>> The results I get are weird: I get a term vector back for the first
>>> document, but not for the second.
>>>
>>> Output:
>>> doc 0 had term 'baz'
>>> doc 0 had term 'foobar'
>>> doc 0 had term 'gibberish'
>>> doc 0 had 3 terms
>>> doc 1 had no term vector for body
>>>
>>> Thanks again for the responses and assistance.
>>>
>>> Jon
>>>
>>> import java.io.File;
>>> import java.io.IOException;
>>>
>>> import org.apache.lucene.analysis.standard.StandardAnalyzer;
>>>
>>> import org.apache.lucene.index.IndexWriter;
>>> import org.apache.lucene.index.IndexWriterConfig.OpenMode;
>>> import org.apache.lucene.index.IndexWriterConfig;
>>> import org.apache.lucene.index.FieldInfo.IndexOptions;
>>> import org.apache.lucene.index.CorruptIndexException;
>>> import org.apache.lucene.index.AtomicReader;
>>> import org.apache.lucene.index.IndexableField;
>>> import org.apache.lucene.index.Terms;
>>> import org.apache.lucene.index.TermsEnum;
>>> import org.apache.lucene.index.SlowCompositeReaderWrapper;
>>> import org.apache.lucene.index.DirectoryReader;
>>>
>>> import org.apache.lucene.store.Directory;
>>> import org.apache.lucene.store.FSDirectory;
>>>
>>> import org.apache.lucene.util.BytesRef;
>>> import org.apache.lucene.util.Version;
>>>
>>> import org.apache.lucene.document.Document;
>>> import org.apache.lucene.document.Field;
>>> import org.apache.lucene.document.StringField;
>>> import org.apache.lucene.document.FieldType;
>>>
>>> public class LuceneTest {
>>>
>>>   static void createIndex(final String path) throws IOException,
>>>       CorruptIndexException {
>>>     final Directory dir = FSDirectory.open(new File(path));
>>>     final StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_40);
>>>     final IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_40, analyzer);
>>>     iwc.setOpenMode(OpenMode.CREATE_OR_APPEND);
>>>     iwc.setRAMBufferSizeMB(256.0);
>>>     final IndexWriter writer = new IndexWriter(dir, iwc);
>>>
>>>     final FieldType bodyOptions = new FieldType();
>>>     bodyOptions.setIndexed(true);
>>>     bodyOptions.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
>>>     bodyOptions.setStored(true);
>>>     bodyOptions.setStoreTermVectors(true);
>>>     bodyOptions.setTokenized(true);
>>>
>>>     final Document doc = new Document();
>>>     doc.add(new Field("body", "this foobar is gibberish, baz", bodyOptions));
>>>     writer.addDocument(doc);
>>>
>>>     final Document doc2 = new Document();
>>>     doc.add(new Field("body", "I don't know what to tell you, qux. Some foobar is just fubar.", bodyOptions));
>>>     writer.addDocument(doc2);
>>>
>>>     writer.close();
>>>   }
>>>
>>>   static void readIndex(final String path) throws IOException,
>>>       CorruptIndexException {
>>>     final DirectoryReader dirReader = DirectoryReader.open(FSDirectory.open(new File(path)));
>>>     final SlowCompositeReaderWrapper rdr = new SlowCompositeReaderWrapper(dirReader);
>>>
>>>     int max = rdr.maxDoc();
>>>
>>>     TermsEnum term = null;
>>>     // iterate docs
>>>     for (int i = 0; i < max; ++i) {
>>>       // get term vector for body field
>>>       final Terms terms = rdr.getTermVector(i, "body");
>>>       if (terms != null) {
>>>         // count terms in doc
>>>         int numTerms = 0;
>>>         term = terms.iterator(term);
>>>         while (term.next() != null) {
>>>           System.out.println("doc " + i + " had term '" + term.term().utf8ToString() + "'");
>>>           ++numTerms;
>>>
>>>           // would like to record doc term frequencies here, i.e.,
>>>           // counts[i][term.term()] = term.freq()
>>>         }
>>>         System.out.println("doc " + i + " had " + numTerms + " terms");
>>>       }
>>>       else {
>>>         System.err.println("doc " + i + " had no term vector for body");
>>>       }
>>>     }
>>>   }
>>>
>>>   public static void main(String[] args) throws IOException,
>>>       InterruptedException, CorruptIndexException {
>>>     final String path = args[0];
>>>     createIndex(path);
>>>     readIndex(path);
>>>   }
>>> }
>>>
>>> --
>>> Jon Stewart, Principal
>>> (646) 719-0317 | j...@lightboxtechnologies.com | Arlington, VA
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>>
>
> --
> Jon Stewart, Principal
> (646) 719-0317 | j...@lightboxtechnologies.com | Arlington, VA
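A note on the underlying question in this thread: in the Lucene 4.0 API a per-document term vector behaves like a tiny one-document index, so the in-document count is reachable from the term vector's TermsEnum via its DocsEnum (termsEnum.docs(null, null), then docsEnum.nextDoc() and docsEnum.freq()). The counts[i][term] table Jon describes is just a per-document frequency map; as a minimal illustration of that target data structure, here is a plain-Java sketch with no Lucene dependency (PerDocTermFreqs is a made-up name, and the naive lowercase/split tokenizer is only a stand-in for StandardAnalyzer):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class PerDocTermFreqs {

    // Build the table Jon describes wanting: counts.get(docId).get(term)
    // is the frequency of that term within that single document.
    static List<Map<String, Integer>> countTerms(List<String> docs) {
        List<Map<String, Integer>> counts = new ArrayList<>();
        for (String body : docs) {
            Map<String, Integer> tf = new LinkedHashMap<>();
            // crude stand-in for an analyzer: lowercase, split on non-word chars
            for (String term : body.toLowerCase().split("\\W+")) {
                if (!term.isEmpty()) {
                    tf.merge(term, 1, Integer::sum);
                }
            }
            counts.add(tf);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Ian's two example docs from earlier in the thread
        List<Map<String, Integer>> counts = countTerms(Arrays.asList(
                "foobar qux quote",
                "foobar qux qux quorum"));
        // prints [{foobar=1, qux=1, quote=1}, {foobar=1, qux=2, quorum=1}]
        System.out.println(counts);
    }
}
```

With the real API, the body of the while loop in Jon's readIndex is the place to fill such a table, using the DocsEnum freq rather than the corpus-level totalTermFreq/docFreq statistics.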