Hi!

I believe the approach below can help you:
http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/misc/src/java/org/apache/lucene/misc/HighFreqTerms.java
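To illustrate the general idea, here is a sketch of my own (it is not code from HighFreqTerms). Rather than fetching a term vector for every collected document, one could walk the postings of the single term once and add each document's frequency to that document's date bucket. The arrays are hypothetical stand-ins: in Lucene 4.x the (docId, freq) pairs would presumably come from the term's DocsEnum, and the dates from the FieldCache array the collector already uses. The class and method names are made up for illustration.

```java
import java.util.Map;
import java.util.TreeMap;

public class TermDateSeries {

    /**
     * Sums the frequencies of one term per date bucket.
     *
     * postingDocs[i] and postingFreqs[i] are assumed stand-ins for the i-th
     * posting of the term (document id and frequency in that document).
     * datesByDoc[docId] is an assumed stand-in for the cached numeric date field.
     */
    public static Map<Integer, Long> sumByDate(int[] postingDocs,
                                               int[] postingFreqs,
                                               int[] datesByDoc) {
        // TreeMap keeps the buckets ordered by date, which suits a time series.
        Map<Integer, Long> series = new TreeMap<>();
        for (int i = 0; i < postingDocs.length; i++) {
            int date = datesByDoc[postingDocs[i]];
            // Add this document's frequency to the running total for its date.
            series.merge(date, (long) postingFreqs[i], Long::sum);
        }
        return series;
    }

    public static void main(String[] args) {
        // Four documents; documents 0 and 1 share a date.
        int[] datesByDoc = {20140122, 20140122, 20140123, 20140124};
        // The term occurs in documents 0, 1 and 3.
        int[] docs = {0, 1, 3};
        int[] freqs = {2, 1, 4};
        System.out.println(sumByDate(docs, freqs, datesByDoc));
        // prints {20140122=3, 20140124=4}
    }
}
```

Each posting is visited exactly once, so the work grows with the number of documents containing the term, not with the size of every document's term vector. Dates on which the term never occurs simply have no entry in the map.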
Marcio
http://numere.stela.org.br
Go beyond Lucene™ features with Numere®

2014/1/24 Witdouck, Xavier <xavier.witdo...@blackrock.com>

> Hi all,
>
> We have over 6 million documents in our index, and would like to construct
> a term frequency matrix over all 6 million documents as quickly as
> possible. Each document has a numeric date field, so we would like to
> build a time series which contains values which are the sum of all
> frequencies for documents on that date. So for example, if the term was
> "iPhone", we would want a time series which contained the sum of all iPhone
> mentions across all buckets, but decomposed into time buckets.
>
> The approach we have tried is to write a custom Collector as below, but
> this seems really, really slow...any way of approaching this differently to
> make it perform much better?
>
> @Override()
> public void collect(int docId) throws IOException {
>     try {
>         ++collectCount;
>         if (reader != null) {
>             final Terms terms = reader.getTermVector(docId, field);
>             termsEnum = terms.iterator(termsEnum);
>             final int colIndex = matrix.columns().add(term);
>             if (termsEnum.seekExact(termRef)) {
>                 final DocsAndPositionsEnum docsAndPositionsEnum =
>                     termsEnum.docsAndPositions(null, null, DocsAndPositionsEnum.FLAG_FREQS);
>                 while (docsAndPositionsEnum.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
>                     final int date = dates.get(docId);
>                     final int freq = docsAndPositionsEnum.freq();
>                     final int rowIndex = matrix.rows().add(date);
>                     final double value = matrix.getDouble(rowIndex, colIndex);
>                     matrix.setDouble(rowIndex, colIndex, Double.isNaN(value) ? freq : value + freq);
>                     if (++docCount % 1000 == 0) {
>                         LOG.info("Processed " + docCount + " / " + collectCount
>                             + " documents in term frequency analysis...");
>                     }
>                 }
>             }
>         }
>     } catch (Throwable t) {
>         throw new RuntimeException("Failed to collect document " + docId, t);
>     }
> }
>
> @Override()
> public void setNextReader(AtomicReaderContext atomicReaderContext) throws IOException {
>     this.reader = atomicReaderContext.reader();
>     this.dates = FieldCache.DEFAULT.getInts(reader, "date", false);
> }
>
> Any help would be much appreciated...
>
> Thanks,
> Zav