Hi all,
We have over 6 million documents in our index, and would like to construct a
term frequency matrix over all 6 million documents as quickly as possible.
Each document has a numeric date field, so we would like to build a time series
whose value for each date is the sum of the term's frequencies over all
documents on that date. For example, if the term were "iPhone", we would want
the total number of iPhone mentions across all documents, broken down into
date buckets.
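To make the target output concrete, here is a minimal sketch of the
aggregation we are after, independent of Lucene. The class and method names
are made up for illustration, and the yyyyMMdd integer encoding of the date is
only an assumption about how a numeric date field might look.

```java
import java.util.Map;
import java.util.TreeMap;

public class TermTimeSeries {

    /**
     * Sums per-document term frequencies into one bucket per date.
     * dates[i] and freqs[i] describe the same (hypothetical) document i.
     */
    public static Map<Integer, Long> sumByDate(int[] dates, int[] freqs) {
        if (dates.length != freqs.length) {
            throw new IllegalArgumentException("dates and freqs must have the same length");
        }
        // TreeMap keeps the buckets sorted by date, which gives the time series order.
        final Map<Integer, Long> series = new TreeMap<Integer, Long>();
        for (int i = 0; i < dates.length; i++) {
            final Long current = series.get(dates[i]);
            // Start a new bucket at 0, otherwise add to the running total.
            final long sum = (current == null ? 0L : current) + freqs[i];
            series.put(dates[i], sum);
        }
        return series;
    }

    public static void main(String[] args) {
        // Made-up data: three documents dated 20130601 and one dated 20130602.
        final int[] dates = {20130601, 20130602, 20130601, 20130601};
        final int[] freqs = {2, 5, 1, 4};
        System.out.println(sumByDate(dates, freqs)); // prints {20130601=7, 20130602=5}
    }
}
```

Each map entry corresponds to one row of the matrix for a single term column.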
The approach we have tried is a custom Collector, shown below, which reads the
term vector of every collected document, but it is very slow. Is there a
different way to approach this that would perform much better?
@Override
public void collect(int docId) throws IOException {
    try {
        ++collectCount;
        if (reader != null) {
            // Term vector for this document; null if none was indexed for the field.
            final Terms terms = reader.getTermVector(docId, field);
            if (terms == null) {
                return;
            }
            termsEnum = terms.iterator(termsEnum);
            final int colIndex = matrix.columns().add(term);
            if (termsEnum.seekExact(termRef)) {
                final DocsAndPositionsEnum docsAndPositionsEnum =
                    termsEnum.docsAndPositions(null, null, DocsAndPositionsEnum.FLAG_FREQS);
                while (docsAndPositionsEnum.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
                    final int date = dates.get(docId);
                    final int freq = docsAndPositionsEnum.freq();
                    final int rowIndex = matrix.rows().add(date);
                    final double value = matrix.getDouble(rowIndex, colIndex);
                    // Start the cell at freq if it is still NaN, otherwise accumulate.
                    matrix.setDouble(rowIndex, colIndex,
                        Double.isNaN(value) ? freq : value + freq);
                    if (++docCount % 1000 == 0) {
                        LOG.info("Processed " + docCount + " / " + collectCount
                            + " documents in term frequency analysis...");
                    }
                }
            }
        }
    } catch (Throwable t) {
        throw new RuntimeException("Failed to collect document " + docId, t);
    }
}
@Override
public void setNextReader(AtomicReaderContext atomicReaderContext) throws IOException {
    // Called once per index segment: switch to that segment's reader and cached dates.
    this.reader = atomicReaderContext.reader();
    this.dates = FieldCache.DEFAULT.getInts(reader, "date", false);
}
Any help would be much appreciated...
Thanks,
Zav