Hi!

I believe the approach used in Lucene's HighFreqTerms tool, linked below, can help you.
http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/misc/src/java/org/apache/lucene/misc/HighFreqTerms.java
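
As a rough illustration of the idea (not code taken from HighFreqTerms itself): rather than loading a term vector for every collected document, walk the postings of the term once and add each frequency to the bucket for that document's date. The sketch below is plain Java with no Lucene dependency. The class name TermDateHistogram, the method sumByDate, and the three arrays are hypothetical stand-ins: postingDocIds and postingFreqs represent whatever a postings enumeration for the term would yield, and docDates represents the per-document date values you already load from the field cache.

```java
import java.util.Map;
import java.util.TreeMap;

public class TermDateHistogram {

    /**
     * Sums term frequencies per date bucket.
     *
     * postingDocIds[i] and postingFreqs[i] describe the i-th posting of the
     * term (hypothetical stand-ins for a postings enumeration);
     * docDates[docId] is the numeric date of each document.
     */
    public static Map<Integer, Long> sumByDate(int[] postingDocIds,
                                               int[] postingFreqs,
                                               int[] docDates) {
        if (postingDocIds.length != postingFreqs.length) {
            throw new IllegalArgumentException("docIds and freqs must have equal length");
        }
        // TreeMap keeps the resulting time series ordered by date.
        Map<Integer, Long> series = new TreeMap<>();
        for (int i = 0; i < postingDocIds.length; i++) {
            int date = docDates[postingDocIds[i]];
            // One map update per posting; documents without the term are never touched.
            series.merge(date, (long) postingFreqs[i], Long::sum);
        }
        return series;
    }

    public static void main(String[] args) {
        int[] docDates = {20140101, 20140101, 20140102, 20140103};
        int[] docIds = {0, 1, 3};   // documents that contain the term
        int[] freqs = {2, 1, 5};    // frequency of the term in each of them
        System.out.println(sumByDate(docIds, freqs, docDates));
        // prints {20140101=3, 20140103=5}
    }
}
```

The work here is proportional to the number of documents that contain the term, not to the number of documents collected, which is where I would expect most of the saving to come from.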

Marcio
http://numere.stela.org.br
Go beyond Lucene™ features with Numere®

2014/1/24 Witdouck, Xavier <xavier.witdo...@blackrock.com>

> Hi all,
>
> We have over 6 million documents in our index, and would like to construct
> a term frequency matrix over all 6 million documents as quickly as
> possible.  Each document has a numeric date field, so we would like to
> build a time series which contains values which are the sum of all
> frequencies for documents on that date.  So for example, if the term were
> "iPhone", we would want a time series containing the total number of iPhone
> mentions across all documents, decomposed into time buckets.
>
> The approach we have tried is to write a custom Collector as below, but
> this seems really, really slow...any way of approaching this differently to
> make it perform much better?
>
> @Override()
> public void collect(int docId) throws IOException {
>     try {
>       ++collectCount;
>       if (reader != null) {
>           final Terms terms = reader.getTermVector(docId, field);
>           if (terms == null) return; // no term vector stored for this field
>           termsEnum = terms.iterator(termsEnum);
>           final int colIndex = matrix.columns().add(term);
>           if (termsEnum.seekExact(termRef)) {
>             final DocsAndPositionsEnum docsAndPositionsEnum =
>                 termsEnum.docsAndPositions(null, null, DocsAndPositionsEnum.FLAG_FREQS);
>             while (docsAndPositionsEnum.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
>                 final int date = dates.get(docId);
>                 final int freq = docsAndPositionsEnum.freq();
>                 final int rowIndex = matrix.rows().add(date);
>                 final double value = matrix.getDouble(rowIndex, colIndex);
>                 matrix.setDouble(rowIndex, colIndex,
>                     Double.isNaN(value) ? freq : value + freq);
>                 if (++docCount % 1000 == 0) {
>                   LOG.info("Processed " + docCount + " / " + collectCount
>                       + " documents in term frequency analysis...");
>                 }
>             }
>           }
>       }
>     } catch (Throwable t) {
>       throw new RuntimeException("Failed to collect document " + docId, t);
>     }
> }
>
> @Override()
> public void setNextReader(AtomicReaderContext atomicReaderContext) throws IOException {
>     this.reader = atomicReaderContext.reader();
>     this.dates = FieldCache.DEFAULT.getInts(reader, "date", false);
> }
>
> Any help would be much appreciated...
>
> Thanks,
> Zav