I got the patch before JIRA went down, and just noticed another thing:
+  private double countInClassC(String c) throws IOException {
+    TopDocs topDocs = indexSearcher.search(new TermQuery(new Term(classFieldName, c)), Integer.MAX_VALUE);
+    int res = 0;
+    for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
+      Fields termVectors = indexSearcher.getIndexReader().getTermVectors(scoreDoc.doc);
+      if (termVectors != null) {
+        res += termVectors.terms(textFieldName).size();
+      } else {
+        // TODO : warn about not existing term vectors for field 'textFieldName'
+      }
+    }
+    return res;
+  }
For this part, I am unsure which statistic you are after:
It currently seems to take all documents that have term c in
field classFieldName, and sum the number of unique terms each of
those docs has in field textFieldName?
If this is really what you want and you need 100% exact numbers, then,
just like for the other computation, I would not do a search with a
PQ (priority queue) of Integer.MAX_VALUE, but instead just iterate
over a DocsEnum for that term.
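To illustrate the exact variant without depending on Lucene, here is a minimal sketch over a toy in-memory postings map. The class name, the map layout and the per-document counts are all hypothetical stand-ins, not Lucene APIs; the point is only that one pass over the postings of c gives the exact sum with no scoring and no priority queue.

```java
import java.util.HashMap;
import java.util.Map;

// Toy stand-in for an inverted index; none of these names are Lucene APIs.
class ExactClassTermCount {

    // postings: class label -> ids of the documents carrying that label
    // uniqueTermsPerDoc[docId]: number of distinct terms in the doc's text field
    static long countInClass(Map<String, int[]> postings, int[] uniqueTermsPerDoc, String c) {
        int[] docs = postings.get(c);
        if (docs == null) {
            return 0; // the label never occurs in the index
        }
        long sum = 0;
        for (int doc : docs) { // single pass over the postings, no scoring or sorting
            sum += uniqueTermsPerDoc[doc];
        }
        return sum;
    }

    public static void main(String[] args) {
        Map<String, int[]> postings = new HashMap<>();
        postings.put("sports", new int[] {0, 2});
        postings.put("politics", new int[] {1, 3});
        int[] uniqueTermsPerDoc = {5, 7, 3, 9};
        // docs 0 and 2 are labelled "sports": 5 + 3 = 8
        System.out.println(countInClass(postings, uniqueTermsPerDoc, "sports"));
    }
}
```

In real code the loop body would read the per-document count from wherever it is stored (term vectors in the patch above), which is the expensive part that the approximation avoids.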
But if a good approximation is ok, I would do this, which is instant
and needs no term vectors:
Terms terms = MultiFields.getTerms(reader, textFieldName);
long numPostings = terms.getSumDocFreq(); // number of term/doc pairs
// avg # of unique terms per doc
double avgNumberOfUniqueTerms = numPostings / (double) terms.getDocCount();
// avg # of unique terms per doc * # docs with c
return avgNumberOfUniqueTerms * reader.docFreq(new Term(classFieldName, c));
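As a worked example of that arithmetic, here is a small sketch with made-up numbers instead of a real index. The class and parameter names are hypothetical; the three inputs stand for the sum of document frequencies of the text field, the number of docs that have the field, and the document frequency of c.

```java
// Plain-Java illustration of the estimate; the inputs are invented, not read from an index.
class ApproxClassTermCount {

    // sumDocFreq: number of term/doc pairs in the text field
    // docCount: number of docs with at least one term in that field
    // docFreqOfClass: number of docs labelled with class c
    static double estimate(long sumDocFreq, int docCount, int docFreqOfClass) {
        if (docCount == 0) {
            return 0.0; // empty field, nothing to average
        }
        double avgUniqueTermsPerDoc = sumDocFreq / (double) docCount;
        return avgUniqueTermsPerDoc * docFreqOfClass;
    }

    public static void main(String[] args) {
        // Four docs with 5, 7, 3 and 9 unique terms: 24 term/doc pairs, average 6.0.
        // Two of them (the ones with 5 and 3 terms) carry class c, so the exact sum is 8.
        System.out.println(estimate(24, 4, 2)); // estimate is 6.0 * 2 = 12.0
    }
}
```

The estimate assumes docs of class c are about as long as the average doc, so on a tiny or skewed sample like this one it can be noticeably off (12.0 versus an exact 8).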
On Fri, Aug 10, 2012 at 8:36 AM, Tommaso Teofili (JIRA) <[email protected]> wrote:
>
> [
> https://issues.apache.org/jira/browse/SOLR-3700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
> ]
>
> Tommaso Teofili updated SOLR-3700:
> ----------------------------------
>
> Attachment: SOLR-3700_2.patch
>
> new patch incorporating Robert's suggestions (plus added a couple more TODOs)
>
>> Create a Classification component
>> ---------------------------------
>>
>> Key: SOLR-3700
>> URL: https://issues.apache.org/jira/browse/SOLR-3700
>> Project: Solr
>> Issue Type: New Feature
>> Reporter: Tommaso Teofili
>> Priority: Minor
>> Attachments: SOLR-3700.patch, SOLR-3700_2.patch
>>
>>
>> Lucene/Solr can host huge sets of documents containing lots of information
>> in fields so that these can be used as training examples (w/ features) in
>> order to very quickly create classifiers algorithms to use on new documents
>> and / or to provide an additional service.
>> So the idea is to create a contrib module (called 'classification') to host
>> a ClassificationComponent that will use already seen data (the indexed
>> documents / fields) to classify new documents / text fragments.
>> The first version will contain a (simplistic) Lucene based Naive Bayes
>> classifier but more implementations should be added in the future.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
>
--
lucidimagination.com