[
https://issues.apache.org/jira/browse/SOLR-9252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15392141#comment-15392141
]
Cao Manh Dat edited comment on SOLR-9252 at 7/25/16 3:59 PM:
-------------------------------------------------------------
I've been thinking about the proposal to change *tlogit* to a *train*
function. Different algorithms have different sets of parameters; for example,
*tlogit* and *logit* have totally different parameters. So I think we should
change *featuresSelection* to *features* but keep *tlogit* as it is.
[~joel.bernstein] +1 for summing up the igain scores from all shards, so we
can get the best terms across all shards. This is not yet proven, because it
is based on a lot of assumptions about how documents, classes, and terms are
distributed. Still, I think it will be good enough for most cases. If you
don't have any comments, I will submit a fixed patch soon.
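The shard-summing idea above can be sketched roughly as follows. This is only
an illustration in Python, not the code in the patch: the function names
(`entropy`, `information_gain`, `merge_shard_scores`), the per-shard count
layout, and all of the numbers are hypothetical. Each shard scores its own
terms from local counts, and the coordinator sums the per-shard scores and
keeps the top terms. Summing local scores is a heuristic, so the result
depends on how documents, classes, and terms are spread across shards.

```python
import math
from collections import defaultdict


def entropy(pos, neg):
    """Binary entropy (in bits) of a set with `pos` positive and `neg` negative docs."""
    total = pos + neg
    if total == 0:
        return 0.0
    h = 0.0
    for count in (pos, neg):
        if count > 0:
            p = count / total
            h -= p * math.log2(p)
    return h


def information_gain(pos_with, neg_with, pos_total, neg_total):
    """Information gain of a term's presence with respect to a binary label.

    pos_with / neg_with: positive / negative docs that contain the term.
    pos_total / neg_total: all positive / negative docs on this shard.
    """
    total = pos_total + neg_total
    if total == 0:
        return 0.0
    with_term = pos_with + neg_with
    without_term = total - with_term
    conditional = (
        (with_term / total) * entropy(pos_with, neg_with)
        + (without_term / total)
        * entropy(pos_total - pos_with, neg_total - neg_with)
    )
    return entropy(pos_total, neg_total) - conditional


def merge_shard_scores(shard_scores, num_terms):
    """Sum per-shard scores for each term and return the top `num_terms` terms."""
    summed = defaultdict(float)
    for scores in shard_scores:
        for term, score in scores.items():
            summed[term] += score
    # Highest summed score first; ties are broken alphabetically (an assumption).
    ranked = sorted(summed.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:num_terms]


# Hypothetical counts: term -> (positive docs with term, negative docs with term)
shard1 = {"free": (40, 5), "meeting": (10, 35), "the": (50, 50)}
shard2 = {"free": (30, 10), "meeting": (10, 30)}

# Shard 1 holds 50 positive and 50 negative docs; shard 2 holds 40 and 40.
scores1 = {t: information_gain(p, n, 50, 50) for t, (p, n) in shard1.items()}
scores2 = {t: information_gain(p, n, 40, 40) for t, (p, n) in shard2.items()}

top_terms = merge_shard_scores([scores1, scores2], num_terms=2)
```

A term that appears in every document, such as "the" here, gets a score of
zero on its shard and drops out of the merged top terms.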
> Feature selection and logistic regression on text
> -------------------------------------------------
>
> Key: SOLR-9252
> URL: https://issues.apache.org/jira/browse/SOLR-9252
> Project: Solr
> Issue Type: Improvement
> Security Level: Public(Default Security Level. Issues are Public)
> Reporter: Cao Manh Dat
> Assignee: Joel Bernstein
> Attachments: SOLR-9252.patch, SOLR-9252.patch, SOLR-9252.patch,
> SOLR-9252.patch, SOLR-9252.patch, SOLR-9252.patch, enron1.zip
>
>
> SOLR-9186 came up with the challenge that in each iteration we have to
> rebuild the tf-idf vector for each document. This is a costly computation if
> we represent a document by a lot of terms. Feature selection can help reduce
> the computation.
> Due to its computational efficiency and simple interpretation, information
> gain is one of the most popular feature selection methods. It is used to
> measure the dependence between features and labels and calculates the
> information gain between the i-th feature and the class labels
> (http://www.jiliang.xyz/publication/feature_selection_for_classification.pdf).
> I confirmed that by running logistic regression on the Enron mail dataset (in
> which each email is represented by the top 100 terms that have the highest
> information gain) and got an accuracy of 92% and a precision of 82%.
> This ticket will create two new streaming expressions. Both of them use the
> same *parallel iterative framework* as SOLR-8492.
> {code}
> featuresSelection(collection1, q="*:*", field="tv_text", outcome="out_i",
>                   positiveLabel=1, numTerms=100)
> {code}
> featuresSelection will emit the top terms that have the highest information
> gain scores. It can be combined with the new tlogit stream.
> {code}
> tlogit(collection1, q="*:*",
>        featuresSelection(collection1,
>                          q="*:*",
>                          field="tv_text",
>                          outcome="out_i",
>                          positiveLabel=1,
>                          numTerms=100),
>        field="tv_text",
>        outcome="out_i",
>        maxIterations=100)
> {code}
> In iteration n, the text logistic regression will emit the nth model and
> compute the error of the (n-1)th model, because the error will be wrong if
> we compute it dynamically in each iteration.
> In each iteration tlogit will change the learning rate based on the error of
> the previous iteration. It will increase the learning rate by 5% if the error
> is going down and decrease it by 50% if the error is going up.
> This will support use cases such as building models for spam detection,
> sentiment analysis and threat detection.
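
The learning-rate rule in the quoted description can be sketched as below.
This is a minimal Python illustration, not the patch code: the function name
`adjust_learning_rate`, the starting rate, and the error sequence are
hypothetical. Leaving the rate unchanged when the error does not move is my
assumption, since the description does not cover that case.

```python
def adjust_learning_rate(rate, previous_error, current_error):
    """Grow the rate by 5% when the error went down, halve it when it went up."""
    if current_error < previous_error:
        return rate * 1.05
    if current_error > previous_error:
        return rate * 0.5
    # Assumption: keep the rate as-is when the error is unchanged.
    return rate


# Hypothetical errors reported for successive models.
errors = [120.0, 100.0, 90.0, 95.0, 80.0]

rate = 0.01
rates = [rate]
for previous_error, current_error in zip(errors, errors[1:]):
    rate = adjust_learning_rate(rate, previous_error, current_error)
    rates.append(rate)
```

The asymmetry (small increases, large decreases) means a single bad step
cancels out many good ones, which keeps the rate conservative once the error
starts to rise.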