[jira] [Updated] (SOLR-9252) Feature selection and logistic regression on text

Joel Bernstein (JIRA) Thu, 04 Aug 2016 12:06:33 -0700

     [ https://issues.apache.org/jira/browse/SOLR-9252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Joel Bernstein updated SOLR-9252:
---------------------------------
    Description: 
This ticket adds two new streaming expressions: *features* and *train*.

These two functions work together to train a logistic regression model on text, 
from a training set stored in a SolrCloud collection.

The syntax is as follows:

{code}
train(collection1, q="*:*",
      features(collection1,
               q="*:*",
               field="tv_text",
               outcome="out_i",
               positiveLabel=1,
               numTerms=100),
      field="tv_text",
      outcome="out_i",
      maxIterations=100)
{code}

The *features* function extracts the feature terms from a training set, using 
*information gain* to score the terms 
(http://www.jiliang.xyz/publication/feature_selection_for_classification.pdf).
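
To make the scoring step concrete, here is a minimal Python sketch of 
information gain for a binary outcome. It is an illustration only, not the code 
in the attached patch: the helper names (entropy, information_gain, top_terms) 
and the representation of each document as a set of terms are assumptions made 
for this example.

```python
import math

def entropy(pos, neg):
    """Binary entropy, in bits, of a group with pos positive and neg negative docs."""
    total = pos + neg
    if total == 0:
        return 0.0
    h = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            h -= p * math.log2(p)
    return h

def information_gain(docs, labels, term):
    """H(label) minus H(label | term present or absent).

    docs is a list of term sets, labels a parallel list of 0/1 outcomes
    (hypothetical input format, chosen for this sketch).
    """
    n = len(labels)
    with_term = [y for d, y in zip(docs, labels) if term in d]
    without_term = [y for d, y in zip(docs, labels) if term not in d]
    h_label = entropy(sum(labels), n - sum(labels))
    h_conditional = 0.0
    for part in (with_term, without_term):
        if part:
            h_conditional += len(part) / n * entropy(sum(part), len(part) - sum(part))
    return h_label - h_conditional

def top_terms(docs, labels, num_terms):
    """Return the num_terms terms with the highest information gain (ties by name)."""
    vocabulary = set().union(*docs)
    ranked = sorted(vocabulary, key=lambda t: (-information_gain(docs, labels, t), t))
    return ranked[:num_terms]
```

In this sketch, a term that appears in every positive document and in no 
negative one gets the maximum score, while a term spread evenly across both 
classes scores zero.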

The *train* function uses the extracted features to train a logistic regression 
model on a text field in the training set.
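
As a rough illustration of what training on extracted features involves, the 
following Python sketch fits a logistic regression model by batch gradient 
descent. It is not the implementation in the patch. It assumes each document is 
a set of terms encoded as 1/0 term-presence values (a simplification chosen for 
this example), and the names vectorize, train_logit and predict, as well as the 
fixed learning_rate, are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def vectorize(doc_terms, feature_terms):
    """Bias input followed by one 1/0 presence flag per feature term."""
    return [1.0] + [1.0 if t in doc_terms else 0.0 for t in feature_terms]

def train_logit(docs, labels, feature_terms, max_iterations=100, learning_rate=0.5):
    """Batch gradient descent on the logistic loss; returns the weight vector."""
    rows = [vectorize(d, feature_terms) for d in docs]
    weights = [0.0] * (len(feature_terms) + 1)
    for _ in range(max_iterations):
        gradient = [0.0] * len(weights)
        for x, y in zip(rows, labels):
            error = sigmoid(sum(w * xi for w, xi in zip(weights, x))) - y
            for j, xi in enumerate(x):
                gradient[j] += error * xi
        # Average the gradient over the training set before stepping.
        weights = [w - learning_rate * g / len(rows) for w, g in zip(weights, gradient)]
    return weights

def predict(weights, doc_terms, feature_terms):
    """Estimated probability that the document belongs to the positive class."""
    x = vectorize(doc_terms, feature_terms)
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)))
```

The maxIterations parameter in the expression above plays the same role as 
max_iterations here: it bounds the number of passes over the training set.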

Both the features and the models can be stored in a SolrCloud collection. With 
this approach, Solr can hold millions of models, which can be selectively 
deployed.

  was:
SOLR-9186 raised a challenge: in each iteration we have to rebuild the tf-idf 
vector for every document. This is a costly computation when each document is 
represented by many terms. Feature selection can help reduce the computation.

Due to its computational efficiency and simple interpretation, information gain 
is one of the most popular feature selection methods. It is used to measure the 
dependence between features and labels and calculates the information gain 
between the i-th feature and the class labels 
(http://www.jiliang.xyz/publication/feature_selection_for_classification.pdf).

I confirmed this by running logistic regression on the Enron mail dataset (with 
each email represented by the 100 terms that have the highest information 
gain), which gave 92% accuracy and 82% precision.

This ticket will create two new streaming expressions. Both of them use the 
same *parallel iterative framework* as SOLR-8492.

{code}
featuresSelection(collection1, q="*:*",  field="tv_text", outcome="out_i", 
positiveLabel=1, numTerms=100)
{code}

featuresSelection emits the top terms with the highest information gain 
scores. It can be combined with the new tlogit stream.

{code}
tlogit(collection1, q="*:*",
         featuresSelection(collection1, 
                                      q="*:*",  
                                      field="tv_text", 
                                      outcome="out_i", 
                                      positiveLabel=1, 
                                      numTerms=100),
         field="tv_text",
         outcome="out_i",
         maxIterations=100)
{code}

In iteration n, the text logistic regression emits the nth model and computes 
the error of the (n-1)th model, because the error would be wrong if it were 
computed dynamically within the same iteration. 
In each iteration, tlogit adjusts the learning rate based on the error of the 
previous iteration: it increases the learning rate by 5% if the error is going 
down and decreases it by 50% if the error is going up.
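
The learning-rate rule described above can be sketched in a few lines of 
Python. The function name adjust_learning_rate is hypothetical, and leaving the 
rate unchanged when the error stays the same is an assumption, since the 
description does not cover that case.

```python
def adjust_learning_rate(rate, previous_error, current_error):
    """Grow the rate by 5% when the error falls, halve it when the error rises."""
    if current_error < previous_error:
        return rate * 1.05
    if current_error > previous_error:
        return rate * 0.5
    # Assumption: an unchanged error leaves the rate as it is.
    return rate
```

The asymmetry (small increases, large decreases) lets the rate creep up while 
training is stable and back off quickly once a step overshoots.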

This will support use cases such as building models for spam detection, 
sentiment analysis and threat detection. 


> Feature selection and logistic regression on text
> -------------------------------------------------
>
>                 Key: SOLR-9252
>                 URL: https://issues.apache.org/jira/browse/SOLR-9252
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: search, SolrCloud, SolrJ
>            Reporter: Cao Manh Dat
>            Assignee: Joel Bernstein
>              Labels: Streaming
>             Fix For: 6.2
>
>         Attachments: SOLR-9252.patch, SOLR-9252.patch, SOLR-9252.patch, 
> SOLR-9252.patch, SOLR-9252.patch, SOLR-9252.patch, SOLR-9252.patch, 
> SOLR-9252.patch, SOLR-9252.patch, SOLR-9252.patch



