[jira] [Updated] (SOLR-9252) Feature selection and logistic regression on text

2016-08-11 Thread Cao Manh Dat (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-9252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cao Manh Dat updated SOLR-9252:
---
Attachment: SOLR-9299-1.patch

A minor patch:
- In the training step, we now ignore documents that don't have any of the given features.
- Add regularization for the logistic regression ( 
http://www.holehouse.org/mlclass/07_Regularization.html )
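
For illustration, below is a minimal sketch of a batch gradient descent training step that skips documents containing none of the selected features and applies an L2 regularization term, as the notes above describe. All names are hypothetical; this is not the code in the attached patch.

{code}
// Illustrative sketch only: a regularized batch gradient descent step for
// logistic regression. Names here are hypothetical, not from the patch.
public class TrainStepSketch {

  static double sigmoid(double z) {
    return 1.0 / (1.0 + Math.exp(-z));
  }

  /**
   * One training iteration over the batch.
   * docs[i]     = tf-idf vector of document i over the selected feature terms
   * outcomes[i] = 1 for the positive label, 0 otherwise
   * alpha       = learning rate, lambda = regularization strength
   */
  static void trainStep(double[] weights, double[][] docs, int[] outcomes,
                        double alpha, double lambda) {
    double[] gradient = new double[weights.length];
    int used = 0;
    for (int i = 0; i < docs.length; i++) {
      // Skip documents that contain none of the selected features.
      boolean empty = true;
      for (double v : docs[i]) {
        if (v != 0.0) { empty = false; break; }
      }
      if (empty) continue;
      used++;

      double z = 0.0;
      for (int j = 0; j < weights.length; j++) {
        z += weights[j] * docs[i][j];
      }
      double error = sigmoid(z) - outcomes[i];
      for (int j = 0; j < weights.length; j++) {
        gradient[j] += error * docs[i][j];
      }
    }
    if (used == 0) return;
    for (int j = 0; j < weights.length; j++) {
      // The lambda * weights[j] term is the L2 regularization penalty;
      // without it this is plain batch gradient descent.
      weights[j] -= alpha * (gradient[j] / used + lambda * weights[j]);
    }
  }
}
{code}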

> Feature selection and logistic regression on text
> -
>
> Key: SOLR-9252
> URL: https://issues.apache.org/jira/browse/SOLR-9252
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: search, SolrCloud, SolrJ
>Reporter: Cao Manh Dat
>Assignee: Joel Bernstein
>  Labels: Streaming
> Fix For: 6.2
>
> Attachments: SOLR-9252.patch, SOLR-9252.patch, SOLR-9252.patch, 
> SOLR-9252.patch, SOLR-9252.patch, SOLR-9252.patch, SOLR-9252.patch, 
> SOLR-9252.patch, SOLR-9252.patch, SOLR-9252.patch, SOLR-9299-1.patch
>
>
> This ticket adds two new streaming expressions: *features* and *train*
> These two functions work together to train a logistic regression model on 
> text, from a training set stored in a SolrCloud collection.
> The syntax is as follows:
> {code}
> train(collection1, q="*:*",
>   features(collection1, 
>q="*:*",  
>field="body", 
>outcome="out_i", 
>positiveLabel=1, 
>numTerms=100),
>   field="body",
>   outcome="out_i",
>   maxIterations=100)
> {code}
> The *features* function extracts the feature terms from a training set using 
> *information gain* to score the terms. 
> http://www.jiliang.xyz/publication/feature_selection_for_classification.pdf
> The *train* function uses the extracted features to train a logistic 
> regression model on a text field in the training set.
> For both *features* and *train* the training set is defined by a query. The 
> doc vectors in the *train* function use tf-idf to represent the terms in the 
> document. The idf is calculated for the specific training set, allowing 
> multiple training sets to be stored in the same collection without polluting 
> the idf. 
> In the *train* function a batch gradient descent approach is used to 
> iteratively train the model.
> Both the *features* and the *train* function are embedded in Solr using the 
> AnalyticsQuery framework. So only the model is transported across the network 
> with each iteration.
> Both the features and the models can be stored in a SolrCloud collection. 
> Using this approach Solr can hold millions of models which can be selectively 
> deployed. For example a model could be trained for each user, to personalize 
> ranking and recommendations.
> Below is the final iteration of a model trained on the Enron Ham/Spam 
> dataset. The model includes the terms and their idfs and weights as well as a 
> classification evaluation describing the accuracy of the model on the training 
> set. 
> {code}
> {
>   "idfs_ds": [1.2627703388716238, 1.2043595767152093, 
> 1.3886172425360304, 1.5488587854881268, 1.6127302558747882, 
> 2.1359177807201526, 1.514866246141212, 1.7375701403808523, 
> 1.6166175299631897, 1.756428159015249, 1.7929202354640175, 
> 1.2834893120635762, 1.899442866302021, 1.8639061320252337, 
> 1.7631697575821685, 1.6820002892260415, 1.4411352768194767, 
> 2.103708877350535, 1.2225773869965861, 2.208893321170597, 1.878981794430681, 
> 2.043737027506736, 2.2819184561854864, 2.3264563106163885, 
> 1.9336117619172708, 2.0467265663551024, 1.7386696457142692, 
> 2.468795829515302, 2.069437610615317, 2.6294363202479327, 3.7388303845193307, 
> 2.5446615802900157, 1.7430797961918219, 3.0787440662202736, 
> 1.9579702057493114, 2.289523055570706, 1.5362003886162032, 
> 2.7549569891263763, 3.955894889757158, 2.587435396273302, 3.945844553903657, 
> 1.003513057076781, 3.0416264032637708, 2.248395764146843, 4.018415246738492, 
> 2.2876164773001246, 3.3636289340509933, 1.2438124251270097, 
> 2.733903579928544, 3.439026951535205, 0.6709665389201712, 0.9546224358275518, 
> 2.8080115520822657, 2.477970205791343, 2.2631561797299637, 
> 3.2378087608499606, 0.36177021415584676, 4.1083634834014315, 
> 4.120197941048435, 2.471081544796158, 2.424147775633, 2.92339362620, 
> 2.9269972337044097, 3.2987413118451183, 2.383498249003407, 4.168988105217867, 
> 2.877691472720256, 4.233526626355437, 3.8505343740993316, 2.3264563106163885, 
> 2.6429318017228174, 4.260555298743357, 3.0058372954121855, 
> 3.8688835127675283, 3.021585652380325, 3.0295538220295017, 
> 1.9620882623582288, 3.469610374907285, 3.945844553903657, 3.4821105376715167, 
> 4.3169082352944885, 2.520329479630485, 3.609372317282444, 3.070375816549757, 
> 4.220281399605417, 3.985484239117, 

[jira] [Updated] (SOLR-9252) Feature selection and logistic regression on text

2016-08-04 Thread Joel Bernstein (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-9252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joel Bernstein updated SOLR-9252:
-
Description: 
This ticket adds two new streaming expressions: *features* and *train*

These two functions work together to train a logistic regression model on text, 
from a training set stored in a SolrCloud collection.

The syntax is as follows:

{code}
train(collection1, q="*:*",
  features(collection1, 
   q="*:*",  
   field="body", 
   outcome="out_i", 
   positiveLabel=1, 
   numTerms=100),
  field="body",
  outcome="out_i",
  maxIterations=100)
{code}

The *features* function extracts the feature terms from a training set using 
*information gain* to score the terms. 
http://www.jiliang.xyz/publication/feature_selection_for_classification.pdf

The *train* function uses the extracted features to train a logistic regression 
model on a text field in the training set.

For both *features* and *train* the training set is defined by a query. The doc 
vectors in the *train* function use tf-idf to represent the terms in the 
document. The idf is calculated for the specific training set, allowing 
multiple training sets to be stored in the same collection without polluting 
the idf. 
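
To make the idf handling concrete, the illustrative sketch below (hypothetical names, not the actual Solr code) builds a tf-idf doc vector in which document frequencies are counted only over the documents matching the training-set query:

{code}
import java.util.List;
import java.util.Map;

// Sketch: tf-idf doc vector where idf is computed only from the training set.
// The feature terms and per-document term counts are assumed to come from the
// features() step; all names are hypothetical.
public class TfIdfSketch {

  static double[] docVector(Map<String, Integer> termFreqs,   // term counts for one doc
                            List<String> featureTerms,        // selected feature terms
                            int[] trainingSetDocFreqs,        // df of each term within the training set
                            int trainingSetSize) {
    double[] vector = new double[featureTerms.size()];
    for (int i = 0; i < featureTerms.size(); i++) {
      int tf = termFreqs.getOrDefault(featureTerms.get(i), 0);
      // idf uses only the training-set counts, so other training sets stored
      // in the same collection do not pollute it.
      double idf = Math.log((double) trainingSetSize / (trainingSetDocFreqs[i] + 1));
      vector[i] = tf * idf;
    }
    return vector;
  }
}
{code}

Because the document frequencies come from the training-set query rather than the whole collection, other training sets stored alongside it do not affect the weights.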

In the *train* function a batch gradient descent approach is used to 
iteratively train the model.
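
A bare-bones outline of that iterative loop, which repeatedly applies a batch update like the one sketched earlier (illustrative sketch with hypothetical names, not the Solr implementation):

{code}
// Sketch of the outer training loop: one batch gradient descent update per
// iteration, up to maxIterations. Names are hypothetical, not Solr's code.
public class BatchTrainingLoopSketch {

  interface BatchUpdater {
    // Applies one batch gradient descent update in place and returns the error.
    double update(double[] weights);
  }

  static double[] train(int numFeatures, int maxIterations, BatchUpdater updater) {
    double[] weights = new double[numFeatures];
    for (int iteration = 0; iteration < maxIterations; iteration++) {
      double error = updater.update(weights);
      // In Solr only the model (terms, idfs, weights, evaluation) is shipped
      // between iterations, not the documents.
      System.out.println("iteration " + iteration + " error " + error);
    }
    return weights;
  }

  public static void main(String[] args) {
    // Dummy updater that just exercises the loop.
    double[] model = train(3, 5, weights -> { weights[0] += 0.1; return 1.0 / (weights[0] + 1); });
    System.out.println(java.util.Arrays.toString(model));
  }
}
{code}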

Both the *features* and the *train* function are embedded in Solr using the 
AnalyticsQuery framework. So only the model is transported across the network 
with each iteration.

Both the features and the models can be stored in a SolrCloud collection. Using 
this approach Solr can hold millions of models which can be selectively 
deployed. For example a model could be trained for each user, to personalize 
ranking and recommendations.
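
A typical way to run such an expression from a client is to send it to Solr's /stream handler. The SolrJ sketch below follows the usual SolrStream pattern; the URL, collection and field names are placeholders rather than values from this ticket:

{code}
import org.apache.solr.client.solrj.io.Tuple;
import org.apache.solr.client.solrj.io.stream.SolrStream;
import org.apache.solr.client.solrj.io.stream.TupleStream;
import org.apache.solr.common.params.ModifiableSolrParams;

// Sketch: send a train(...) expression to the /stream handler and read back the
// per-iteration model tuples. URL, collection and field names are placeholders.
public class TrainExpressionClient {
  public static void main(String[] args) throws Exception {
    String expr =
        "train(collection1, q=\"*:*\"," +
        "      features(collection1, q=\"*:*\", field=\"body\"," +
        "               outcome=\"out_i\", positiveLabel=1, numTerms=100)," +
        "      field=\"body\", outcome=\"out_i\", maxIterations=100)";

    ModifiableSolrParams params = new ModifiableSolrParams();
    params.set("expr", expr);
    params.set("qt", "/stream");

    TupleStream stream = new SolrStream("http://localhost:8983/solr/collection1", params);
    try {
      stream.open();
      while (true) {
        Tuple tuple = stream.read();
        if (tuple.EOF) {
          break;
        }
        // Each tuple holds one iteration of the model (terms, idfs, weights, evaluation).
        System.out.println(tuple.fields);
      }
    } finally {
      stream.close();
    }
  }
}
{code}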

Below is the final iteration of a model trained on the Enron Ham/Spam dataset. 
The model includes the terms and their idfs and weights as well as a 
classification evaluation describing the accuracy of the model on the training set. 

{code}
{
"idfs_ds": [1.2627703388716238, 1.2043595767152093, 
1.3886172425360304, 1.5488587854881268, 1.6127302558747882, 2.1359177807201526, 
1.514866246141212, 1.7375701403808523, 1.6166175299631897, 1.756428159015249, 
1.7929202354640175, 1.2834893120635762, 1.899442866302021, 1.8639061320252337, 
1.7631697575821685, 1.6820002892260415, 1.4411352768194767, 2.103708877350535, 
1.2225773869965861, 2.208893321170597, 1.878981794430681, 2.043737027506736, 
2.2819184561854864, 2.3264563106163885, 1.9336117619172708, 2.0467265663551024, 
1.7386696457142692, 2.468795829515302, 2.069437610615317, 2.6294363202479327, 
3.7388303845193307, 2.5446615802900157, 1.7430797961918219, 3.0787440662202736, 
1.9579702057493114, 2.289523055570706, 1.5362003886162032, 2.7549569891263763, 
3.955894889757158, 2.587435396273302, 3.945844553903657, 1.003513057076781, 
3.0416264032637708, 2.248395764146843, 4.018415246738492, 2.2876164773001246, 
3.3636289340509933, 1.2438124251270097, 2.733903579928544, 3.439026951535205, 
0.6709665389201712, 0.9546224358275518, 2.8080115520822657, 2.477970205791343, 
2.2631561797299637, 3.2378087608499606, 0.36177021415584676, 
4.1083634834014315, 4.120197941048435, 2.471081544796158, 2.424147775633, 
2.92339362620, 2.9269972337044097, 3.2987413118451183, 2.383498249003407, 
4.168988105217867, 2.877691472720256, 4.233526626355437, 3.8505343740993316, 
2.3264563106163885, 2.6429318017228174, 4.260555298743357, 3.0058372954121855, 
3.8688835127675283, 3.021585652380325, 3.0295538220295017, 1.9620882623582288, 
3.469610374907285, 3.945844553903657, 3.4821105376715167, 4.3169082352944885, 
2.520329479630485, 3.609372317282444, 3.070375816549757, 4.220281399605417, 
3.985484239117, 3.6165408067610563, 3.788840805093992, 4.392131656532076, 
4.392131656532076, 2.837281934382379, 3.698984475972131, 4.331507034715641, 
2.360699334038601, 2.7368842080666815, 3.730733174286711, 3.1991566064156816, 
4.4238803548466565, 2.4665153268165767, 3.175736332207583, 3.2378087608499606, 
4.376627469996111, 3.3525177086259226, 3.28315658082842, 4.156565585219309, 
1.6462639699299098, 2.673278958112109, 4.331507034715641, 3.955894889757158, 
2.7764631943473397, 3.0497565293470212, 1.79060004880832, 3.6237610547345436, 
1.6244377066690232, 2.948895919012047, 3.175736332207583, 2.850571166501062, 
4.073677925413541, 2.725014632511298, 3.1573871935393867, 4.562030693327474, 
3.5403794457954922, 4.580722826339627, 4.580722826339627, 3.189722574182323, 
3.1665196771026594, 3.3306589148134234, 1.9745451708435238, 3.3306589148134234, 
2.795272526304836, 3.3415285870503273, 4.407880013500216, 4.4238803548466565, 
2.6902285164258823, 3.668212817305377, 4.543681554659277, 2.559550192783766, 
1.5452257206382456, 2.2631561797299637, 4.659194441781121, 3.2678110111537597, 

[jira] [Updated] (SOLR-9252) Feature selection and logistic regression on text

2016-08-04 Thread Joel Bernstein (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-9252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joel Bernstein updated SOLR-9252:
-
Description: 
This ticket adds two new streaming expressions: *features* and *train*

These two functions work together to train a logistic regression model on text, 
from a training set stored in a SolrCloud collection.

The syntax is as follows:

{code}
train(collection1, q="*:*",
  features(collection1, 
   q="*:*",  
   field="body", 
   outcome="out_i", 
   positiveLabel=1, 
   numTerms=100),
  field="body",
  outcome="out_i",
  maxIterations=100)
{code}

The *features* function extracts the feature terms from a training set using 
*information gain* to score the terms. 
http://www.jiliang.xyz/publication/feature_selection_for_classification.pdf

The *train* function uses the extracted features to train a logistic regression 
model on a text field in the training set.

For both *features* and *train* the training set is defined by a query. The doc 
vectors in the *train* function use tf-idf to represent the terms in the 
document. The idf is calculated for the specific training set, allowing 
multiple training sets to be stored in the same collection without polluting 
the idf. 

In the *train* function a batch gradient descent approach is used to 
iteratively train the model.

Both the *features* and the *train* function are embedded in Solr using the 
AnalyticsQuery framework. So only the model is transported across the network 
with each iteration.

Both the features and the models can be stored in a SolrCloud collection. Using 
this approach Solr can hold millions of models which can be selectively 
deployed.

Below is the final iteration of a model trained on the Enron Ham/Spam dataset. 
The model includes the terms and their idfs and weights as well as a 
classification evaluation describing the accuracy of the model on the training set. 

{code}
{
"idfs_ds": [1.2627703388716238, 1.2043595767152093, 
1.3886172425360304, 1.5488587854881268, 1.6127302558747882, 2.1359177807201526, 
1.514866246141212, 1.7375701403808523, 1.6166175299631897, 1.756428159015249, 
1.7929202354640175, 1.2834893120635762, 1.899442866302021, 1.8639061320252337, 
1.7631697575821685, 1.6820002892260415, 1.4411352768194767, 2.103708877350535, 
1.2225773869965861, 2.208893321170597, 1.878981794430681, 2.043737027506736, 
2.2819184561854864, 2.3264563106163885, 1.9336117619172708, 2.0467265663551024, 
1.7386696457142692, 2.468795829515302, 2.069437610615317, 2.6294363202479327, 
3.7388303845193307, 2.5446615802900157, 1.7430797961918219, 3.0787440662202736, 
1.9579702057493114, 2.289523055570706, 1.5362003886162032, 2.7549569891263763, 
3.955894889757158, 2.587435396273302, 3.945844553903657, 1.003513057076781, 
3.0416264032637708, 2.248395764146843, 4.018415246738492, 2.2876164773001246, 
3.3636289340509933, 1.2438124251270097, 2.733903579928544, 3.439026951535205, 
0.6709665389201712, 0.9546224358275518, 2.8080115520822657, 2.477970205791343, 
2.2631561797299637, 3.2378087608499606, 0.36177021415584676, 
4.1083634834014315, 4.120197941048435, 2.471081544796158, 2.424147775633, 
2.92339362620, 2.9269972337044097, 3.2987413118451183, 2.383498249003407, 
4.168988105217867, 2.877691472720256, 4.233526626355437, 3.8505343740993316, 
2.3264563106163885, 2.6429318017228174, 4.260555298743357, 3.0058372954121855, 
3.8688835127675283, 3.021585652380325, 3.0295538220295017, 1.9620882623582288, 
3.469610374907285, 3.945844553903657, 3.4821105376715167, 4.3169082352944885, 
2.520329479630485, 3.609372317282444, 3.070375816549757, 4.220281399605417, 
3.985484239117, 3.6165408067610563, 3.788840805093992, 4.392131656532076, 
4.392131656532076, 2.837281934382379, 3.698984475972131, 4.331507034715641, 
2.360699334038601, 2.7368842080666815, 3.730733174286711, 3.1991566064156816, 
4.4238803548466565, 2.4665153268165767, 3.175736332207583, 3.2378087608499606, 
4.376627469996111, 3.3525177086259226, 3.28315658082842, 4.156565585219309, 
1.6462639699299098, 2.673278958112109, 4.331507034715641, 3.955894889757158, 
2.7764631943473397, 3.0497565293470212, 1.79060004880832, 3.6237610547345436, 
1.6244377066690232, 2.948895919012047, 3.175736332207583, 2.850571166501062, 
4.073677925413541, 2.725014632511298, 3.1573871935393867, 4.562030693327474, 
3.5403794457954922, 4.580722826339627, 4.580722826339627, 3.189722574182323, 
3.1665196771026594, 3.3306589148134234, 1.9745451708435238, 3.3306589148134234, 
2.795272526304836, 3.3415285870503273, 4.407880013500216, 4.4238803548466565, 
2.6902285164258823, 3.668212817305377, 4.543681554659277, 2.559550192783766, 
1.5452257206382456, 2.2631561797299637, 4.659194441781121, 3.2678110111537597, 
3.878185905429842, 3.3525177086259226, 3.374865007317919, 3.780330115426083, 
4.376627469996111, 

[jira] [Updated] (SOLR-9252) Feature selection and logistic regression on text

2016-08-04 Thread Joel Bernstein (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-9252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joel Bernstein updated SOLR-9252:
-
Description: 
This ticket adds two new streaming expressions: *features* and *train*

These two functions work together to train a logistic regression model on text, 
from a training set stored in a SolrCloud collection.

The syntax is as follows:

{code}
train(collection1, q="*:*",
  features(collection1, 
   q="*:*",  
   field="body", 
   outcome="out_i", 
   positiveLabel=1, 
   numTerms=100),
  field="body",
  outcome="out_i",
  maxIterations=100)
{code}

The *features* function extracts the feature terms from a training set using 
*information gain* to score the terms. 
http://www.jiliang.xyz/publication/feature_selection_for_classification.pdf

The *train* function uses the extracted features to train a logistic regression 
model on a text field in the training set.

For both *features* and *train* the training set is defined by a query. The doc 
vectors in the *train* function use tf-idf to represent the terms in the 
document. The idf is calculated for the specific training set, so multiple 
training sets can be stored in the same collection without polluting the idf. 

In the *train* function a batch gradient descent approach is used to 
iteratively train the model.

Both the *features* and the *train* function are embedded in Solr using the 
AnalyticsQuery framework. So only the model is transported across the network 
with each iteration.

Both the features and the models can be stored in a SolrCloud collection. Using 
this approach Solr can hold millions of models which can be selectively 
deployed.

Below is the final iteration of a model trained on the Enron Ham/Spam dataset. 
The model includes the terms and their idfs and weights as well as a 
classification evaluation describing the accuracy of the model on the training set. 

{code}
{
"idfs_ds": [1.2627703388716238, 1.2043595767152093, 
1.3886172425360304, 1.5488587854881268, 1.6127302558747882, 2.1359177807201526, 
1.514866246141212, 1.7375701403808523, 1.6166175299631897, 1.756428159015249, 
1.7929202354640175, 1.2834893120635762, 1.899442866302021, 1.8639061320252337, 
1.7631697575821685, 1.6820002892260415, 1.4411352768194767, 2.103708877350535, 
1.2225773869965861, 2.208893321170597, 1.878981794430681, 2.043737027506736, 
2.2819184561854864, 2.3264563106163885, 1.9336117619172708, 2.0467265663551024, 
1.7386696457142692, 2.468795829515302, 2.069437610615317, 2.6294363202479327, 
3.7388303845193307, 2.5446615802900157, 1.7430797961918219, 3.0787440662202736, 
1.9579702057493114, 2.289523055570706, 1.5362003886162032, 2.7549569891263763, 
3.955894889757158, 2.587435396273302, 3.945844553903657, 1.003513057076781, 
3.0416264032637708, 2.248395764146843, 4.018415246738492, 2.2876164773001246, 
3.3636289340509933, 1.2438124251270097, 2.733903579928544, 3.439026951535205, 
0.6709665389201712, 0.9546224358275518, 2.8080115520822657, 2.477970205791343, 
2.2631561797299637, 3.2378087608499606, 0.36177021415584676, 
4.1083634834014315, 4.120197941048435, 2.471081544796158, 2.424147775633, 
2.92339362620, 2.9269972337044097, 3.2987413118451183, 2.383498249003407, 
4.168988105217867, 2.877691472720256, 4.233526626355437, 3.8505343740993316, 
2.3264563106163885, 2.6429318017228174, 4.260555298743357, 3.0058372954121855, 
3.8688835127675283, 3.021585652380325, 3.0295538220295017, 1.9620882623582288, 
3.469610374907285, 3.945844553903657, 3.4821105376715167, 4.3169082352944885, 
2.520329479630485, 3.609372317282444, 3.070375816549757, 4.220281399605417, 
3.985484239117, 3.6165408067610563, 3.788840805093992, 4.392131656532076, 
4.392131656532076, 2.837281934382379, 3.698984475972131, 4.331507034715641, 
2.360699334038601, 2.7368842080666815, 3.730733174286711, 3.1991566064156816, 
4.4238803548466565, 2.4665153268165767, 3.175736332207583, 3.2378087608499606, 
4.376627469996111, 3.3525177086259226, 3.28315658082842, 4.156565585219309, 
1.6462639699299098, 2.673278958112109, 4.331507034715641, 3.955894889757158, 
2.7764631943473397, 3.0497565293470212, 1.79060004880832, 3.6237610547345436, 
1.6244377066690232, 2.948895919012047, 3.175736332207583, 2.850571166501062, 
4.073677925413541, 2.725014632511298, 3.1573871935393867, 4.562030693327474, 
3.5403794457954922, 4.580722826339627, 4.580722826339627, 3.189722574182323, 
3.1665196771026594, 3.3306589148134234, 1.9745451708435238, 3.3306589148134234, 
2.795272526304836, 3.3415285870503273, 4.407880013500216, 4.4238803548466565, 
2.6902285164258823, 3.668212817305377, 4.543681554659277, 2.559550192783766, 
1.5452257206382456, 2.2631561797299637, 4.659194441781121, 3.2678110111537597, 
3.878185905429842, 3.3525177086259226, 3.374865007317919, 3.780330115426083, 
4.376627469996111, 3.433020927474993, 

[jira] [Updated] (SOLR-9252) Feature selection and logistic regression on text

2016-08-04 Thread Joel Bernstein (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-9252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joel Bernstein updated SOLR-9252:
-
Description: 
This ticket adds two new streaming expressions: *features* and *train*

These two functions work together to train a logistic regression model on text, 
from a training set stored in a SolrCloud collection.

The syntax is as follows:

{code}
train(collection1, q="*:*",
  features(collection1, 
   q="*:*",  
   field="tv_text", 
   outcome="out_i", 
   positiveLabel=1, 
   numTerms=100),
  field="tv_text",
  outcome="out_i",
  maxIterations=100)
{code}

The *features* function extracts the feature terms from a training set using 
*information gain* to score the terms. 
http://www.jiliang.xyz/publication/feature_selection_for_classification.pdf

The *train* function uses the extracted features to train a logistic regression 
model on a text field in the training set.

For both *features* and *train* the training set is defined by a query. The doc 
vectors in the *train* function use tf-idf to represent the terms in the 
document. The idf is calculated for the specific training set, so multiple 
training sets can be stored in the same collection without polluting the idf. 

In the *train* function a batch gradient descent approach is used to 
iteratively train the model.

Both the *features* and the *train* function are embedded in Solr using the 
AnalyticsQuery framework. So only the model is transported across the network 
with each iteration.

Both the features and the models can be stored in a SolrCloud collection. Using 
this approach Solr can hold millions of models which can be selectively 
deployed.

Below is the final iteration of a model trained on the Enron Ham/Spam dataset. 
The model includes the terms and their idfs and weights as well as a 
classification evaluation describing the accuracy of the model on the training set. 

{code}
{
"idfs_ds": [1.2627703388716238, 1.2043595767152093, 
1.3886172425360304, 1.5488587854881268, 1.6127302558747882, 2.1359177807201526, 
1.514866246141212, 1.7375701403808523, 1.6166175299631897, 1.756428159015249, 
1.7929202354640175, 1.2834893120635762, 1.899442866302021, 1.8639061320252337, 
1.7631697575821685, 1.6820002892260415, 1.4411352768194767, 2.103708877350535, 
1.2225773869965861, 2.208893321170597, 1.878981794430681, 2.043737027506736, 
2.2819184561854864, 2.3264563106163885, 1.9336117619172708, 2.0467265663551024, 
1.7386696457142692, 2.468795829515302, 2.069437610615317, 2.6294363202479327, 
3.7388303845193307, 2.5446615802900157, 1.7430797961918219, 3.0787440662202736, 
1.9579702057493114, 2.289523055570706, 1.5362003886162032, 2.7549569891263763, 
3.955894889757158, 2.587435396273302, 3.945844553903657, 1.003513057076781, 
3.0416264032637708, 2.248395764146843, 4.018415246738492, 2.2876164773001246, 
3.3636289340509933, 1.2438124251270097, 2.733903579928544, 3.439026951535205, 
0.6709665389201712, 0.9546224358275518, 2.8080115520822657, 2.477970205791343, 
2.2631561797299637, 3.2378087608499606, 0.36177021415584676, 
4.1083634834014315, 4.120197941048435, 2.471081544796158, 2.424147775633, 
2.92339362620, 2.9269972337044097, 3.2987413118451183, 2.383498249003407, 
4.168988105217867, 2.877691472720256, 4.233526626355437, 3.8505343740993316, 
2.3264563106163885, 2.6429318017228174, 4.260555298743357, 3.0058372954121855, 
3.8688835127675283, 3.021585652380325, 3.0295538220295017, 1.9620882623582288, 
3.469610374907285, 3.945844553903657, 3.4821105376715167, 4.3169082352944885, 
2.520329479630485, 3.609372317282444, 3.070375816549757, 4.220281399605417, 
3.985484239117, 3.6165408067610563, 3.788840805093992, 4.392131656532076, 
4.392131656532076, 2.837281934382379, 3.698984475972131, 4.331507034715641, 
2.360699334038601, 2.7368842080666815, 3.730733174286711, 3.1991566064156816, 
4.4238803548466565, 2.4665153268165767, 3.175736332207583, 3.2378087608499606, 
4.376627469996111, 3.3525177086259226, 3.28315658082842, 4.156565585219309, 
1.6462639699299098, 2.673278958112109, 4.331507034715641, 3.955894889757158, 
2.7764631943473397, 3.0497565293470212, 1.79060004880832, 3.6237610547345436, 
1.6244377066690232, 2.948895919012047, 3.175736332207583, 2.850571166501062, 
4.073677925413541, 2.725014632511298, 3.1573871935393867, 4.562030693327474, 
3.5403794457954922, 4.580722826339627, 4.580722826339627, 3.189722574182323, 
3.1665196771026594, 3.3306589148134234, 1.9745451708435238, 3.3306589148134234, 
2.795272526304836, 3.3415285870503273, 4.407880013500216, 4.4238803548466565, 
2.6902285164258823, 3.668212817305377, 4.543681554659277, 2.559550192783766, 
1.5452257206382456, 2.2631561797299637, 4.659194441781121, 3.2678110111537597, 
3.878185905429842, 3.3525177086259226, 3.374865007317919, 3.780330115426083, 
4.376627469996111, 

[jira] [Updated] (SOLR-9252) Feature selection and logistic regression on text

2016-08-04 Thread Joel Bernstein (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-9252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joel Bernstein updated SOLR-9252:
-
Description: 
This ticket adds two new streaming expressions: *features* and *train*

These two functions work together to train a logistic regression model on text, 
from a training set stored in a SolrCloud collection.

The syntax is as follows:

{code}
train(collection1, q="*:*",
  features(collection1, 
   q="*:*",  
   field="tv_text", 
   outcome="out_i", 
   positiveLabel=1, 
   numTerms=100),
  field="tv_text",
  outcome="out_i",
  maxIterations=100)
{code}

The *features* function extracts the feature terms from a training set using 
*information gain* to score the terms. 
http://www.jiliang.xyz/publication/feature_selection_for_classification.pdf

The *train* function uses the extracted features to train a logistic regression 
model on a text field in the training set.

For both *features* and *train* the training set is defined by a query. The doc 
vectors in the *train* function use tf-idf to represent the terms in the 
document. The idf is calculated for the specific training set, so multiple 
training sets can be stored in the same collection without polluting the idf. 

In the *train* function a batch gradient descent approach is used to 
iteratively train the model.

Both the *features* and the *train* function are embedded in Solr using the 
AnalyticsQuery framework. So only the model is transported across the network 
with each iteration.

Both the features and the models can be stored in a SolrCloud collection. Using 
this approach Solr can hold millions of models which can be selectively 
deployed.

Below is the final iteration of a model trained on the Enron Ham/Spam dataset. 
The model includes the terms and their idfs and weights as well as a 
classification evaluation describing the accuracy of the model on the training set. 

{code}
{
"idfs_ds": [1.2627703388716238, 1.2043595767152093, 
1.3886172425360304, 1.5488587854881268, 1.6127302558747882, 2.1359177807201526, 
1.514866246141212, 1.7375701403808523, 1.6166175299631897, 1.756428159015249, 
1.7929202354640175, 1.2834893120635762, 1.899442866302021, 1.8639061320252337, 
1.7631697575821685, 1.6820002892260415, 1.4411352768194767, 2.103708877350535, 
1.2225773869965861, 2.208893321170597, 1.878981794430681, 2.043737027506736, 
2.2819184561854864, 2.3264563106163885, 1.9336117619172708, 2.0467265663551024, 
1.7386696457142692, 2.468795829515302, 2.069437610615317, 2.6294363202479327, 
3.7388303845193307, 2.5446615802900157, 1.7430797961918219, 3.0787440662202736, 
1.9579702057493114, 2.289523055570706, 1.5362003886162032, 2.7549569891263763, 
3.955894889757158, 2.587435396273302, 3.945844553903657, 1.003513057076781, 
3.0416264032637708, 2.248395764146843, 4.018415246738492, 2.2876164773001246, 
3.3636289340509933, 1.2438124251270097, 2.733903579928544, 3.439026951535205, 
0.6709665389201712, 0.9546224358275518, 2.8080115520822657, 2.477970205791343, 
2.2631561797299637, 3.2378087608499606, 0.36177021415584676, 
4.1083634834014315, 4.120197941048435, 2.471081544796158, 2.424147775633, 
2.92339362620, 2.9269972337044097, 3.2987413118451183, 2.383498249003407, 
4.168988105217867, 2.877691472720256, 4.233526626355437, 3.8505343740993316, 
2.3264563106163885, 2.6429318017228174, 4.260555298743357, 3.0058372954121855, 
3.8688835127675283, 3.021585652380325, 3.0295538220295017, 1.9620882623582288, 
3.469610374907285, 3.945844553903657, 3.4821105376715167, 4.3169082352944885, 
2.520329479630485, 3.609372317282444, 3.070375816549757, 4.220281399605417, 
3.985484239117, 3.6165408067610563, 3.788840805093992, 4.392131656532076, 
4.392131656532076, 2.837281934382379, 3.698984475972131, 4.331507034715641, 
2.360699334038601, 2.7368842080666815, 3.730733174286711, 3.1991566064156816, 
4.4238803548466565, 2.4665153268165767, 3.175736332207583, 3.2378087608499606, 
4.376627469996111, 3.3525177086259226, 3.28315658082842, 4.156565585219309, 
1.6462639699299098, 2.673278958112109, 4.331507034715641, 3.955894889757158, 
2.7764631943473397, 3.0497565293470212, 1.79060004880832, 3.6237610547345436, 
1.6244377066690232, 2.948895919012047, 3.175736332207583, 2.850571166501062, 
4.073677925413541, 2.725014632511298, 3.1573871935393867, 4.562030693327474, 
3.5403794457954922, 4.580722826339627, 4.580722826339627, 3.189722574182323, 
3.1665196771026594, 3.3306589148134234, 1.9745451708435238, 3.3306589148134234, 
2.795272526304836, 3.3415285870503273, 4.407880013500216, 4.4238803548466565, 
2.6902285164258823, 3.668212817305377, 4.543681554659277, 2.559550192783766, 
1.5452257206382456, 2.2631561797299637, 4.659194441781121, 3.2678110111537597, 
3.878185905429842, 3.3525177086259226, 3.374865007317919, 3.780330115426083, 
4.376627469996111, 

[jira] [Updated] (SOLR-9252) Feature selection and logistic regression on text

2016-08-04 Thread Joel Bernstein (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-9252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joel Bernstein updated SOLR-9252:
-
Description: 
This ticket adds two new streaming expressions: *features* and *train*

These two functions work together to train a logistic regression model on text, 
from a training set stored in a SolrCloud collection.

The syntax is as follows:

{code}
train(collection1, q="*:*",
  features(collection1, 
   q="*:*",  
   field="tv_text", 
   outcome="out_i", 
   positiveLabel=1, 
   numTerms=100),
  field="tv_text",
  outcome="out_i",
  maxIterations=100)
{code}

The *features* function extracts the feature terms from a training set using 
*information gain* to score the terms. 
http://www.jiliang.xyz/publication/feature_selection_for_classification.pdf

The *train* function uses the extracted features to train a logistic regression 
model on a text field in the training set.

For both *features* and *train* the training set is defined by a query. The doc 
vectors in the *train* function use tf-idf to represent the terms in the document. The 
idf is calculated for the specific training set, so multiple training sets can 
be stored in the same collection without polluting the idf. 

In the *train* function a batch gradient descent approach is used to 
iteratively train the model.

Both the *features* and the *train* function are embedded in Solr using the 
AnalyticsQuery framework. So only the model is transported across the network 
with each iteration.

Both the features and the models can be stored in a SolrCloud collection. Using 
this approach Solr can hold millions of models which can be selectively 
deployed.

Below is the final iteration of a model trained on the Enron Ham/Spam dataset. 
The model includes the terms and their idfs and weights as well as a 
classification evaluation describing the accuracy of the model on the training set. 

{code}
{
"idfs_ds": [1.2627703388716238, 1.2043595767152093, 
1.3886172425360304, 1.5488587854881268, 1.6127302558747882, 2.1359177807201526, 
1.514866246141212, 1.7375701403808523, 1.6166175299631897, 1.756428159015249, 
1.7929202354640175, 1.2834893120635762, 1.899442866302021, 1.8639061320252337, 
1.7631697575821685, 1.6820002892260415, 1.4411352768194767, 2.103708877350535, 
1.2225773869965861, 2.208893321170597, 1.878981794430681, 2.043737027506736, 
2.2819184561854864, 2.3264563106163885, 1.9336117619172708, 2.0467265663551024, 
1.7386696457142692, 2.468795829515302, 2.069437610615317, 2.6294363202479327, 
3.7388303845193307, 2.5446615802900157, 1.7430797961918219, 3.0787440662202736, 
1.9579702057493114, 2.289523055570706, 1.5362003886162032, 2.7549569891263763, 
3.955894889757158, 2.587435396273302, 3.945844553903657, 1.003513057076781, 
3.0416264032637708, 2.248395764146843, 4.018415246738492, 2.2876164773001246, 
3.3636289340509933, 1.2438124251270097, 2.733903579928544, 3.439026951535205, 
0.6709665389201712, 0.9546224358275518, 2.8080115520822657, 2.477970205791343, 
2.2631561797299637, 3.2378087608499606, 0.36177021415584676, 
4.1083634834014315, 4.120197941048435, 2.471081544796158, 2.424147775633, 
2.92339362620, 2.9269972337044097, 3.2987413118451183, 2.383498249003407, 
4.168988105217867, 2.877691472720256, 4.233526626355437, 3.8505343740993316, 
2.3264563106163885, 2.6429318017228174, 4.260555298743357, 3.0058372954121855, 
3.8688835127675283, 3.021585652380325, 3.0295538220295017, 1.9620882623582288, 
3.469610374907285, 3.945844553903657, 3.4821105376715167, 4.3169082352944885, 
2.520329479630485, 3.609372317282444, 3.070375816549757, 4.220281399605417, 
3.985484239117, 3.6165408067610563, 3.788840805093992, 4.392131656532076, 
4.392131656532076, 2.837281934382379, 3.698984475972131, 4.331507034715641, 
2.360699334038601, 2.7368842080666815, 3.730733174286711, 3.1991566064156816, 
4.4238803548466565, 2.4665153268165767, 3.175736332207583, 3.2378087608499606, 
4.376627469996111, 3.3525177086259226, 3.28315658082842, 4.156565585219309, 
1.6462639699299098, 2.673278958112109, 4.331507034715641, 3.955894889757158, 
2.7764631943473397, 3.0497565293470212, 1.79060004880832, 3.6237610547345436, 
1.6244377066690232, 2.948895919012047, 3.175736332207583, 2.850571166501062, 
4.073677925413541, 2.725014632511298, 3.1573871935393867, 4.562030693327474, 
3.5403794457954922, 4.580722826339627, 4.580722826339627, 3.189722574182323, 
3.1665196771026594, 3.3306589148134234, 1.9745451708435238, 3.3306589148134234, 
2.795272526304836, 3.3415285870503273, 4.407880013500216, 4.4238803548466565, 
2.6902285164258823, 3.668212817305377, 4.543681554659277, 2.559550192783766, 
1.5452257206382456, 2.2631561797299637, 4.659194441781121, 3.2678110111537597, 
3.878185905429842, 3.3525177086259226, 3.374865007317919, 3.780330115426083, 
4.376627469996111, 3.433020927474993, 

[jira] [Updated] (SOLR-9252) Feature selection and logistic regression on text

2016-08-04 Thread Joel Bernstein (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-9252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joel Bernstein updated SOLR-9252:
-
Description: 
This ticket adds two new streaming expressions: *features* and *train*

These two functions work together to train a logistic regression model on text, 
from a training set stored in a SolrCloud collection.

The syntax is as follows:

{code}
train(collection1, q="*:*",
  features(collection1, 
   q="*:*",  
   field="tv_text", 
   outcome="out_i", 
   positiveLabel=1, 
   numTerms=100),
  field="tv_text",
  outcome="out_i",
  maxIterations=100)
{code}

The *features* function extracts the feature terms from a training set using 
*information gain* to score the terms. 
http://www.jiliang.xyz/publication/feature_selection_for_classification.pdf

The *train* function uses the extracted features to train a logistic regression 
model on a text field in the training set.

For both *features* and *train* the training set is defined by a query. The doc 
vectors in the *train* function use tf-idf to represent the terms in the document. The 
idf is calculated for the specific training set, so multiple training sets can 
be stored in the same collection without polluting the idf. 

In the *train* function a batch gradient descent approach is used to 
iteratively train the model.

Both the *features* and the *train* function are embedded in Solr using the 
AnalyticsQuery framework. So only the model is transported across the network 
with each iteration.

Both the features and the models can be stored in a SolrCloud collection. Using 
this approach Solr can hold millions of models which can be selectively 
deployed.










  was:
This ticket adds two new streaming expressions: *features* and *train*

These two functions work together to train a logistic regression model on text, 
from a training set stored in a SolrCloud collection.

The syntax is as follows:

{code}
train(collection1, q="*:*",
  features(collection1, 
   q="*:*",  
   field="tv_text", 
   outcome="out_i", 
   positiveLabel=1, 
   numTerms=100),
  field="tv_text",
  outcome="out_i",
  maxIterations=100)
{code}

The *features* function extracts the feature terms from a training set using 
*information gain* to score the terms. 
http://www.jiliang.xyz/publication/feature_selection_for_classification.pdf

The *train* function uses the extracted features to train a logistic regression 
model on a text field in the training set.

Both the features and the models can be stored in a SolrCloud collection. Using 
this approach Solr can hold millions of models which can be selectively 
deployed.










> Feature selection and logistic regression on text
> -
>
> Key: SOLR-9252
> URL: https://issues.apache.org/jira/browse/SOLR-9252
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: search, SolrCloud, SolrJ
>Reporter: Cao Manh Dat
>Assignee: Joel Bernstein
>  Labels: Streaming
> Fix For: 6.2
>
> Attachments: SOLR-9252.patch, SOLR-9252.patch, SOLR-9252.patch, 
> SOLR-9252.patch, SOLR-9252.patch, SOLR-9252.patch, SOLR-9252.patch, 
> SOLR-9252.patch, SOLR-9252.patch, SOLR-9252.patch
>
>
> This ticket adds two new streaming expressions: *features* and *train*
> These two functions work together to train a logistic regression model on 
> text, from a training set stored in a SolrCloud collection.
> The syntax is as follows:
> {code}
> train(collection1, q="*:*",
>   features(collection1, 
>q="*:*",  
>field="tv_text", 
>outcome="out_i", 
>positiveLabel=1, 
>numTerms=100),
>   field="tv_text",
>   outcome="out_i",
>   maxIterations=100)
> {code}
> The *features* function extracts the feature terms from a training set using 
> *information gain* to score the terms. 
> http://www.jiliang.xyz/publication/feature_selection_for_classification.pdf
> The *train* function uses the extracted features to train a logistic 
> regression model on a text field in the training set.
> For both *features* and *train* the training set is defined by a query. The 
> doc vectors in the *train* function use tf-idf to represent the terms in the document. 
> The idf is calculated for the specific training set, so multiple training 
> sets can be stored in the same collection without polluting the idf. 
> In the *train* function a batch gradient descent approach is used to 
> iteratively train the model.
> Both the *features* and the *train* function are embedded in Solr using the 
> AnalyticsQuery framework. So 

[jira] [Updated] (SOLR-9252) Feature selection and logistic regression on text

2016-08-04 Thread Joel Bernstein (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-9252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joel Bernstein updated SOLR-9252:
-
Description: 
This ticket adds two new streaming expressions: *features* and *train*

These two functions work together to train a logistic regression model on text, 
from a training set stored in a SolrCloud collection.

The syntax is as follows:

{code}
train(collection1, q="*:*",
  features(collection1, 
   q="*:*",  
   field="tv_text", 
   outcome="out_i", 
   positiveLabel=1, 
   numTerms=100),
  field="tv_text",
  outcome="out_i",
  maxIterations=100)
{code}

The *features* function extracts the feature terms from a training set using 
*information gain* to score the terms. 
http://www.jiliang.xyz/publication/feature_selection_for_classification.pdf

The *train* function uses the extracted features to train a logistic regression 
model on a text field in the training set.

Both the features and the models can be stored in a SolrCloud collection. Using 
this approach Solr can hold millions of models which can be selectively 
deployed.









  was:
This ticket adds two new streaming expressions *features* and *train*

These two functions work together to train a logistic regression model on text, 
from a training set stored in a SolrCloud collection.

The syntax is as follows:

{code}
train(collection1, q="*:*",
  features(collection1, 
   q="*:*",  
   field="tv_text", 
   outcome="out_i", 
   positiveLabel=1, 
   numTerms=100),
  field="tv_text",
  outcome="out_i",
  maxIterations=100)
{code}

The *features* function extracts the feature terms from a training set using 
*information gain* to score the terms. 
http://www.jiliang.xyz/publication/feature_selection_for_classification.pdf

The *train* function uses the extracted features to train a logistic regression 
model on a text field in the training set.

Both the features and the models can be stored in a SolrCloud collection. Using 
this approach Solr can hold millions of models which can be selectively 
deployed.










> Feature selection and logistic regression on text
> -
>
> Key: SOLR-9252
> URL: https://issues.apache.org/jira/browse/SOLR-9252
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: search, SolrCloud, SolrJ
>Reporter: Cao Manh Dat
>Assignee: Joel Bernstein
>  Labels: Streaming
> Fix For: 6.2
>
> Attachments: SOLR-9252.patch, SOLR-9252.patch, SOLR-9252.patch, 
> SOLR-9252.patch, SOLR-9252.patch, SOLR-9252.patch, SOLR-9252.patch, 
> SOLR-9252.patch, SOLR-9252.patch, SOLR-9252.patch
>
>
> This ticket adds two new streaming expressions: *features* and *train*
> These two functions work together to train a logistic regression model on 
> text, from a training set stored in a SolrCloud collection.
> The syntax is as follows:
> {code}
> train(collection1, q="*:*",
>   features(collection1, 
>q="*:*",  
>field="tv_text", 
>outcome="out_i", 
>positiveLabel=1, 
>numTerms=100),
>   field="tv_text",
>   outcome="out_i",
>   maxIterations=100)
> {code}
> The *features* function extracts the feature terms from a training set using 
> *information gain* to score the terms. 
> http://www.jiliang.xyz/publication/feature_selection_for_classification.pdf
> The *train* function uses the extracted features to train a logistic 
> regression model on a text field in the training set.
> Both the features and the models can be stored in a SolrCloud collection. 
> Using this approach Solr can hold millions of models which can be selectively 
> deployed.






[jira] [Updated] (SOLR-9252) Feature selection and logistic regression on text

2016-08-04 Thread Joel Bernstein (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-9252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joel Bernstein updated SOLR-9252:
-
Description: 
This ticket adds two new streaming expressions *features* and *train*

These two functions work together to train a logistic regression model on text, 
from a training set stored in a SolrCloud collection.

The syntax is as follows:

{code}
train(collection1, q="*:*",
  features(collection1, 
   q="*:*",  
   field="tv_text", 
   outcome="out_i", 
   positiveLabel=1, 
   numTerms=100),
  field="tv_text",
  outcome="out_i",
  maxIterations=100)
{code}

The *features* function extracts the feature terms from a training set using 
*information gain* to score the terms. 
http://www.jiliang.xyz/publication/feature_selection_for_classification.pdf

The *train* function uses the extracted features to train a logistic regression 
model on a text field in the training set.

Both the features and the models can be stored in a SolrCloud collection. Using 
this approach Solr can hold millions of models which can be selectively 
deployed.









  was:
SOLR-9186 came up with a challenge: for each iteration we have to rebuild 
the tf-idf vector for each document. This is a costly computation if we represent 
a document by a lot of terms. Feature selection can help reduce the computation.

Due to its computational efficiency and simple interpretation, information gain 
is one of the most popular feature selection methods. It is used to measure the 
dependence between features and labels and calculates the information gain 
between the i-th feature and the class labels 
(http://www.jiliang.xyz/publication/feature_selection_for_classification.pdf).
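
As a reference point, the information gain of a term can be computed as the entropy of the class labels minus the conditional entropy of the labels given the term's presence. A small illustrative sketch (hypothetical names, not the Solr implementation):

{code}
// Sketch: information gain of a single term for a binary outcome.
// Counts are assumed to be collected over the training set; names are hypothetical.
public class InfoGainSketch {

  // Binary entropy of a set with the given number of positives out of total.
  static double entropy(double positives, double total) {
    if (total == 0) return 0.0;
    double p = positives / total;
    double h = 0.0;
    if (p > 0) h -= p * (Math.log(p) / Math.log(2));
    if (p < 1) h -= (1 - p) * (Math.log(1 - p) / Math.log(2));
    return h;
  }

  /**
   * docsWithTerm / posWithTerm: docs containing the term, and how many are positive.
   * numDocs / numPositive: totals for the training set.
   */
  static double informationGain(int docsWithTerm, int posWithTerm,
                                int numDocs, int numPositive) {
    int docsWithoutTerm = numDocs - docsWithTerm;
    int posWithoutTerm = numPositive - posWithTerm;

    double hClass = entropy(numPositive, numDocs);
    double hGivenTerm =
        (docsWithTerm / (double) numDocs) * entropy(posWithTerm, docsWithTerm)
      + (docsWithoutTerm / (double) numDocs) * entropy(posWithoutTerm, docsWithoutTerm);

    // IG(term) = H(class) - H(class | term); higher means the term is more informative.
    return hClass - hGivenTerm;
  }
}
{code}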

I confirmed this by running logistic regression on the Enron mail dataset (in 
which each email is represented by the top 100 terms with the highest information 
gain) and got an accuracy of 92% and a precision of 82%.

This ticket will create two new streaming expressions. Both of them use the same 
*parallel iterative framework* as SOLR-8492.

{code}
featuresSelection(collection1, q="*:*",  field="tv_text", outcome="out_i", 
positiveLabel=1, numTerms=100)
{code}

featuresSelection will emit the top terms that have the highest information gain 
scores. It can be combined with the new tlogit stream.

{code}
tlogit(collection1, q="*:*",
 featuresSelection(collection1, 
  q="*:*",  
  field="tv_text", 
  outcome="out_i", 
  positiveLabel=1, 
  numTerms=100),
 field="tv_text",
 outcome="out_i",
 maxIterations=100)
{code}

In iteration n, the text logistic regression will emit the nth model and 
compute the error of the (n-1)th model, because the error would be wrong if we 
computed it dynamically within the same iteration. 
In each iteration tlogit will change the learning rate based on the error of the 
previous iteration. It will increase the learning rate by 5% if the error is going 
down and decrease it by 50% if the error is going up.
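
A tiny sketch of that learning rate schedule (illustrative only; the error passed in is that of the previous iteration's model, and the names are hypothetical):

{code}
// Sketch: adjust the learning rate from the previous iteration's error.
// Illustrative only; names are hypothetical, not from the Solr code.
public class LearningRateSketch {
  static double nextLearningRate(double rate, double previousError, double currentError) {
    if (currentError <= previousError) {
      return rate * 1.05;  // error going down: increase the learning rate by 5%
    }
    return rate * 0.5;     // error going up: cut the learning rate by 50%
  }
}
{code}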

This will support use cases such as building models for spam detection, 
sentiment analysis and threat detection. 


> Feature selection and logistic regression on text
> -
>
> Key: SOLR-9252
> URL: https://issues.apache.org/jira/browse/SOLR-9252
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: search, SolrCloud, SolrJ
>Reporter: Cao Manh Dat
>Assignee: Joel Bernstein
>  Labels: Streaming
> Fix For: 6.2
>
> Attachments: SOLR-9252.patch, SOLR-9252.patch, SOLR-9252.patch, 
> SOLR-9252.patch, SOLR-9252.patch, SOLR-9252.patch, SOLR-9252.patch, 
> SOLR-9252.patch, SOLR-9252.patch, SOLR-9252.patch
>
>
> This ticket adds two new streaming expressions *features* and *train*
> These two functions work together to train a logistic regression model on 
> text, from a training set stored in a SolrCloud collection.
> The syntax is as follows:
> {code}
> train(collection1, q="*:*",
>   features(collection1, 
>q="*:*",  
>field="tv_text", 
>outcome="out_i", 
>positiveLabel=1, 
>numTerms=100),
>   field="tv_text",
>   outcome="out_i",
>   maxIterations=100)
> {code}
> The *features* function extracts the feature terms from a training set using 
> *information gain* to score the terms. 
> http://www.jiliang.xyz/publication/feature_selection_for_classification.pdf
> The *train* function uses the extracted features to train 

[jira] [Updated] (SOLR-9252) Feature selection and logistic regression on text

2016-08-04 Thread Joel Bernstein (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-9252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joel Bernstein updated SOLR-9252:
-
Description: 
This ticket adds two new streaming expressions *features* and *train*

These two functions work together to train a logistic regression model on text, 
from a training set stored in a SolrCloud collection.

The syntax is as follows:

{code}
train(collection1, q="*:*",
  features(collection1, 
   q="*:*",  
   field="tv_text", 
   outcome="out_i", 
   positiveLabel=1, 
   numTerms=100),
  field="tv_text",
  outcome="out_i",
  maxIterations=100)
{code}

The *features* function extracts the feature terms from a training set using 
*information gain* to score the terms. 
http://www.jiliang.xyz/publication/feature_selection_for_classification.pdf

The *train* function uses the extracted features to train a logistic regression 
model on a text field in the training set.

Both the features and the models can be stored in a SolrCloud collection. Using 
this approach Solr can hold millions of models which can be selectively 
deployed.









  was:
This ticket adds two new streaming expressions *features* and *train*

These two functions work together to train a logistic regression model on text, 
from a training set stored in a SolrCloud collection.

The syntax is as follows:

{code}
train(collection1, q="*:*",
  features(collection1, 
   q="*:*",  
   field="tv_text", 
   outcome="out_i", 
   positiveLabel=1, 
   numTerms=100),
  field="tv_text",
  outcome="out_i",
  maxIterations=100)
{code}

The *features* function extracts the feature terms from a training set using 
*information gain* to score the terms. 
http://www.jiliang.xyz/publication/feature_selection_for_classification.pdf

The *train* function uses the extracted features to train a logistic regression 
model on a text field in the training set.

Both the features and the models can be stored in a SolrCloud collection. Using 
this approach Solr can hold millions of models which can be selectively 
deployed.










> Feature selection and logistic regression on text
> -
>
> Key: SOLR-9252
> URL: https://issues.apache.org/jira/browse/SOLR-9252
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: search, SolrCloud, SolrJ
>Reporter: Cao Manh Dat
>Assignee: Joel Bernstein
>  Labels: Streaming
> Fix For: 6.2
>
> Attachments: SOLR-9252.patch, SOLR-9252.patch, SOLR-9252.patch, 
> SOLR-9252.patch, SOLR-9252.patch, SOLR-9252.patch, SOLR-9252.patch, 
> SOLR-9252.patch, SOLR-9252.patch, SOLR-9252.patch
>
>
> This ticket adds two new streaming expressions *features* and *train*
> These two functions work together to train a logistic regression model on 
> text, from a training set stored in a SolrCloud collection.
> The syntax is as follows:
> {code}
> train(collection1, q="*:*",
>   features(collection1, 
>q="*:*",  
>field="tv_text", 
>outcome="out_i", 
>positiveLabel=1, 
>numTerms=100),
>   field="tv_text",
>   outcome="out_i",
>   maxIterations=100)
> {code}
> The *features* function extracts the feature terms from a training set using 
> *information gain* to score the terms. 
> http://www.jiliang.xyz/publication/feature_selection_for_classification.pdf
> The *train* function uses the extracted features to train a logistic 
> regression model on a text field in the training set.
> Both the features and the models can be stored in a SolrCloud collection. 
> Using this approach Solr can hold millions of models which can be selectively 
> deployed.






[jira] [Updated] (SOLR-9252) Feature selection and logistic regression on text

2016-08-03 Thread Joel Bernstein (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-9252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joel Bernstein updated SOLR-9252:
-
Attachment: (was: enron1.zip)

> Feature selection and logistic regression on text
> -
>
> Key: SOLR-9252
> URL: https://issues.apache.org/jira/browse/SOLR-9252
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: search, SolrCloud, SolrJ
>Reporter: Cao Manh Dat
>Assignee: Joel Bernstein
>  Labels: Streaming
> Fix For: 6.2
>
> Attachments: SOLR-9252.patch, SOLR-9252.patch, SOLR-9252.patch, 
> SOLR-9252.patch, SOLR-9252.patch, SOLR-9252.patch, SOLR-9252.patch, 
> SOLR-9252.patch, SOLR-9252.patch, SOLR-9252.patch
>
>
> SOLR-9186 came up with a challenge: for each iteration we have to 
> rebuild the tf-idf vector for each document. This is a costly computation if we 
> represent a document by a lot of terms. Feature selection can help reduce the 
> computation.
> Due to its computational efficiency and simple interpretation, information 
> gain is one of the most popular feature selection methods. It is used to 
> measure the dependence between features and labels and calculates the 
> information gain between the i-th feature and the class labels 
> (http://www.jiliang.xyz/publication/feature_selection_for_classification.pdf).
> I confirmed this by running logistic regression on the Enron mail dataset (in 
> which each email is represented by the top 100 terms with the highest 
> information gain) and got an accuracy of 92% and a precision of 82%.
> This ticket will create two new streaming expressions. Both of them use the 
> same *parallel iterative framework* as SOLR-8492.
> {code}
> featuresSelection(collection1, q="*:*",  field="tv_text", outcome="out_i", 
> positiveLabel=1, numTerms=100)
> {code}
> featuresSelection will emit the top terms that have the highest information gain 
> scores. It can be combined with the new tlogit stream.
> {code}
> tlogit(collection1, q="*:*",
>  featuresSelection(collection1, 
>   q="*:*",  
>   field="tv_text", 
>   outcome="out_i", 
>   positiveLabel=1, 
>   numTerms=100),
>  field="tv_text",
>  outcome="out_i",
>  maxIterations=100)
> {code}
> In iteration n, the text logistic regression will emit the nth model and 
> compute the error of the (n-1)th model, because the error would be wrong if we 
> computed it dynamically within the same iteration. 
> In each iteration tlogit will change the learning rate based on the error of the 
> previous iteration. It will increase the learning rate by 5% if the error is 
> going down and decrease it by 50% if the error is going up.
> This will support use cases such as building models for spam detection, 
> sentiment analysis and threat detection. 






[jira] [Updated] (SOLR-9252) Feature selection and logistic regression on text

2016-08-03 Thread Joel Bernstein (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-9252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joel Bernstein updated SOLR-9252:
-
Component/s: SolrCloud
 search

> Feature selection and logistic regression on text
> -
>
> Key: SOLR-9252
> URL: https://issues.apache.org/jira/browse/SOLR-9252
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: search, SolrCloud, SolrJ
>Reporter: Cao Manh Dat
>Assignee: Joel Bernstein
>  Labels: Streaming
> Fix For: 6.2
>
> Attachments: SOLR-9252.patch, SOLR-9252.patch, SOLR-9252.patch, 
> SOLR-9252.patch, SOLR-9252.patch, SOLR-9252.patch, SOLR-9252.patch, 
> SOLR-9252.patch, SOLR-9252.patch, SOLR-9252.patch, enron1.zip
>
>
> SOLR-9186 came up with a challenge: for each iteration we have to 
> rebuild the tf-idf vector for each document. This is a costly computation if we 
> represent a document by a lot of terms. Feature selection can help reduce the 
> computation.
> Due to its computational efficiency and simple interpretation, information 
> gain is one of the most popular feature selection methods. It is used to 
> measure the dependence between features and labels and calculates the 
> information gain between the i-th feature and the class labels 
> (http://www.jiliang.xyz/publication/feature_selection_for_classification.pdf).
> I confirmed this by running logistic regression on the Enron mail dataset (in 
> which each email is represented by the top 100 terms with the highest 
> information gain) and got an accuracy of 92% and a precision of 82%.
> This ticket will create two new streaming expressions. Both of them use the 
> same *parallel iterative framework* as SOLR-8492.
> {code}
> featuresSelection(collection1, q="*:*",  field="tv_text", outcome="out_i", 
> positiveLabel=1, numTerms=100)
> {code}
> featuresSelection will emit the top terms that have the highest information gain 
> scores. It can be combined with the new tlogit stream.
> {code}
> tlogit(collection1, q="*:*",
>  featuresSelection(collection1, 
>   q="*:*",  
>   field="tv_text", 
>   outcome="out_i", 
>   positiveLabel=1, 
>   numTerms=100),
>  field="tv_text",
>  outcome="out_i",
>  maxIterations=100)
> {code}
> In iteration n, the text logistic regression will emit the nth model and 
> compute the error of the (n-1)th model, because the error would be wrong if we 
> computed it dynamically within the same iteration. 
> In each iteration tlogit adjusts the learning rate based on the error of the 
> previous iteration: it increases the learning rate by 5% if the error is going 
> down and decreases it by 50% if the error is going up.
> This will support use cases such as building models for spam detection, 
> sentiment analysis and threat detection. 
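
For readers who want the scoring spelled out, featuresSelection conceptually 
ranks terms by IG(term) = H(class) - H(class | term). Below is a self-contained 
sketch of that calculation; the counts and the binary class split are 
assumptions for illustration, and this is not the FeaturesSelectionStream code:

{code}
/**
 * Sketch of information-gain scoring for a term against a binary outcome:
 * IG(term) = H(class) - H(class | term). Counts are assumed to come from
 * the training set (e.g. doc frequencies per class); names are illustrative.
 */
public class InformationGain {

    private static double entropy(double... probs) {
        double h = 0.0;
        for (double p : probs) {
            if (p > 0.0) {
                h -= p * (Math.log(p) / Math.log(2));
            }
        }
        return h;
    }

    /**
     * @param posDocs     documents with the positive outcome
     * @param negDocs     documents with the negative outcome
     * @param termPosDocs positive documents containing the term
     * @param termNegDocs negative documents containing the term
     */
    public static double score(int posDocs, int negDocs, int termPosDocs, int termNegDocs) {
        double total = posDocs + negDocs;
        double hClass = entropy(posDocs / total, negDocs / total);

        double withTerm = termPosDocs + termNegDocs;
        double withoutTerm = total - withTerm;
        double hWith = withTerm == 0 ? 0.0
            : entropy(termPosDocs / withTerm, termNegDocs / withTerm);
        double hWithout = withoutTerm == 0 ? 0.0
            : entropy((posDocs - termPosDocs) / withoutTerm, (negDocs - termNegDocs) / withoutTerm);

        // conditional entropy, weighted by how many docs do / don't contain the term
        double hClassGivenTerm = (withTerm / total) * hWith + (withoutTerm / total) * hWithout;
        return hClass - hClassGivenTerm;
    }
}
{code}

Terms would then be sorted by this score and the top numTerms kept, which is 
what the expression above requests.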






[jira] [Updated] (SOLR-9252) Feature selection and logistic regression on text

2016-08-03 Thread Joel Bernstein (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-9252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joel Bernstein updated SOLR-9252:
-
Fix Version/s: 6.2

> Feature selection and logistic regression on text
> -
>
> Key: SOLR-9252
> URL: https://issues.apache.org/jira/browse/SOLR-9252
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: search, SolrCloud, SolrJ
>Reporter: Cao Manh Dat
>Assignee: Joel Bernstein
>  Labels: Streaming
> Fix For: 6.2
>
> Attachments: SOLR-9252.patch, SOLR-9252.patch, SOLR-9252.patch, 
> SOLR-9252.patch, SOLR-9252.patch, SOLR-9252.patch, SOLR-9252.patch, 
> SOLR-9252.patch, SOLR-9252.patch, SOLR-9252.patch, enron1.zip
>
>



[jira] [Updated] (SOLR-9252) Feature selection and logistic regression on text

2016-08-03 Thread Joel Bernstein (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-9252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joel Bernstein updated SOLR-9252:
-
Component/s: SolrJ

> Feature selection and logistic regression on text
> -
>
> Key: SOLR-9252
> URL: https://issues.apache.org/jira/browse/SOLR-9252
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: search, SolrCloud, SolrJ
>Reporter: Cao Manh Dat
>Assignee: Joel Bernstein
>  Labels: Streaming
> Fix For: 6.2
>
> Attachments: SOLR-9252.patch, SOLR-9252.patch, SOLR-9252.patch, 
> SOLR-9252.patch, SOLR-9252.patch, SOLR-9252.patch, SOLR-9252.patch, 
> SOLR-9252.patch, SOLR-9252.patch, SOLR-9252.patch, enron1.zip
>
>



[jira] [Updated] (SOLR-9252) Feature selection and logistic regression on text

2016-08-03 Thread Joel Bernstein (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-9252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joel Bernstein updated SOLR-9252:
-
Labels: Streaming  (was: )

> Feature selection and logistic regression on text
> -
>
> Key: SOLR-9252
> URL: https://issues.apache.org/jira/browse/SOLR-9252
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: search, SolrCloud, SolrJ
>Reporter: Cao Manh Dat
>Assignee: Joel Bernstein
>  Labels: Streaming
> Fix For: 6.2
>
> Attachments: SOLR-9252.patch, SOLR-9252.patch, SOLR-9252.patch, 
> SOLR-9252.patch, SOLR-9252.patch, SOLR-9252.patch, SOLR-9252.patch, 
> SOLR-9252.patch, SOLR-9252.patch, SOLR-9252.patch, enron1.zip
>
>



[jira] [Updated] (SOLR-9252) Feature selection and logistic regression on text

2016-08-02 Thread Cao Manh Dat (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-9252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cao Manh Dat updated SOLR-9252:
---
Attachment: SOLR-9252.patch

Updated patch which corrects the test for textLogitStream.

[~joel.bernstein] In this patch, the testRecord is built from a string.

> Feature selection and logistic regression on text
> -
>
> Key: SOLR-9252
> URL: https://issues.apache.org/jira/browse/SOLR-9252
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Cao Manh Dat
>Assignee: Joel Bernstein
> Attachments: SOLR-9252.patch, SOLR-9252.patch, SOLR-9252.patch, 
> SOLR-9252.patch, SOLR-9252.patch, SOLR-9252.patch, SOLR-9252.patch, 
> SOLR-9252.patch, SOLR-9252.patch, SOLR-9252.patch, enron1.zip
>
>



[jira] [Updated] (SOLR-9252) Feature selection and logistic regression on text

2016-08-02 Thread Joel Bernstein (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-9252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joel Bernstein updated SOLR-9252:
-
Attachment: SOLR-9252.patch

New patch that asserts the order of terms in the feature selection test.

Also removes the terms parameter from the TextLogitStream and requires a 
features stream.

> Feature selection and logistic regression on text
> -
>
> Key: SOLR-9252
> URL: https://issues.apache.org/jira/browse/SOLR-9252
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Cao Manh Dat
>Assignee: Joel Bernstein
> Attachments: SOLR-9252.patch, SOLR-9252.patch, SOLR-9252.patch, 
> SOLR-9252.patch, SOLR-9252.patch, SOLR-9252.patch, SOLR-9252.patch, 
> SOLR-9252.patch, SOLR-9252.patch, enron1.zip
>
>



[jira] [Updated] (SOLR-9252) Feature selection and logistic regression on text

2016-08-01 Thread Joel Bernstein (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-9252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joel Bernstein updated SOLR-9252:
-
Attachment: SOLR-9252.patch

New patch adding the idfs to the features.
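
Shipping the idfs with the selected features lets the training side build its 
tf-idf document vectors without another pass over the index. A rough sketch of 
that vector construction, assuming a plain tf * idf weighting (the patch's 
exact weighting may differ):

{code}
import java.util.List;
import java.util.Map;

/**
 * Sketch: build a tf-idf vector for one document over the selected feature
 * terms, given the per-term idfs shipped with the features. A plain tf * idf
 * weighting is assumed here; the patch's exact formula may differ.
 */
public class DocVector {

    public static double[] tfIdf(Map<String, Integer> termCounts,   // term -> tf in this doc
                                 List<String> features,             // selected terms, fixed order
                                 double[] idfs) {                   // idf per feature, same order
        double[] vector = new double[features.size()];
        for (int i = 0; i < features.size(); i++) {
            Integer tf = termCounts.get(features.get(i));
            vector[i] = (tf == null) ? 0.0 : tf * idfs[i];
        }
        return vector;
    }
}
{code}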

> Feature selection and logistic regression on text
> -
>
> Key: SOLR-9252
> URL: https://issues.apache.org/jira/browse/SOLR-9252
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Cao Manh Dat
>Assignee: Joel Bernstein
> Attachments: SOLR-9252.patch, SOLR-9252.patch, SOLR-9252.patch, 
> SOLR-9252.patch, SOLR-9252.patch, SOLR-9252.patch, SOLR-9252.patch, 
> SOLR-9252.patch, enron1.zip
>
>



[jira] [Updated] (SOLR-9252) Feature selection and logistic regression on text

2016-08-01 Thread Joel Bernstein (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-9252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joel Bernstein updated SOLR-9252:
-
Attachment: (was: SOLR-9252.patch)

> Feature selection and logistic regression on text
> -
>
> Key: SOLR-9252
> URL: https://issues.apache.org/jira/browse/SOLR-9252
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Cao Manh Dat
>Assignee: Joel Bernstein
> Attachments: SOLR-9252.patch, SOLR-9252.patch, SOLR-9252.patch, 
> SOLR-9252.patch, SOLR-9252.patch, SOLR-9252.patch, SOLR-9252.patch, 
> SOLR-9252.patch, enron1.zip
>
>



[jira] [Updated] (SOLR-9252) Feature selection and logistic regression on text

2016-07-29 Thread Joel Bernstein (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-9252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joel Bernstein updated SOLR-9252:
-
Attachment: SOLR-9252.patch

New patch with all tests passing

> Feature selection and logistic regression on text
> -
>
> Key: SOLR-9252
> URL: https://issues.apache.org/jira/browse/SOLR-9252
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Cao Manh Dat
>Assignee: Joel Bernstein
> Attachments: SOLR-9252.patch, SOLR-9252.patch, SOLR-9252.patch, 
> SOLR-9252.patch, SOLR-9252.patch, SOLR-9252.patch, SOLR-9252.patch, 
> SOLR-9252.patch, enron1.zip
>
>



[jira] [Updated] (SOLR-9252) Feature selection and logistic regression on text

2016-07-26 Thread Cao Manh Dat (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-9252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cao Manh Dat updated SOLR-9252:
---
Attachment: SOLR-9252.patch

Updated patch. This patch changes a few points:
- Do the training in the finish() method. It's much faster than the previous 
approach (thanks [~joel.bernstein]).
- Rename *featuresSelection* to *features*.
- FeaturesSelectionStream now sums up the term scores from all shards (a small 
merging sketch follows below).
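
The cross-shard step in the last point is conceptually a map merge keyed by 
term. A minimal sketch, assuming each shard returns its own term-to-score map 
(the real stream works with tuples, so these names are illustrative):

{code}
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Sketch: sum per-shard information-gain scores into a single global score
 * per term. Each shard is assumed to return its own term -> score map.
 */
public class ShardScoreMerger {

    public static Map<String, Double> merge(List<Map<String, Double>> shardScores) {
        Map<String, Double> merged = new HashMap<>();
        for (Map<String, Double> shard : shardScores) {
            for (Map.Entry<String, Double> entry : shard.entrySet()) {
                merged.merge(entry.getKey(), entry.getValue(), Double::sum);
            }
        }
        return merged;   // rank by value and keep the top numTerms afterwards
    }
}
{code}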

> Feature selection and logistic regression on text
> -
>
> Key: SOLR-9252
> URL: https://issues.apache.org/jira/browse/SOLR-9252
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Cao Manh Dat
>Assignee: Joel Bernstein
> Attachments: SOLR-9252.patch, SOLR-9252.patch, SOLR-9252.patch, 
> SOLR-9252.patch, SOLR-9252.patch, SOLR-9252.patch, SOLR-9252.patch, enron1.zip
>
>



[jira] [Updated] (SOLR-9252) Feature selection and logistic regression on text

2016-07-20 Thread Cao Manh Dat (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-9252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cao Manh Dat updated SOLR-9252:
---
Attachment: SOLR-9252.patch

It turns out that the cause of the test failure is that we create a temporary 
collection and do not delete it, so an exception is thrown when another test 
tries to create the same temporary collection.
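
The usual fix is to drop the temporary collection in the test's cleanup path. A 
hedged sketch of such a cleanup, assuming a SolrJ CloudSolrClient is available 
to the test (this is not the actual patch code):

{code}
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.request.CollectionAdminRequest;

/** Sketch: delete the temporary collection so the next test can recreate it. */
public class TempCollectionCleanup {

    public static void deleteIfPresent(CloudSolrClient client, String collection) throws Exception {
        try {
            // best-effort cleanup; the collection may not exist in this test run
            CollectionAdminRequest.deleteCollection(collection).process(client);
        } catch (Exception e) {
            // ignore: nothing to delete
        }
    }
}
{code}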

> Feature selection and logistic regression on text
> -
>
> Key: SOLR-9252
> URL: https://issues.apache.org/jira/browse/SOLR-9252
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Cao Manh Dat
>Assignee: Joel Bernstein
> Attachments: SOLR-9252.patch, SOLR-9252.patch, SOLR-9252.patch, 
> SOLR-9252.patch, SOLR-9252.patch, SOLR-9252.patch, enron1.zip
>
>



[jira] [Updated] (SOLR-9252) Feature selection and logistic regression on text

2016-07-14 Thread Cao Manh Dat (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-9252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cao Manh Dat updated SOLR-9252:
---
Attachment: SOLR-9252.patch

Updated the stream expression test to an expression test.

> Feature selection and logistic regression on text
> -
>
> Key: SOLR-9252
> URL: https://issues.apache.org/jira/browse/SOLR-9252
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Cao Manh Dat
>Assignee: Joel Bernstein
> Attachments: SOLR-9252.patch, SOLR-9252.patch, SOLR-9252.patch, 
> SOLR-9252.patch, SOLR-9252.patch, enron1.zip
>
>



[jira] [Updated] (SOLR-9252) Feature selection and logistic regression on text

2016-07-12 Thread Cao Manh Dat (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-9252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cao Manh Dat updated SOLR-9252:
---
Attachment: SOLR-9252.patch

Updated patch based on [~joel.bernstein]'s feedback about numDocs().

> Feature selection and logistic regression on text
> -
>
> Key: SOLR-9252
> URL: https://issues.apache.org/jira/browse/SOLR-9252
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Cao Manh Dat
>Assignee: Joel Bernstein
> Attachments: SOLR-9252.patch, SOLR-9252.patch, SOLR-9252.patch, 
> SOLR-9252.patch, enron1.zip
>
>



[jira] [Updated] (SOLR-9252) Feature selection and logistic regression on text

2016-07-11 Thread Cao Manh Dat (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-9252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cao Manh Dat updated SOLR-9252:
---
Attachment: SOLR-9252.patch

Updated patch. This patch includes:
- Support for storing the textLogit & featuresSelection output by using the 
update stream (see the sketch below).
- The textLogit model now supports exact idfs by using SOLR-9243.
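
For reference, storing the output of these streams amounts to wrapping them in 
the update stream and reading the result back from the /stream handler. A 
hedged SolrJ sketch; the URL, the modelCollection destination, and the exact 
update() parameters are assumptions for illustration, not the patch's code:

{code}
import org.apache.solr.client.solrj.io.Tuple;
import org.apache.solr.client.solrj.io.stream.SolrStream;
import org.apache.solr.common.params.ModifiableSolrParams;

/**
 * Sketch: send a streaming expression that wraps featuresSelection in an
 * update() stream so the selected features are stored in another collection.
 * URLs, collection names and the expression shape are illustrative.
 */
public class StoreFeaturesExample {

    public static void main(String[] args) throws Exception {
        String expr =
            "update(modelCollection, batchSize=500, " +
            "  featuresSelection(collection1, q=\"*:*\", field=\"tv_text\", " +
            "                    outcome=\"out_i\", positiveLabel=1, numTerms=100))";

        ModifiableSolrParams params = new ModifiableSolrParams();
        params.set("qt", "/stream");
        params.set("expr", expr);

        SolrStream stream = new SolrStream("http://localhost:8983/solr/collection1", params);
        try {
            stream.open();
            Tuple tuple = stream.read();
            while (!tuple.EOF) {          // read until the EOF tuple
                System.out.println(tuple.fields);
                tuple = stream.read();
            }
        } finally {
            stream.close();
        }
    }
}
{code}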

> Feature selection and logistic regression on text
> -
>
> Key: SOLR-9252
> URL: https://issues.apache.org/jira/browse/SOLR-9252
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Cao Manh Dat
>Assignee: Joel Bernstein
> Attachments: SOLR-9252.patch, SOLR-9252.patch, SOLR-9252.patch, 
> enron1.zip
>
>



[jira] [Updated] (SOLR-9252) Feature selection and logistic regression on text

2016-07-06 Thread Cao Manh Dat (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-9252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cao Manh Dat updated SOLR-9252:
---
Assignee: (was: Cao Manh Dat)

> Feature selection and logistic regression on text
> -
>
> Key: SOLR-9252
> URL: https://issues.apache.org/jira/browse/SOLR-9252
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Cao Manh Dat
> Attachments: SOLR-9252.patch, SOLR-9252.patch, enron1.zip
>
>



[jira] [Updated] (SOLR-9252) Feature selection and logistic regression on text

2016-07-02 Thread Cao Manh Dat (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-9252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cao Manh Dat updated SOLR-9252:
---
Attachment: SOLR-9252.patch

Updated patch. I changed the feature selection formulation to the correct one 
(https://en.wikipedia.org/wiki/Information_gain_in_decision_trees). Here are 
the test results of the new formulation 
(https://docs.google.com/spreadsheets/d/1BRjFgZDiJPBT51kggcCznoK0ES1-N-RbOIJaoDT3qgM/edit#gid=0).

I think the patch is ready now.
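
For reference, the linked formulation is the standard decision-tree information 
gain. Restated compactly for a term T that splits the training set D into D_T 
(documents containing T) and D_{\bar T} (documents without it), with C the class 
label; this is a restatement of the linked definition, not text from the patch:

{code}
IG(T) = H(C) - \frac{|D_T|}{|D|}\, H(C \mid T) - \frac{|D_{\bar{T}}|}{|D|}\, H(C \mid \bar{T}),
\qquad
H(C) = -\sum_{c \in \{+,-\}} p(c)\, \log_2 p(c)
{code}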

> Feature selection and logistic regression on text
> -
>
> Key: SOLR-9252
> URL: https://issues.apache.org/jira/browse/SOLR-9252
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Cao Manh Dat
> Attachments: SOLR-9252.patch, SOLR-9252.patch, enron1.zip
>
>



[jira] [Updated] (SOLR-9252) Feature selection and logistic regression on text

2016-06-26 Thread Cao Manh Dat (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-9252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cao Manh Dat updated SOLR-9252:
---
Description: 
SOLR-9186 came up with a challenge: in each iteration we have to rebuild the 
tf-idf vector for every document. This is a costly computation if we represent a 
document by a lot of terms. Feature selection can help reduce the computation.

Due to its computational efficiency and simple interpretation, information gain 
is one of the most popular feature selection methods. It measures the dependence 
between features and labels by calculating the information gain between the i-th 
feature and the class labels 
(http://www.jiliang.xyz/publication/feature_selection_for_classification.pdf).

I confirmed this by running logistic regression on the Enron mail dataset (in 
which each email is represented by the top 100 terms with the highest information 
gain) and got an accuracy of 92% and a precision of 82%.

This ticket will create two new streaming expressions. Both of them use the same 
*parallel iterative framework* as SOLR-8492.

{code}
featuresSelection(collection1, q="*:*",  field="tv_text", outcome="out_i", 
positiveLabel=1, numTerms=100)
{code}

featuresSelection will emit the top terms with the highest information gain 
scores. It can be combined with the new tlogit stream.

{code}
tlogit(collection1, q="*:*",
 featuresSelection(collection1, 
  q="*:*",  
  field="tv_text", 
  outcome="out_i", 
  positiveLabel=1, 
  numTerms=100),
 field="tv_text",
 outcome="out_i",
 maxIterations=100)
{code}

In iteration n, the text logistic regression will emit the nth model and 
compute the error of the (n-1)th model, because the error would be wrong if we 
computed it dynamically within the same iteration. 
In each iteration tlogit adjusts the learning rate based on the error of the 
previous iteration: it increases the learning rate by 5% if the error is going 
down and decreases it by 50% if the error is going up.

This will support use cases such as building models for spam detection, 
sentiment analysis and threat detection. 

  was:
SOLR-9186 came up with a challenge: in each iteration we have to rebuild the 
tf-idf vector for every document. This is a costly computation if we represent a 
document by a lot of terms. Feature selection can help reduce the computation.

Due to its computational efficiency and simple interpretation, information gain 
is one of the most popular feature selection methods. It measures the dependence 
between features and labels by calculating the information gain between the i-th 
feature and the class labels 
(http://www.jiliang.xyz/publication/feature_selection_for_classification.pdf).

I confirmed this by running logistic regression on the Enron mail dataset (in 
which each email is represented by the top 100 terms with the highest information 
gain) and got an accuracy of 92% and a precision of 82%.

This ticket will create two new streaming expressions. Both of them use the same 
*parallel iterative framework* as SOLR-8492.

{code}
featuresSelection(collection1, q="*:*",  field="tv_text", outcome="out_i", 
positiveLabel=1, numTerms=100)
{code}

featuresSelection will emit the top terms with the highest information gain 
scores. It can be combined with the new tlogit stream.

{code}
tlogit(collection1, q="*:*",
 featuresSelection(collection1, 
  q="*:*",  
  field="tv_text", 
  outcome="out_i", 
  positiveLabel=1, 
  numTerms=100),
 field="tv_text",
 outcome="out_i",
 maxIterations=100)
{code}

In iteration n, the text logistic regression will emit the nth model and 
compute the error of the (n-1)th model, because the error would be wrong if we 
computed it dynamically within the same iteration.

This will support use cases such as building models for spam detection, 
sentiment analysis and threat detection. 


> Feature selection and logistic regression on text
> -
>
> Key: SOLR-9252
> URL: https://issues.apache.org/jira/browse/SOLR-9252
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Cao Manh Dat
> Attachments: SOLR-9252.patch, enron1.zip
>
>

[jira] [Updated] (SOLR-9252) Feature selection and logistic regression on text

2016-06-26 Thread Cao Manh Dat (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-9252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cao Manh Dat updated SOLR-9252:
---
Attachment: SOLR-9252.patch

Initial patch. 

> Feature selection and logistic regression on text
> -
>
> Key: SOLR-9252
> URL: https://issues.apache.org/jira/browse/SOLR-9252
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Cao Manh Dat
> Attachments: SOLR-9252.patch, enron1.zip
>
>



[jira] [Updated] (SOLR-9252) Feature selection and logistic regression on text

2016-06-26 Thread Cao Manh Dat (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-9252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cao Manh Dat updated SOLR-9252:
---
Description: 
SOLR-9186 came up with a challenge: in each iteration we have to rebuild the 
tf-idf vector for every document. This is a costly computation if we represent a 
document by a lot of terms. Feature selection can help reduce the computation.

Due to its computational efficiency and simple interpretation, information gain 
is one of the most popular feature selection methods. It measures the dependence 
between features and labels by calculating the information gain between the i-th 
feature and the class labels 
(http://www.jiliang.xyz/publication/feature_selection_for_classification.pdf).

I confirmed this by running logistic regression on the Enron mail dataset (in 
which each email is represented by the top 100 terms with the highest information 
gain) and got an accuracy of 92% and a precision of 82%.

This ticket will create two new streaming expressions. Both of them use the same 
*parallel iterative framework* as SOLR-8492.

{code}
featuresSelection(collection1, q="*:*",  field="tv_text", outcome="out_i", 
positiveLabel=1, numTerms=100)
{code}

featuresSelection will emit the top terms with the highest information gain 
scores. It can be combined with the new tlogit stream.

{code}
tlogit(collection1, q="*:*",
 featuresSelection(collection1, 
  q="*:*",  
  field="tv_text", 
  outcome="out_i", 
  positiveLabel=1, 
  numTerms=100),
 field="tv_text",
 outcome="out_i",
 maxIterations=100)
{code}

In iteration n, the text logistic regression will emit the nth model and 
compute the error of the (n-1)th model, because the error would be wrong if we 
computed it dynamically within the same iteration.

This will support use cases such as building models for spam detection, 
sentiment analysis and threat detection. 

  was:
SOLR-9186 came up with a challenge: in each iteration we have to rebuild the 
tf-idf vector for every document. This is a costly computation if we represent a 
document by a lot of terms. Feature selection can help reduce the computation.

Due to its computational efficiency and simple interpretation, information gain 
is one of the most popular feature selection methods. It measures the dependence 
between features and labels by calculating the information gain between the i-th 
feature and the class labels 
(http://www.jiliang.xyz/publication/feature_selection_for_classification.pdf).

I confirmed this by running logistic regression on the Enron mail dataset (in 
which each email is represented by the top 100 terms with the highest information 
gain) and got an accuracy of 92% and a precision of 82%.

This ticket will create two new streaming expressions. Both of them use the same 
*parallel iterative framework* as SOLR-8492.

{code}
featuresSelection(collection1, q="*:*",  field="tv_text", outcome="out_i", 
positiveLabel=1, numTerms=100)
{code}

featuresSelection will emit the top terms with the highest information gain 
scores. It can be combined with the new tlogit stream:

{code}
tlogit(collection1, q="*:*",
 featuresSelection(collection1, 
  q="*:*",  
  field="tv_text", 
  outcome="out_i", 
  positiveLabel=1, 
  numTerms=100),
 field="tv_text",
 outcome="out_i",
 maxIterations=100)
{code}

This will support use cases such as building models for spam detection, 
sentiment analysis and threat detection.



> Feature selection and logistic regression on text
> -
>
> Key: SOLR-9252
> URL: https://issues.apache.org/jira/browse/SOLR-9252
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Cao Manh Dat
> Attachments: enron1.zip
>
>
> SOLR-9186 raised the challenge that on each iteration we have to 
> rebuild the tf-idf vector for every document. This is computationally costly when a 
> document is represented by a lot of terms. Feature selection can help reduce the 
> computation.
> Due to its computational efficiency and simple interpretation, information 
> gain is one of the most popular feature selection methods. It measures the 
> dependence between features and labels by calculating the 
> information gain between the i-th feature and the class labels 
> (http://www.jiliang.xyz/publication/feature_selection_for_classification.pdf).
> I confirmed this by running logistic regression 

[jira] [Updated] (SOLR-9252) Feature selection and logistic regression on text

2016-06-26 Thread Cao Manh Dat (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-9252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cao Manh Dat updated SOLR-9252:
---
Attachment: enron1.zip

Enron mail dataset

> Feature selection and logistic regression on text
> -
>
> Key: SOLR-9252
> URL: https://issues.apache.org/jira/browse/SOLR-9252
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Cao Manh Dat
> Attachments: enron1.zip
>
>
> SOLR-9186 raised the challenge that on each iteration we have to 
> rebuild the tf-idf vector for every document. This is computationally costly when a 
> document is represented by a lot of terms. Feature selection can help reduce the 
> computation.
> Due to its computational efficiency and simple interpretation, information 
> gain is one of the most popular feature selection methods. It measures the 
> dependence between features and labels by calculating the 
> information gain between the i-th feature and the class labels 
> (http://www.jiliang.xyz/publication/feature_selection_for_classification.pdf).
> I confirmed this by running logistic regression on the Enron mail dataset (in 
> which each email is represented by the top 100 terms with the highest 
> information gain) and obtained 92% accuracy and 82% precision.
> This ticket will add two new streaming expressions. Both use the 
> same *parallel iterative framework* as SOLR-8492.
> {code}
> featuresSelection(collection1, q="*:*",  field="tv_text", outcome="out_i", 
> positiveLabel=1, numTerms=100)
> {code}
> featuresSelection will emit the top terms with the highest information gain 
> scores. It can be combined with the new tlogit stream:
> {code}
> tlogit(collection1, q="*:*",
>  featuresSelection(collection1, 
>   q="*:*",  
>   field="tv_text", 
>   outcome="out_i", 
>   positiveLabel=1, 
>   numTerms=100),
>  field="tv_text",
>  outcome="out_i",
>  maxIterations=100)
> {code}
> This will support use cases such as building models for spam detection, 
> sentiment analysis and threat detection.






[jira] [Updated] (SOLR-9252) Feature selection and logistic regression on text

2016-06-26 Thread Cao Manh Dat (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-9252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cao Manh Dat updated SOLR-9252:
---
Description: 
SOLR-9186 raised the challenge that on each iteration we have to rebuild 
the tf-idf vector for every document. This is computationally costly when a 
document is represented by a lot of terms. Feature selection can help reduce the computation.

Due to its computational efficiency and simple interpretation, information gain 
is one of the most popular feature selection methods. It measures the 
dependence between features and labels by calculating the information gain 
between the i-th feature and the class labels 
(http://www.jiliang.xyz/publication/feature_selection_for_classification.pdf).

I confirmed this by running logistic regression on the Enron mail dataset (in 
which each email is represented by the top 100 terms with the highest information 
gain) and obtained 92% accuracy and 82% precision.

This ticket will add two new streaming expressions. Both use the same 
*parallel iterative framework* as SOLR-8492.

{code}
featuresSelection(collection1, q="*:*",  field="tv_text", outcome="out_i", 
positiveLabel=1, numTerms=100)
{code}

featuresSelection will emit the top terms with the highest information gain 
scores. It can be combined with the new tlogit stream:

{code}
tlogit(collection1, q="*:*",
 featuresSelection(collection1, 
  q="*:*",  
  field="tv_text", 
  outcome="out_i", 
  positiveLabel=1, 
  numTerms=100),
 field="tv_text",
 outcome="out_i",
 maxIterations=100)
{code}

This will support use cases such as building models for spam detection, 
sentiment analysis and threat detection.


  was:
SOLR-9186 raised the challenge that on each iteration we have to rebuild 
the tf-idf vector for every document. This is computationally costly when a 
document is represented by a lot of terms. Feature selection can help reduce the computation.

Due to its computational efficiency and simple interpretation, information gain 
is one of the most popular feature selection methods. It measures the 
dependence between features and labels by calculating the information gain 
between the i-th feature and the class labels 
(http://www.jiliang.xyz/publication/feature_selection_for_classification.pdf).

I confirmed this by running logistic regression on the Enron mail dataset (in 
which each email is represented by the top 100 terms with the highest information 
gain) and obtained 92% accuracy and 82% precision.

This ticket will add two new streaming expressions. Both use the same 
*parallel iterative framework* as SOLR-8492.

```
featuresSelection(collection1, q="*:*",  field="tv_text", outcome="out_i", 
positiveLabel=1, numTerms=100)
```
featuresSelection will emit the top terms with the highest information gain 
scores. It can be combined with the new tlogit stream:

```
tlogit(collection1, q="*:*",
 featuresSelection(collection1, 
  q="*:*",  
  field="tv_text", 
  outcome="out_i", 
  positiveLabel=1, 
  numTerms=100),
 field="tv_text",
 outcome="out_i",
 maxIterations=100)
```

This will support use cases such as building models for spam detection, 
sentiment analysis and threat detection.



> Feature selection and logistic regression on text
> -
>
> Key: SOLR-9252
> URL: https://issues.apache.org/jira/browse/SOLR-9252
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Cao Manh Dat
>
> SOLR-9186 raised the challenge that on each iteration we have to 
> rebuild the tf-idf vector for every document. This is computationally costly when a 
> document is represented by a lot of terms. Feature selection can help reduce the 
> computation.
> Due to its computational efficiency and simple interpretation, information 
> gain is one of the most popular feature selection methods. It measures the 
> dependence between features and labels by calculating the 
> information gain between the i-th feature and the class labels 
> (http://www.jiliang.xyz/publication/feature_selection_for_classification.pdf).
> I confirmed this by running logistic regression on the Enron mail dataset (in 
> which each email is represented by the top 100 terms with the highest 
> information gain) and obtained 92% accuracy and 82% precision.
> This ticket will add two new streaming expressions. Both use the 
> same