[jira] [Commented] (SOLR-9258) Optimizing, storing and deploying AI models with Streaming Expressions

2016-09-29 Thread Joel Bernstein (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15534654#comment-15534654
 ] 

Joel Bernstein commented on SOLR-9258:
--

[~caomanhdat], thanks for all your work on this great ticket!

> Optimizing, storing and deploying AI models with Streaming Expressions
> --
>
> Key: SOLR-9258
> URL: https://issues.apache.org/jira/browse/SOLR-9258
> Project: Solr
>  Issue Type: New Feature
>  Security Level: Public (Default Security Level. Issues are Public) 
>Reporter: Joel Bernstein
>Assignee: Joel Bernstein
> Fix For: 6.3
>
> Attachments: ModelCache.java, ModelCache.java, SOLR-9258.patch, 
> SOLR-9258.patch, SOLR-9258.patch, SOLR-9258.patch, SOLR-9258.patch, 
> SOLR-9258.patch
>
>
> This ticket describes a framework for *optimizing*, *storing* and *deploying* 
> AI models within the Streaming Expression framework.
> *Optimizing*
> [~caomanhdat] has contributed SOLR-9252, which provides *Streaming 
> Expressions* for both feature selection and optimization of a logistic 
> regression text classifier. SOLR-9252 also provides a great working example 
> of *optimizing* a machine learning model with an in-place parallel 
> iterative algorithm.
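The optimization SOLR-9252 implements fits a logistic regression model iteratively. As a rough, hypothetical sketch of the kind of iterative gradient update involved (plain Python, not the SOLR-9252 code):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(docs, labels, num_features, iterations=100, lr=0.1):
    """Toy batch gradient descent for logistic regression.

    docs: list of feature vectors (lists of floats); labels: 0/1.
    A plain illustration of iterative optimization, not Solr's algorithm.
    """
    weights = [0.0] * num_features
    for _ in range(iterations):
        grad = [0.0] * num_features
        for x, y in zip(docs, labels):
            pred = sigmoid(sum(w * xi for w, xi in zip(weights, x)))
            err = pred - y
            for j, xj in enumerate(x):
                grad[j] += err * xj
        # Average the gradient over the batch and take one step.
        weights = [w - lr * g / len(docs) for w, g in zip(weights, grad)]
    return weights

# Toy data: the second feature predicts the label (first feature is a bias).
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 0.1], [1.0, 0.9]]
y = [0, 1, 0, 1]
w = train_logistic(X, y, 2)
```

In the parallel, in-place variant the gradient computation is sharded across workers and the weight update folds the partial gradients back together each iteration.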
> *Storing*
> Both features and optimized models can be stored in SolrCloud collections 
> using the update expression. Using [~caomanhdat]'s example in SOLR-9252, the 
> pseudo code for storing features would be:
> {code}
> update(featuresCollection, 
>featuresSelection(collection1, 
>  id="myFeatures", 
>  q="*:*", 
>  field="tv_text", 
>  outcome="out_i", 
>  positiveLabel=1, 
>  numTerms=100))
> {code}  
> The id field can be added to the featuresSelection expression so that the 
> features can later be retrieved from the collection they are stored in.
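Feature selection of the kind featuresSelection performs can be pictured as scoring each term by how strongly it separates positive from negative outcomes and keeping the top numTerms. A toy stand-in for that scoring (term-frequency contrast; Solr's actual scoring differs):

```python
from collections import Counter

def select_features(docs, outcomes, positive_label=1, num_terms=100):
    """Rank terms by |P(term|positive) - P(term|negative)|, keep the top ones.

    docs: list of token lists; outcomes: parallel list of labels.
    A simplified illustration of feature selection, not featuresSelection itself.
    """
    pos, neg = Counter(), Counter()
    n_pos = n_neg = 0
    for tokens, label in zip(docs, outcomes):
        if label == positive_label:
            pos.update(set(tokens)); n_pos += 1
        else:
            neg.update(set(tokens)); n_neg += 1
    def score(term):
        # Document-frequency contrast between the two classes.
        return abs(pos[term] / max(n_pos, 1) - neg[term] / max(n_neg, 1))
    vocab = set(pos) | set(neg)
    return sorted(vocab, key=score, reverse=True)[:num_terms]

docs = [["spam", "offer", "free"], ["meeting", "notes"],
        ["free", "offer"], ["project", "notes"]]
outcomes = [1, 0, 1, 0]
features = select_features(docs, outcomes, num_terms=2)
```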
> *Deploying*
> With the introduction of the topic() expression, SolrCloud can be treated as 
> a distributed message queue. This messaging capability can be used to deploy 
> models and to process data through them.
> To implement this approach, a classify() function can be created that uses 
> topic() functions to return both the model and the data to be classified. 
> The pseudo code looks like this:
> {code}
> classify(topic(models, q="modelID", fl="features, weights"),
>  topic(emails, q="*:*", fl="id, body", rows="500", version="3232323"))
> {code}
> In the example above the classify() function uses the topic() function to 
> retrieve the model. Each time there is an update to the model in the index, 
> the topic() expression will automatically read the new model.
> The topic() function is also used to pull in the data set being 
> classified. Notice the *version* parameter. This will be added to the topic() 
> function to support pulling results from a specific version number (jira 
> ticket to follow).
> With this approach both the model and the data to process through the model 
> are treated as messages in a message queue.
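The message-queue behavior described above can be sketched as a checkpointed consumer: each pull returns only documents whose version exceeds the last one seen, then advances the checkpoint. A minimal in-memory model of that semantics (hypothetical, not Solr's implementation):

```python
class TopicStream:
    """Checkpointed pull over a versioned document list (toy model of topic())."""

    def __init__(self, index, start_version=0, rows=500):
        self.index = index            # list of dicts with a "_version_" field
        self.checkpoint = start_version
        self.rows = rows

    def pull(self):
        # Return up to `rows` docs newer than the checkpoint, oldest first,
        # then advance the checkpoint past the newest doc returned.
        fresh = sorted((d for d in self.index
                        if d["_version_"] > self.checkpoint),
                       key=lambda d: d["_version_"])[:self.rows]
        if fresh:
            self.checkpoint = fresh[-1]["_version_"]
        return fresh

emails = [{"id": "e1", "_version_": 1}, {"id": "e2", "_version_": 2}]
stream = TopicStream(emails)
first = stream.pull()                 # both existing docs
emails.append({"id": "e3", "_version_": 3})
second = stream.pull()                # only the newly added doc
```

This is why an updated model is picked up automatically: re-indexing the model document gives it a higher version, so the next pull sees it.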
> The daemon function can be used to send the classify function to Solr where 
> it will be run in the background. The pseudo code looks like this:
> {code}
> daemon(...,
>  update(classifiedEmails, 
>  classify(topic(models, q="modelID", fl="features, weights"),
>   topic(emails, q="*:*", fl="id, body", 
> rows="500", version="3232323"))))
> {code}
> In this scenario the daemon runs the classify function repeatedly in the 
> background. With each run, the topic() functions re-pull the model if it 
> has been updated, and pull a new set of emails to be classified. The 
> classified emails can be stored in another SolrCloud collection using the 
> update() function.
> Using this approach, emails can be classified in batches. The daemon can 
> continue to run even after all the emails have been classified. New 
> emails added to the emails collection will then be automatically classified 
> when they enter the index.
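The daemon-plus-topic pattern above amounts to a background loop: refresh the model when a newer version appears, pull the next batch of unseen emails, score them, and store the results. A pure-Python schematic of one iteration (all class and field names here are illustrative, not Solr's):

```python
class ListStream:
    """Minimal batch source: hands out queued docs exactly once (toy topic())."""
    def __init__(self):
        self.queue = []
    def pull(self):
        batch, self.queue = self.queue, []
        return batch

class ClassifyDaemon:
    """Toy stand-in for daemon(update(classify(topic(models), topic(emails)))).

    model_store maps model_id -> (version, weights); email_stream exposes
    .pull(); classified stands in for update(classifiedEmails, ...).
    """
    def __init__(self, model_store, model_id, email_stream, classified):
        self.model_store = model_store
        self.model_id = model_id
        self.email_stream = email_stream
        self.classified = classified
        self.model_version = -1
        self.weights = {}

    def run_once(self):
        # topic(models, ...): re-read the model only when a newer version exists.
        version, weights = self.model_store[self.model_id]
        if version > self.model_version:
            self.model_version, self.weights = version, weights
        # topic(emails, ...): pull the next unseen batch and classify it.
        batch = self.email_stream.pull()
        for doc in batch:
            score = sum(self.weights.get(t, 0.0) for t in doc["body"].split())
            self.classified.append({"id": doc["id"], "positive": score > 0})
        return len(batch)

models = {"m1": (1, {"free": 1.0, "meeting": -1.0})}
stream = ListStream()
out = []
daemon = ClassifyDaemon(models, "m1", stream, out)
stream.queue.append({"id": "e1", "body": "free offer"})
daemon.run_once()
```

Calling run_once() again after storing a newer model version classifies the next batch with the refreshed weights, which is the behavior the daemon/topic combination provides.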
> Classification can be done in parallel once SOLR-9240 is completed. This will 
> allow topic() results to be partitioned across worker nodes so they can be 
> processed in parallel. The pseudo code for this is:
> {code}
> parallel(workerCollection, worker="20", ...,
>  daemon(...,
>update(classifiedEmails, 
>classify(topic(models, q="modelID", fl="features, 
> weights", partitionKeys="none"),
> topic(emails, q="*:*", fl="id, body", 
> rows="500", version="3232323", partitionKeys="id")))))
> {code}
> The code above sends a daemon to 20 workers, which will 
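The partitionKeys mechanism can be pictured as hashing each document's key to pick one of N workers, so every worker processes a disjoint slice of the topic() results. A hypothetical sketch of that routing (the hash and bucket layout are illustrative, not Solr's):

```python
import zlib

def partition(docs, key, workers):
    """Route docs into `workers` disjoint buckets by a stable hash of doc[key].

    Mirrors the idea of partitionKeys="id" across N workers; the hash
    function and bucket layout are illustrative, not Solr's actual routing.
    """
    buckets = [[] for _ in range(workers)]
    for doc in docs:
        slot = zlib.crc32(str(doc[key]).encode()) % workers
        buckets[slot].append(doc)
    return buckets

docs = [{"id": i} for i in range(100)]
buckets = partition(docs, "id", 20)
```

Because the hash is stable, the same document always lands on the same worker, so repeated daemon runs never classify a document on two workers at once. The model stream uses partitionKeys="none" so every worker receives the full model.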

[jira] [Commented] (SOLR-9258) Optimizing, storing and deploying AI models with Streaming Expressions

2016-09-29 Thread Cao Manh Dat (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15534649#comment-15534649
 ] 

Cao Manh Dat commented on SOLR-9258:


Thanks [~joel.bernstein] for your hard work on this ticket.


[jira] [Commented] (SOLR-9258) Optimizing, storing and deploying AI models with Streaming Expressions

2016-09-29 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15534646#comment-15534646
 ] 

ASF subversion and git services commented on SOLR-9258:
---

Commit 5adb8f1bd5905f6749e57b7e27d467a4f36c56b2 in lucene-solr's branch 
refs/heads/branch_6x from [~joel.bernstein]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=5adb8f1 ]

SOLR-9258: Update CHANGES.txt



[jira] [Commented] (SOLR-9258) Optimizing, storing and deploying AI models with Streaming Expressions

2016-09-29 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15534643#comment-15534643
 ] 

ASF subversion and git services commented on SOLR-9258:
---

Commit 787d905edcf813f2e02155aabcc0c1dd25509b21 in lucene-solr's branch 
refs/heads/master from [~joel.bernstein]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=787d905 ]

SOLR-9258: Update CHANGES.txt



[jira] [Commented] (SOLR-9258) Optimizing, storing and deploying AI models with Streaming Expressions

2016-09-29 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15534473#comment-15534473
 ] 

ASF subversion and git services commented on SOLR-9258:
---

Commit 568b54687a938ed0f6cd8b29100eda2c0b547975 in lucene-solr's branch 
refs/heads/branch_6x from [~joel.bernstein]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=568b546 ]

SOLR-9258: Fix precommit



[jira] [Commented] (SOLR-9258) Optimizing, storing and deploying AI models with Streaming Expressions

2016-09-29 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15534333#comment-15534333
 ] 

ASF subversion and git services commented on SOLR-9258:
---

Commit 9cd6437d4b21dd6d9c16688eedb5af012ea67e86 in lucene-solr's branch 
refs/heads/master from [~joel.bernstein]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=9cd6437 ]

SOLR-9258: Optimizing, storing and deploying AI models with Streaming 
Expressions



[jira] [Commented] (SOLR-9258) Optimizing, storing and deploying AI models with Streaming Expressions

2016-09-29 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15534334#comment-15534334
 ] 

ASF subversion and git services commented on SOLR-9258:
---

Commit 8f00bcb1a0d88a6898e3ae6b8749610b2bd47d3c in lucene-solr's branch 
refs/heads/master from [~joel.bernstein]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=8f00bcb ]

SOLR-9258: Fix precommit


> Optimizing, storing and deploying AI models with Streaming Expressions
> --
>
> Key: SOLR-9258
> URL: https://issues.apache.org/jira/browse/SOLR-9258
> Project: Solr
>  Issue Type: New Feature
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Joel Bernstein
>Assignee: Joel Bernstein
> Fix For: 6.2
>
> Attachments: ModelCache.java, ModelCache.java, SOLR-9258.patch, 
> SOLR-9258.patch, SOLR-9258.patch, SOLR-9258.patch, SOLR-9258.patch, 
> SOLR-9258.patch
>
>
> This ticket describes a framework for *optimizing*, *storing* and *deploying* 
> AI models within the Streaming Expression framework.
> *Optimizing*
> [~caomanhdat], has contributed SOLR-9252 which provides *Streaming 
> Expressions* for both feature selection and optimization of a logistic 
> regression text classifier. SOLR-9252 also provides a great working example 
> of *optimization* of a machine learning model using an in-place parallel 
> iterative algorithm.
> *Storing*
> Both features and optimized models can be stored in SolrCloud collections 
> using the update expression. Using [~caomanhdat]'s example in SOLR-9252, the 
> pseudo code for storing features would be:
> {code}
> update(featuresCollection, 
>featuresSelection(collection1, 
> id="myFeatures", 
> q="*:*",  
> field="tv_text", 
> outcome="out_i", 
> positiveLabel=1, 
> numTerms=100))
> {code}  
> The id field can be added to the featureSelection expression so that features 
> can be later retrieved from the collection it's stored in.
> *Deploying*
> With the introduction of the topic() expression, SolrCloud can be treated as 
> a distributed message queue. This messaging capability can  be used to deploy 
> models and process data through the models.
> To implement this approach a classify() function can be created that uses a 
> topic() function to return both the model and the data to be classified:
> The pseudo code looks like this:
> {code}
> classify(topic(models, q="modelID", fl="features, weights"),
>  topic(emails, q="*:*", fl="id, body", rows="500", version="3232323"))
> {code}
> In the example above the classify() function uses the topic() function to 
> retrieve the model. Each time there is an update to the model in the index, 
> the topic() expression will automatically read the new model.
> The topic function() is also used to pull in the data set that is being 
> classified. Notice the *version* parameter. This will be added to the topic 
> function to support pulling results from a specific version number (jira 
> ticket to follow).
> With this approach both the model and the data to process through the model 
> are treated as messages in a message queue.
> The daemon function can be used to send the classify function to Solr where 
> it will be run in the background. The pseudo code looks like this:
> {code}
> daemon(...,
>  update(classifiedEmails, 
>  classify(topic(models, q="modelID", fl="features, weights"),
>   topic(emails, q="*:*", fl="id, fl, body", 
> rows="500", version="3232323"
> {code}
> In this scenario the daemon will run the classify function repeatedly in the 
> background. With each run the topic() functions will re-pull the model if the 
> model has been updated. It will also pull a new set of emails to be 
> classified. The classified emails can be stored in another SolrCloud 
> collection using the update() function.
> Using this approach emails can be classified in batches. The daemon can 
> continue to run even after all all the emails have been classified. New 
> emails added to the emails collections will then be automatically classified 
> when they enter the index.
> Classification can be done in parallel once SOLR-9240 is completed. This will 
> allow topic() results to be partitioned across worker nodes so they can be 
> processed in parallel. The pseudo code for this is:
> {code}
> parallel(workerCollection, worker="20", ...,
>  daemon(...,
>update(classifiedEmails, 
>classify(topic(models, q="modelID", fl="features, 
> weights", partitionKeys="none"),
> topic(emails, q="*:*", fl="id, body", 
> rows="500", version="3232323", partitionKeys="id")))))
> {code}
> The code above sends a daemon to 20 workers, which will each classify a 
> partition of records pulled by the topic() function.
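The partitionKeys="id" parameter implies each worker sees only the tuples whose key hashes to its slot, so the 20 daemons process disjoint slices. A minimal sketch of that routing rule (an assumption for illustration; Solr's actual hash partitioning may differ):

```java
// Hypothetical sketch of partitionKeys routing: a tuple goes to the worker
// whose index matches the hash of its partition key, giving each worker a
// disjoint slice of the topic() results.
public class KeyPartitioner {
    private final int numWorkers;

    public KeyPartitioner(int numWorkers) {
        this.numWorkers = numWorkers;
    }

    public int workerFor(String partitionKey) {
        // floorMod keeps the result non-negative even for negative hash codes
        return Math.floorMod(partitionKey.hashCode(), numWorkers);
    }
}
```

The key property is determinism: the same id always routes to the same worker, so no email is classified twice.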

[jira] [Commented] (SOLR-9258) Optimizing, storing and deploying AI models with Streaming Expressions

2016-09-20 Thread Cao Manh Dat (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15506635#comment-15506635
 ] 

Cao Manh Dat commented on SOLR-9258:


+1 The patch looks great.


[jira] [Commented] (SOLR-9258) Optimizing, storing and deploying AI models with Streaming Expressions

2016-09-15 Thread Joel Bernstein (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15493086#comment-15493086
 ] 

Joel Bernstein commented on SOLR-9258:
--

Ah I see the problem now.

Ok, it sounds like any Stream that uses analyzers will be tied to Solr core. 
This is OK, because the main use case is to run the expressions through 
StreamHandler.

Maybe we need a new package in Solr core for streams that rely on Solr core 
classes. I'll put some thought into this.

I'll keep working with the patch.


[jira] [Commented] (SOLR-9258) Optimizing, storing and deploying AI models with Streaming Expressions

2016-09-14 Thread Cao Manh Dat (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15492424#comment-15492424
 ] 

Cao Manh Dat commented on SOLR-9258:


We can't put ClassifyStream inside solrj.io, because the solrj module does not 
depend on solr-core or lucene-core, so we can't access Analyzer or SolrCore 
from ClassifyStream. 
I also can't find any package inside solr-core that is appropriate for this 
class. So I made ClassifyStream an inner class of StreamHandler. (Hints are 
welcome :) )


[jira] [Commented] (SOLR-9258) Optimizing, storing and deploying AI models with Streaming Expressions

2016-09-14 Thread Joel Bernstein (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15492086#comment-15492086
 ] 

Joel Bernstein commented on SOLR-9258:
--

The more I look at the ModelCache the less I like it. My main concern is the 
synchronization which could become a bottleneck.

Looking into other caching approaches...


[jira] [Commented] (SOLR-9258) Optimizing, storing and deploying AI models with Streaming Expressions

2016-09-13 Thread Joel Bernstein (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15487842#comment-15487842
 ] 

Joel Bernstein commented on SOLR-9258:
--

Aside from the inner class issue, the ClassifyStream is looking very good.

One thing we should talk about is the model streaming. I think it makes sense 
to pull the model in the Classify.open() method. I think it also makes sense to 
have a specific ModelStream implementation that has this behavior:

1) Checks a local cache of the models to see if the model is in memory. The 
cache can be a simple LRUCache.
2) If the model is already in the cache, attempt to pull the model with a 
TopicStream. If nothing comes back, the model hasn't been changed so use the 
cached version. If the model does come back, use the new model and update the 
cache.
3) If the model is not already in the cache, pull the model using a 
CloudSolrStream and update the cache.

The topic checkpoints can be kept in the same SolrCloud collection as the 
models. So the syntax would be:

model(collection, modelID)
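The three steps above amount to an LRU cache fronting a topic/CloudSolrStream pull. A minimal sketch of just the cache layer, using an access-ordered LinkedHashMap (a simplified assumption; the actual ModelCache.java attached to this ticket may be structured differently):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of the ModelCache idea: a bounded LRU map keyed by model
// ID. On a hit the caller would still check the topic() stream for an updated
// model; on a miss it would pull the full model with a CloudSolrStream and
// insert it here.
public class ModelLruCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxSize;

    public ModelLruCache(int maxSize) {
        super(16, 0.75f, true); // accessOrder = true gives LRU eviction order
        this.maxSize = maxSize;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxSize; // evict the least-recently-used model when full
    }
}
```

Eviction keeps memory bounded while hot models stay resident; concurrent access would still need synchronization or a concurrent cache on top of this.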



[jira] [Commented] (SOLR-9258) Optimizing, storing and deploying AI models with Streaming Expressions

2016-09-13 Thread Joel Bernstein (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15487739#comment-15487739
 ] 

Joel Bernstein commented on SOLR-9258:
--

Ok, first question.

Why is the ClassifyStream an inner class of the StreamHandler?


[jira] [Commented] (SOLR-9258) Optimizing, storing and deploying AI models with Streaming Expressions

2016-09-13 Thread Joel Bernstein (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15487408#comment-15487408
 ] 

Joel Bernstein commented on SOLR-9258:
--

Thanks for working on this!  I'll start reviewing this patch today.

Let's create a new ticket for the *classify expression* and link it to this 
ticket. There will be a number of approaches for deploying models, which we can 
build in separate tickets and link to this one.
