Cao Manh Dat commented on SOLR-9258:

+1 The patch looks great.

> Optimizing, storing and deploying AI models with Streaming Expressions
> ----------------------------------------------------------------------
>                 Key: SOLR-9258
>                 URL: https://issues.apache.org/jira/browse/SOLR-9258
>             Project: Solr
>          Issue Type: New Feature
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Joel Bernstein
>            Assignee: Joel Bernstein
>             Fix For: 6.2
>         Attachments: ModelCache.java, ModelCache.java, SOLR-9258.patch, 
> SOLR-9258.patch
> This ticket describes a framework for *optimizing*, *storing* and *deploying* 
> AI models within the Streaming Expression framework.
> *Optimizing*
> [~caomanhdat] has contributed SOLR-9252, which provides *Streaming 
> Expressions* for both feature selection and optimization of a logistic 
> regression text classifier. SOLR-9252 also provides a great working example 
> of *optimization* of a machine learning model using an in-place parallel 
> iterative algorithm.
> *Storing*
> Both features and optimized models can be stored in SolrCloud collections 
> using the update expression. Using [~caomanhdat]'s example in SOLR-9252, the 
> pseudo code for storing features would be:
> {code}
> update(featuresCollection, 
>        featuresSelection(collection1, 
>                             id="myFeatures", 
>                             q="*:*",  
>                             field="tv_text", 
>                             outcome="out_i", 
>                             positiveLabel=1, 
>                             numTerms=100))
> {code}  
> The id field can be added to the featuresSelection expression so that the 
> features can later be retrieved from the collection they are stored in.
> *Deploying*
> With the introduction of the topic() expression, SolrCloud can be treated as 
> a distributed message queue. This messaging capability can be used to deploy 
> models and process data through the models.
> To implement this approach, a classify() function can be created that uses 
> topic() functions to retrieve both the model and the data to be classified. 
> The pseudo code looks like this:
> {code}
> classify(topic(models, q="modelID", fl="features, weights"),
>          topic(emails, q="*:*", fl="id, body", rows="500", version="3232323"))
> {code}
> In the example above the classify() function uses the topic() function to 
> retrieve the model. Each time there is an update to the model in the index, 
> the topic() expression will automatically read the new model.
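A minimal Python sketch of what a classify() step might compute once it holds a model tuple and a document tuple. It assumes the model carries parallel `features` (terms) and `weights` lists, that a term contributes its weight once if it occurs in the body, and that there is no bias term. The function name, the tuple layout, and the 0.5 threshold are illustrative assumptions, not the API proposed in the ticket.

```python
import math

def classify(model, doc, threshold=0.5):
    """Score one document with a logistic regression model.

    Assumption: `model` has parallel `features` and `weights` lists and
    a term counts once if it appears anywhere in the document body.
    """
    weight_by_term = dict(zip(model["features"], model["weights"]))
    terms = set(doc["body"].lower().split())
    # Linear score over the model terms present in the document.
    z = sum(w for term, w in weight_by_term.items() if term in terms)
    # The logistic (sigmoid) function maps the score to a probability.
    probability = 1.0 / (1.0 + math.exp(-z))
    return {"id": doc["id"],
            "probability": probability,
            "positive": probability >= threshold}

# Hypothetical model and documents, for illustration only.
model = {"features": ["free", "winner", "meeting"],
         "weights": [1.5, 2.0, -2.5]}
spam = classify(model, {"id": "1", "body": "You are a WINNER claim your free prize"})
ham = classify(model, {"id": "2", "body": "Agenda for the meeting tomorrow"})
```

Because the model is just another tuple, swapping in a newer model only changes the `features` and `weights` values passed to the function.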
> The topic() function is also used to pull in the data set that is being 
> classified. Notice the *version* parameter. This will be added to the topic 
> function to support pulling results from a specific version number (jira 
> ticket to follow).
> With this approach both the model and the data to process through the model 
> are treated as messages in a message queue.
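The message-queue behaviour can be illustrated with a toy in-memory stand-in for topic(). This is only a sketch under stated assumptions: the `Topic` class, its `read()` method, and the list-of-dicts collection are hypothetical, and the `_version_` key merely mimics a per-document version number. Each read returns documents newer than the stored checkpoint and then advances the checkpoint.

```python
class Topic:
    """Toy stand-in for topic(): returns documents newer than a checkpoint."""

    def __init__(self, collection, rows=500, version=None):
        self.collection = collection  # list of dicts with a "_version_" key
        self.rows = rows
        if version is None:
            # No start version given: skip everything already indexed.
            version = max((d["_version_"] for d in collection), default=0)
        self.checkpoint = version

    def read(self):
        fresh = sorted(
            (d for d in self.collection if d["_version_"] > self.checkpoint),
            key=lambda d: d["_version_"],
        )
        batch = fresh[:self.rows]
        if batch:
            # Remember the newest version delivered so far.
            self.checkpoint = batch[-1]["_version_"]
        return batch

emails = [{"id": "a", "_version_": 1}, {"id": "b", "_version_": 2}]
replay = Topic(emails, rows=500, version=0)  # start from an older version
live = Topic(emails)                         # only content added from now on

first = [d["id"] for d in replay.read()]
second = [d["id"] for d in replay.read()]
emails.append({"id": "c", "_version_": 3})
third = [d["id"] for d in replay.read()]
live_batch = [d["id"] for d in live.read()]
```

Here `replay` delivers the backlog once and then only newer documents, while `live` never sees the backlog at all.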
> The daemon function can be used to send the classify function to Solr where 
> it will be run in the background. The pseudo code looks like this:
> {code}
> daemon(...,
>          update(classifiedEmails, 
>                  classify(topic(models, q="modelID", fl="features, weights"),
>                           topic(emails, q="*:*", fl="id, body", 
>                                 rows="500", version="3232323"))))
> {code}
> In this scenario the daemon will run the classify function repeatedly in the 
> background. With each run the topic() functions will re-pull the model if the 
> model has been updated and will pull a new set of emails to be 
> classified. The classified emails can be stored in another SolrCloud 
> collection using the update() function.
> Using this approach emails can be classified in batches. The daemon can 
> continue to run even after all the emails have been classified. New 
> emails added to the emails collection will then be automatically classified 
> when they enter the index.
> Classification can be done in parallel once SOLR-9240 is completed. This will 
> allow topic() results to be partitioned across worker nodes so they can be 
> processed in parallel. The pseudo code for this is:
> {code}
> parallel(workerCollection, workers="20", ...,
>          daemon(...,
>                    update(classifiedEmails, 
>                            classify(topic(models, q="modelID", 
>                                           fl="features, weights", partitionKeys="none"),
>                                     topic(emails, q="*:*", fl="id, body", 
>                                           rows="500", version="3232323", partitionKeys="id")))))
> {code}
> The code above sends a daemon to 20 workers, which will each classify a 
> partition of records pulled by the topic() function.
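One way to picture the partitioning is a hash of the partition key modulo the number of workers. The Python sketch below is an illustration only: the `partition` helper is hypothetical, and CRC32 is used simply because it is deterministic across runs, not because Solr uses it. Under this reading, partitionKeys="none" on the models topic would presumably mean that every worker reads the whole model rather than a slice of it.

```python
import zlib

def partition(docs, key, workers):
    """Assign each document to exactly one worker by hashing its key."""
    buckets = [[] for _ in range(workers)]
    for doc in docs:
        # A stable hash, so the same key always lands on the same worker.
        worker_id = zlib.crc32(str(doc[key]).encode("utf-8")) % workers
        buckets[worker_id].append(doc)
    return buckets

# Hypothetical batch of 1000 emails split across 20 workers.
docs = [{"id": f"email-{i}"} for i in range(1000)]
buckets = partition(docs, key="id", workers=20)
```

Every document ends up in exactly one bucket, so the workers can classify their partitions independently.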
> *AI based alerting*
> If the *version* parameter is not supplied to the topic stream it will stream 
> only new content from the topic, rather than starting from an older version 
> number.
> In this scenario the topic function behaves like an alert. Pseudo code for 
> alerts looks like this:
> {code}
> daemon(...,
>          alert(..., 
>              classify(topic(models, q="modelID", fl="features, weights"),
>                       topic(emails, q="*:*", fl="id, body", rows="500"))))
> {code}
> In the example above an alert() function wraps the classify() function and 
> takes actions based on the classification of documents. Developers can build 
> their own alert functions using the Streaming API and plug them in to provide 
> custom actions.
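The logic of such an alert wrapper can be sketched in a few lines of Python. This is not the Streaming API (which is Java); the `alert` function, the `probability` field on classified tuples, and the 0.9 threshold are all assumptions made for illustration. The custom action is passed in as a callable.

```python
def alert(classified, action, threshold=0.9):
    """Call `action` for each classified tuple at or above `threshold`."""
    fired = []
    for doc in classified:
        if doc["probability"] >= threshold:
            action(doc)  # pluggable custom action, e.g. send a notification
            fired.append(doc["id"])
    return fired

# Hypothetical classified tuples; the action just records the ids it saw.
notified = []
classified = [{"id": "1", "probability": 0.97},
              {"id": "2", "probability": 0.08}]
fired = alert(classified, action=lambda d: notified.append(d["id"]))
```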

