[
https://issues.apache.org/jira/browse/SOLR-9258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Joel Bernstein reassigned SOLR-9258:
------------------------------------
Assignee: Joel Bernstein
> Optimizing, storing and deploying AI models with Streaming Expressions
> ----------------------------------------------------------------------
>
> Key: SOLR-9258
> URL: https://issues.apache.org/jira/browse/SOLR-9258
> Project: Solr
> Issue Type: New Feature
> Security Level: Public(Default Security Level. Issues are Public)
> Reporter: Joel Bernstein
> Assignee: Joel Bernstein
> Fix For: 6.2
>
>
> This ticket describes a framework for *optimizing*, *storing* and *deploying*
> AI models within the Streaming Expression framework.
> *Optimizing*
> [~caomanhdat] has contributed SOLR-9252, which provides *Streaming
> Expressions* for both feature selection and optimization of a logistic
> regression text classifier. SOLR-9252 also provides a great working example
> of *optimizing* a machine learning model using an in-place parallel
> iterative algorithm.
> *Storing*
> Both features and optimized models can be stored in SolrCloud collections
> using the update expression. Using [~caomanhdat]'s example in SOLR-9252, the
> pseudo code for storing features would be:
> {code}
> update(featuresCollection,
> featuresSelection(collection1,
> id="myFeatures",
> q="*:*",
> field="tv_text",
> outcome="out_i",
> positiveLabel=1,
> numTerms=100))
> {code}
> The id field can be added to the featuresSelection expression so that the
> features can later be retrieved from the collection they are stored in.
> *Deploying*
> With the introduction of the topic() expression, SolrCloud can be treated as
> a distributed message queue. This messaging capability can be used to deploy
> models and process data through the models.
> To implement this approach, a classify() function can be created that uses
> topic() functions to return both the model and the data to be classified.
> The pseudo code looks like this:
> {code}
> classify(topic(models, q="modelID", fl="features, weights"),
> topic(emails, q="*:*", fl="id, body", rows="500", version="3232323"))
> {code}
> In the example above the classify() function uses the topic() function to
> retrieve the model. Each time there is an update to the model in the index,
> the topic() expression will automatically read the new model.
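To make the classify() step concrete, here is a minimal Python sketch of scoring documents with a logistic regression model that arrives as a list of feature terms and a matching list of weights. The parallel-list layout, the binary term-presence encoding, the missing bias term, and the `threshold` parameter are all assumptions made for illustration; the ticket does not specify them.

```python
import math


def classify(model, docs, threshold=0.5):
    """Score each document with a logistic regression model.

    `model` is assumed to hold parallel lists "features" and "weights";
    this layout is hypothetical and only mirrors fl="features, weights".
    """
    weights = dict(zip(model["features"], model["weights"]))
    results = []
    for doc in docs:
        terms = set(doc["body"].lower().split())
        # Binary term presence: add a weight once if its term occurs.
        z = sum(w for term, w in weights.items() if term in terms)
        probability = 1.0 / (1.0 + math.exp(-z))
        results.append({
            "id": doc["id"],
            "probability": probability,
            "positive": probability >= threshold,
        })
    return results


# Made-up model and documents, for illustration only.
model = {"features": ["free", "winner", "meeting"],
         "weights": [1.5, 2.0, -2.5]}
emails = [
    {"id": "1", "body": "You are a WINNER claim your free prize"},
    {"id": "2", "body": "Agenda for the meeting tomorrow"},
]
scored = classify(model, emails)
```

Because the model is passed in on every call, a caller that re-reads it before each batch picks up updated weights without any change to the scoring code.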
> The topic() function is also used to pull in the data set that is being
> classified. Notice the *version* parameter. This will be added to the topic
> function to support pulling results from a specific version number (JIRA
> ticket to follow).
> With this approach both the model and the data to process through the model
> are treated as messages in a message queue.
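The message-queue behaviour can be pictured with a toy Python model of a topic that remembers a version checkpoint and returns only documents newer than it. This is a sketch under stated assumptions, not Solr's implementation: the `Topic` class, the in-memory list standing in for a collection, and the exclusive comparison against the starting version are all hypothetical.

```python
class Topic:
    """Toy stand-in for topic(): each read returns up to `rows` documents
    whose version is greater than the checkpoint, then advances it."""

    def __init__(self, collection, rows=500, version=0):
        self.collection = collection  # list of dicts with a "_version_" key
        self.rows = rows
        self.checkpoint = version     # assumed exclusive starting point

    def read(self):
        newer = sorted(
            (d for d in self.collection if d["_version_"] > self.checkpoint),
            key=lambda d: d["_version_"],
        )
        batch = newer[:self.rows]
        if batch:
            # Remember the newest version seen so it is not re-delivered.
            self.checkpoint = batch[-1]["_version_"]
        return batch


emails = [{"id": "a", "_version_": 1},
          {"id": "b", "_version_": 2},
          {"id": "c", "_version_": 3}]
topic = Topic(emails, rows=2, version=0)
first = topic.read()    # first batch of two
second = topic.read()   # remaining document
third = topic.read()    # nothing new yet
emails.append({"id": "d", "_version_": 4})
fourth = topic.read()   # picks up the newly added document
```

The same mechanism serves both the model and the data: an updated model simply shows up as a newer message on the next read.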
> The daemon function can be used to send the classify function to Solr where
> it will be run in the background. The pseudo code looks like this:
> {code}
> daemon(...,
> update(classifiedEmails,
> classify(topic(models, q="modelID", fl="features, weights"),
> topic(emails, q="*:*", fl="id, body",
> rows="500", version="3232323"))))
> {code}
> In this scenario the daemon will run the classify function repeatedly in the
> background. With each run the topic() functions will re-pull the model if it
> has been updated, and will also pull a new set of emails to be classified.
> The classified emails can be stored in another SolrCloud collection using
> the update() function.
> Using this approach, emails can be classified in batches. The daemon can
> continue to run even after all the emails have been classified. New
> emails added to the emails collection will then be automatically classified
> when they enter the index.
> Classification can be done in parallel once SOLR-9240 is completed. This will
> allow topic() results to be partitioned across worker nodes so they can be
> processed in parallel. The pseudo code for this is:
> {code}
> parallel(workerCollection, worker="20", ...,
> daemon(...,
> update(classifiedEmails,
> classify(topic(models, q="modelID", fl="features, weights",
> partitionKeys="none"),
> topic(emails, q="*:*", fl="id, body",
> rows="500", version="3232323", partitionKeys="id")))))
> {code}
> The code above sends a daemon to 20 workers, which will each classify a
> partition of records pulled by the topic() function.
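As a rough illustration of partitioning by key, the Python sketch below assigns each document to exactly one worker by hashing its id. The `partition` helper and the use of CRC32 are assumptions chosen for a stable, repeatable example; the hash function Solr actually uses is not described in this ticket.

```python
import zlib


def partition(docs, key, num_workers):
    """Split docs into num_workers buckets by a stable hash of doc[key]."""
    buckets = [[] for _ in range(num_workers)]
    for doc in docs:
        # CRC32 is an illustrative choice; any stable hash would do.
        digest = zlib.crc32(str(doc[key]).encode("utf-8"))
        buckets[digest % num_workers].append(doc)
    return buckets


docs = [{"id": str(i)} for i in range(100)]
buckets = partition(docs, "id", 20)
```

The model topic is not split this way: with partitionKeys="none" each worker is meant to receive the whole model, while only the emails are divided among workers.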
> *AI based alerting*
> If the *version* parameter is not supplied to the topic stream, it will
> stream only new content from the topic, rather than starting from an older
> version number.
> In this scenario the topic function behaves like an alert. Pseudo code for
> alerts looks like this:
> {code}
> daemon(...,
> alert(...,
> classify(topic(models, q="modelID", fl="features, weights"),
> topic(emails, q="*:*", fl="id, body", rows="500"))))
> {code}
> In the example above, an alert() function wraps the classify() function and
> takes actions based on the classification of documents. Developers can build
> their own alert functions using the Streaming API and plug them in to provide
> custom actions.
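A custom alert can be thought of as a function that receives classified documents and calls a pluggable action for those that cross a threshold. The Python sketch below shows that shape only; the `alert` signature, the `probability` field, and the `threshold` default are hypothetical and are not part of the Streaming API.

```python
def alert(classified, action, threshold=0.9):
    """Call action(doc) for each classified doc at or above the threshold
    and return the ids that triggered it."""
    fired = []
    for doc in classified:
        if doc["probability"] >= threshold:
            action(doc)
            fired.append(doc["id"])
    return fired


# Made-up classification results; the action just records the document.
notifications = []
classified = [{"id": "1", "probability": 0.97},
              {"id": "2", "probability": 0.08}]
fired = alert(classified, notifications.append)
```

Swapping `notifications.append` for a function that sends an email or posts to a queue changes the action without touching the classification step.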
>