[
https://issues.apache.org/jira/browse/SOLR-8492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15086453#comment-15086453
]
Joel Bernstein commented on SOLR-8492:
--------------------------------------
I think the difference might be that I actually used a *Gradient Ascent* algorithm.
I did not describe this correctly in the issue description.
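
To make the distinction concrete, here is a minimal sketch of per-sample gradient ascent for logistic regression. It is not taken from the patch: the function names, the learning rate, and the toy data are assumptions for illustration only. Ascent climbs the log-likelihood with the update w += rate * (y - p) * x, which is the same step as gradient descent on the negative log-likelihood.

```python
import math


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def log_likelihood(weights, rows, outcomes, eps=1e-12):
    """Log-likelihood of the 0/1 outcomes under the current weights."""
    total = 0.0
    for x, y in zip(rows, outcomes):
        p = sigmoid(sum(w * xi for w, xi in zip(weights, x)))
        p = min(max(p, eps), 1.0 - eps)  # keep log() away from 0
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total


def gradient_ascent_epoch(weights, rows, outcomes, learning_rate=0.1):
    """One pass over the samples, updating the weights after each one.

    The step moves the weights *up* the log-likelihood gradient,
    (y - p) * x, hence "ascent". The learning rate is an assumed value.
    """
    w = list(weights)
    for x, y in zip(rows, outcomes):
        p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
        error = y - p
        w = [wi + learning_rate * error * xi for wi, xi in zip(w, x)]
    return w


# Hypothetical toy data: a constant bias term plus one feature.
rows = [[1.0, 0.5], [1.0, 1.5], [1.0, -1.0], [1.0, -2.0]]
outcomes = [1, 1, 0, 0]

start = [0.0, 0.0]
trained = gradient_ascent_epoch(start, rows, outcomes)
```

Comparing log_likelihood(start, rows, outcomes) with log_likelihood(trained, rows, outcomes) shows whether a pass moved the weights in the intended direction.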
> Add LogisticRegressionQuery and LogitStream
> -------------------------------------------
>
> Key: SOLR-8492
> URL: https://issues.apache.org/jira/browse/SOLR-8492
> Project: Solr
> Issue Type: New Feature
> Reporter: Joel Bernstein
> Attachments: SOLR-8492.patch, SOLR-8492.patch, SOLR-8492.patch
>
>
> This ticket is to add a new query called a LogisticRegressionQuery (LRQ).
> The LRQ extends AnalyticsQuery
> (http://joelsolr.blogspot.com/2015/12/understanding-solrs-analyticsquery.html)
> and returns a DelegatingCollector that implements a Stochastic Gradient
> Descent (SGD) optimizer for Logistic Regression.
> This ticket also adds the LogitStream which leverages Streaming Expressions
> to provide iteration over the shards. Each call to LogitStream.read() calls
> down to the shards and executes the LogisticRegressionQuery. The model data
> is collected from the shards and the weights are averaged and sent back to
> the shards with the next iteration. Each call to read() returns a Tuple with
> the averaged weights and error from the shards. With this approach the
> LogitStream streams the changing model back to the client after each
> iteration.
> The LogitStream will return the EOF Tuple when it reaches the defined
> maxIterations. When sent as a Streaming Expression to the Stream handler this
> provides parallel iterative behavior. This same approach can be used to
> implement other parallel iterative algorithms.
> The initial patch has a test which simply tests the mechanics of the
> iteration. More work will need to be done to ensure the SGD is properly
> implemented. The distributed approach of the SGD will also need to be
> reviewed.
> This implementation is designed for use cases with a small number of features
> because each feature is its own discrete field.
> An implementation which supports a higher number of features would be
> possible by packing features into a byte array and storing as binary
> DocValues.
> This implementation is designed to support a large sample set. With a large
> number of shards, a sample set in the billions may be possible.
> Sample Streaming Expression syntax:
> {code}
> logit(collection1, features="a,b,c,d,e,f", outcome="x", maxIterations="80")
> {code}
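
The iteration described in the issue (each shard trains locally, the weights are averaged, and the averaged model is sent back for the next pass) can be sketched in a few lines. This is an illustration under stated assumptions, not the patch's code: shard_pass and logit_iterations are hypothetical names, shards are plain in-memory lists, the per-shard error is assumed to be a sum of absolute errors, and the learning rate is arbitrary. The generator stands in for repeated calls to read().

```python
import math


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def shard_pass(weights, samples, learning_rate=0.1):
    """Hypothetical stand-in for one shard's pass over its documents.

    Starts from the weights it was sent, updates a local copy after
    every sample, and returns (local_weights, summed_absolute_error).
    """
    w = list(weights)
    total_error = 0.0
    for x, y in samples:
        p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
        error = y - p
        total_error += abs(error)
        w = [wi + learning_rate * error * xi for wi, xi in zip(w, x)]
    return w, total_error


def logit_iterations(shards, n_features, max_iterations):
    """Yield one tuple per iteration, then an EOF tuple.

    Each iteration sends the current weights to every shard, averages
    the weights that come back, and sums the reported errors.
    """
    weights = [0.0] * n_features
    for iteration in range(1, max_iterations + 1):
        results = [shard_pass(weights, samples) for samples in shards]
        per_shard_weights = [w for w, _ in results]
        # Average each weight position across shards.
        weights = [sum(col) / len(col) for col in zip(*per_shard_weights)]
        error = sum(e for _, e in results)
        yield {"iteration": iteration, "weights": weights, "error": error}
    yield {"EOF": True}


# Hypothetical toy data: two shards, a bias term plus one feature.
shards = [
    [([1.0, 0.5], 1), ([1.0, -1.0], 0)],
    [([1.0, 1.5], 1), ([1.0, -2.0], 0)],
]
tuples = list(logit_iterations(shards, n_features=2, max_iterations=5))
```

A caller can watch the "error" field across tuples to follow the model as it changes, which mirrors the idea of streaming the model back to the client after each iteration.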