[jira] [Commented] (SOLR-8492) Add LogisticRegressionQuery and LogitStream

Cao Manh Dat (JIRA) Tue, 05 Jan 2016 23:41:52 -0800

    [ 
https://issues.apache.org/jira/browse/SOLR-8492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15085146#comment-15085146
 ]


Cao Manh Dat commented on SOLR-8492:
------------------------------------

What a wonderful patch. I'm very excited about implementing ML algorithms 
using streaming.

A couple of comments for this patch:
{code}
//wi = alpha(outcome - sigmoid)*wi + xi
double sig = sigmoid(sum(multiply(vals, weights)));
error = outcome - sig;

workingWeights = sum(vals, multiply(error * alpha, weights));

for(int i=0; i<workingWeights.length; i++) {
  weights[i] = workingWeights[i];
}
{code}
I don't think this formula is correct. Should it be:
{code}
// wi = wi - alpha*(sigmoid-outcome) * xi
double sig = sigmoid(sum(multiply(vals, weights)));
error = sig - outcome;

workingWeights = multiply(error * alpha, vals);

for(int i=0; i<workingWeights.length; i++) {
  weights[i] -= workingWeights[i];
}
{code}
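
For reference, this is where the sign and the xi factor come from, assuming the usual log loss for a single example. The symbols y (outcome), z and L are my own notation, not names from the patch:

```latex
\begin{aligned}
z &= \sum_i w_i x_i, \qquad \sigma(z) = \frac{1}{1 + e^{-z}} \\
L &= -\bigl[\, y \log \sigma(z) + (1 - y) \log\bigl(1 - \sigma(z)\bigr) \bigr] \\
\frac{\partial L}{\partial w_i} &= \bigl(\sigma(z) - y\bigr)\, x_i \\
w_i &\leftarrow w_i - \alpha \bigl(\sigma(z) - y\bigr)\, x_i
\end{aligned}
```

So the step is proportional to the feature value xi, not to the current weight wi.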

This is an implementation of stochastic gradient descent (which updates the 
weights one example at a time). Should we just move the update part into 
collect(int doc)?
{code}
public void collect(int doc) {
  // do the update here
}
{code}
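
A minimal standalone sketch of what I mean, under some assumptions: SgdLogitCollector is a hypothetical plain class, not Solr's DelegatingCollector, and the features and outcomes arrays stand in for the per-document field values the real collector would read. Only the shape of collect(int doc) is the point here:

```java
import java.util.Arrays;

/**
 * Hypothetical stand-in for the collector in the patch. It does not extend
 * any Solr class; it only mimics the collect(int doc) callback.
 */
final class SgdLogitCollector {
  private final double[][] features; // features[doc] = feature values of that doc (assumed in memory)
  private final int[] outcomes;      // outcomes[doc] = 0 or 1
  private final double[] weights;    // one weight per feature, starts at zero
  private final double alpha;        // learning rate

  SgdLogitCollector(double[][] features, int[] outcomes, double alpha) {
    this.features = features;
    this.outcomes = outcomes;
    this.alpha = alpha;
    this.weights = new double[features[0].length];
  }

  static double sigmoid(double z) {
    return 1.0 / (1.0 + Math.exp(-z));
  }

  /** Called once per matching document; applies one SGD step for that document. */
  void collect(int doc) {
    double[] x = features[doc];
    double z = 0.0;
    for (int i = 0; i < x.length; i++) {
      z += weights[i] * x[i];
    }
    double error = sigmoid(z) - outcomes[doc];
    // wi = wi - alpha * (sigmoid - outcome) * xi
    for (int i = 0; i < x.length; i++) {
      weights[i] -= alpha * error * x[i];
    }
  }

  double[] weights() {
    return weights.clone();
  }

  public static void main(String[] args) {
    // Toy data: the first column is a constant bias feature.
    double[][] features = {{1.0, 0.0}, {1.0, 1.0}, {1.0, 2.0}, {1.0, 3.0}};
    int[] outcomes = {0, 0, 1, 1};
    SgdLogitCollector collector = new SgdLogitCollector(features, outcomes, 0.1);
    for (int iter = 0; iter < 1000; iter++) {
      for (int doc = 0; doc < features.length; doc++) {
        collector.collect(doc);
      }
    }
    System.out.println(Arrays.toString(collector.weights()));
  }
}
```

With this shape, the weights are already updated by the time the last document is collected, so nothing needs to be buffered until the end of the pass.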

> Add LogisticRegressionQuery and LogitStream
> -------------------------------------------
>
>                 Key: SOLR-8492
>                 URL: https://issues.apache.org/jira/browse/SOLR-8492
>             Project: Solr
>          Issue Type: New Feature
>            Reporter: Joel Bernstein
>         Attachments: SOLR-8492.patch
>
>
> This ticket is to add a new query called a LogisticRegressionQuery (LRQ).
> The LRQ extends AnalyticsQuery 
> (http://joelsolr.blogspot.com/2015/12/understanding-solrs-analyticsquery.html)
>  and returns a DelegatingCollector that implements a Stochastic Gradient 
> Descent (SGD) optimizer for Logistic Regression.
> This ticket also adds the LogitStream which leverages Streaming Expressions 
> to provide iteration over the shards. Each call to LogitStream.read() calls 
> down to the shards and executes the LogisticRegressionQuery. The model data 
> is collected from the shards and the weights are averaged and sent back to 
> the shards with the next iteration. Each call to read() returns a Tuple with 
> the averaged weights and error from the shards. With this approach the 
> LogitStream streams the changing model back to the client after each 
> iteration.
> The LogitStream will return the EOF Tuple when it reaches the defined 
> maxIterations. When sent as a Streaming Expression to the Stream handler this 
> provides parallel iterative behavior. This same approach can be used to 
> implement other parallel iterative algorithms.
> The initial patch has a test which simply tests the mechanics of the 
> iteration. More work will need to be done to ensure the SGD is properly 
> implemented. The distributed approach of the SGD will also need to be 
> reviewed.
> This implementation is designed for use cases with a small number of features 
> because each feature is its own discrete field.
> An implementation which supports a higher number of features would be 
> possible by packing features into a byte array and storing as binary 
> DocValues.
> This implementation is designed to support a large sample set. With a large 
> number of shards, a sample set into the billions may be possible.
> Sample Streaming Expression Syntax:
> {code}
> logit(collection1, features="a,b,c,d,e,f" outcome="x" maxIterations="80")
> {code}
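
The iteration described in the ticket (each shard trains from the current weights, then the weights are averaged and sent back) could be sketched roughly as below. This is only an illustration under assumptions: ShardTrainer and ParameterAveraging are hypothetical names, and nothing here uses the real LogitStream or Streaming Expressions API.

```java
/** Hypothetical stand-in for one shard: takes the current weights, returns its locally updated weights. */
interface ShardTrainer {
  double[] train(double[] weights);
}

final class ParameterAveraging {

  /** Element-wise mean of the weight vectors returned by the shards. */
  static double[] average(double[][] shardWeights) {
    double[] avg = new double[shardWeights[0].length];
    for (double[] w : shardWeights) {
      for (int i = 0; i < avg.length; i++) {
        avg[i] += w[i];
      }
    }
    for (int i = 0; i < avg.length; i++) {
      avg[i] /= shardWeights.length;
    }
    return avg;
  }

  /** Runs maxIterations rounds; each round sends the averaged weights to every shard. */
  static double[] iterate(ShardTrainer[] shards, double[] initial, int maxIterations) {
    double[] weights = initial.clone();
    for (int iter = 0; iter < maxIterations; iter++) {
      double[][] results = new double[shards.length][];
      for (int s = 0; s < shards.length; s++) {
        // Each shard gets its own copy so it cannot mutate the shared model.
        results[s] = shards[s].train(weights.clone());
      }
      weights = average(results);
    }
    return weights;
  }

  public static void main(String[] args) {
    // Two fake shards that nudge the single weight by different amounts.
    ShardTrainer[] shards = {
      w -> new double[]{w[0] + 1.0},
      w -> new double[]{w[0] + 3.0}
    };
    double[] model = iterate(shards, new double[]{0.0}, 2);
    System.out.println(model[0]);
  }
}
```

A plain mean treats every shard equally; if shards hold very different numbers of documents, weighting the average by document count may be worth considering when the distributed approach is reviewed.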



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

