[jira] [Updated] (SOLR-8492) Add LogisticRegressionQuery and LogitStream

Joel Bernstein (JIRA) Tue, 05 Jan 2016 13:12:00 -0800

     [ https://issues.apache.org/jira/browse/SOLR-8492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Joel Bernstein updated SOLR-8492:
---------------------------------
    Description: 
This ticket is to add a new query type called a LogisticRegressionQuery (LRQ).

The LRQ extends AnalyticsQuery 
(http://joelsolr.blogspot.com/2015/12/understanding-solrs-analyticsquery.html) 
and returns a DelegatingCollector that implements a Stochastic Gradient Descent 
(SGD) optimizer for Logistic Regression.
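
As an illustration of the kind of update an SGD optimizer for Logistic Regression performs, here is a minimal, self-contained sketch. It is not the code in the patch: the class name SgdLogitSketch, the method names, the learning rate alpha and the squared-residual error measure are all assumptions made for the example, and it works on plain arrays rather than inside a DelegatingCollector.

```java
public class SgdLogitSketch {

    // Logistic (sigmoid) function.
    public static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    // Predicted probability of the positive class; weights[0] is the bias term.
    public static double predict(double[] weights, double[] features) {
        double z = weights[0];
        for (int j = 0; j < features.length; j++) {
            z += weights[j + 1] * features[j];
        }
        return sigmoid(z);
    }

    // One SGD pass: the weights are updated in place after every sample.
    // Returns the sum of squared residuals seen during the pass (an assumed error measure).
    public static double sgdPass(double[] weights, double[][] features, int[] labels, double alpha) {
        double error = 0.0;
        for (int i = 0; i < features.length; i++) {
            double residual = labels[i] - predict(weights, features[i]);
            weights[0] += alpha * residual;
            for (int j = 0; j < features[i].length; j++) {
                weights[j + 1] += alpha * residual * features[i][j];
            }
            error += residual * residual;
        }
        return error;
    }

    public static void main(String[] args) {
        // Toy data: one feature, label is 1 for the larger values.
        double[][] x = {{0.0}, {1.0}, {2.0}, {3.0}};
        int[] y = {0, 0, 1, 1};
        double[] w = new double[2];
        for (int pass = 0; pass < 2000; pass++) {
            sgdPass(w, x, y, 0.1);
        }
        System.out.println("p(x=0) = " + predict(w, x[0]));
        System.out.println("p(x=3) = " + predict(w, x[3]));
    }
}
```

In a collector-based version, the per-sample update would presumably run once per collected document, with the feature values read from the document's fields.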

This ticket also adds the LogitStream which leverages Streaming Expressions to 
provide iteration over the shards. Each call to LogitStream.read() calls down 
to shards and executes the LogisticRegressionQuery. The model data is collected 
from the shards and the weights are averaged and sent back to the shards with 
the next iteration. Each call to read() returns a Tuple with the averaged 
weights and error from the shards. With this approach the LogitStream streams 
the changing model back to the client after each iteration.

The LogitStream will return the EOF Tuple when it reaches the defined 
maxIterations. When sent as a Streaming Expression to the Stream handler this 
provides parallel iterative behavior. This same approach can be used to 
implement other parallel iterative algorithms.
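
The iteration described above (train on each shard, average the weights, return one Tuple per read(), then EOF after maxIterations) can be sketched without any Solr classes. Everything here is a stand-in: the Shard class, the Map used in place of a Tuple, the "weights", "error" and "iteration" keys, summing the shard errors, and the sequential loop over shards (a real stream would call the shards in parallel) are assumptions for illustration, not the API in the patch.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class LogitIterationSketch {

    /** Hypothetical stand-in for a shard: holds local samples and runs one SGD pass. */
    public static class Shard {
        final double[][] features;
        final int[] labels;
        double lastError;

        public Shard(double[][] features, int[] labels) {
            this.features = features;
            this.labels = labels;
        }

        /** Starts from the given weights, runs one local pass, returns an updated copy. */
        public double[] train(double[] start, double alpha) {
            double[] w = start.clone();
            lastError = 0.0;
            for (int i = 0; i < features.length; i++) {
                double z = w[0];
                for (int j = 0; j < features[i].length; j++) {
                    z += w[j + 1] * features[i][j];
                }
                double residual = labels[i] - 1.0 / (1.0 + Math.exp(-z));
                w[0] += alpha * residual;
                for (int j = 0; j < features[i].length; j++) {
                    w[j + 1] += alpha * residual * features[i][j];
                }
                lastError += residual * residual;
            }
            return w;
        }
    }

    private final List<Shard> shards;
    private final int maxIterations;
    private final double alpha;
    private double[] weights;
    private int iteration = 0;

    public LogitIterationSketch(List<Shard> shards, int numFeatures, int maxIterations, double alpha) {
        this.shards = shards;
        this.maxIterations = maxIterations;
        this.alpha = alpha;
        this.weights = new double[numFeatures + 1]; // index 0 is the bias term
    }

    /** One iteration per call; returns an EOF marker once maxIterations is reached. */
    public Map<String, Object> read() {
        Map<String, Object> tuple = new HashMap<>();
        if (iteration >= maxIterations) {
            tuple.put("EOF", true);
            return tuple;
        }
        double[] sum = new double[weights.length];
        double error = 0.0;
        for (Shard shard : shards) {
            // Each shard starts from the averaged weights of the previous iteration.
            double[] w = shard.train(weights, alpha);
            for (int j = 0; j < sum.length; j++) {
                sum[j] += w[j];
            }
            error += shard.lastError;
        }
        for (int j = 0; j < sum.length; j++) {
            sum[j] /= shards.size();
        }
        weights = sum;
        iteration++;
        tuple.put("iteration", iteration);
        tuple.put("weights", weights.clone());
        tuple.put("error", error);
        return tuple;
    }
}
```

A client would call read() in a loop and stop when the returned map contains the EOF marker, seeing the model change after each iteration.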

The initial patch has a test that only exercises the mechanics of the 
iteration. More work is needed to verify that the SGD is implemented 
correctly. The distributed approach to the SGD also needs review.
 

This implementation is designed for use cases with a small number of features 
because each feature is its own discrete field.

An implementation which supports a higher number of features would be possible 
by packing features into a byte array and storing as binary DocValues.
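
A rough sketch of the packing idea, assuming the features are stored as 4-byte floats in a fixed order: the class and method names are hypothetical, and the step of writing the resulting byte array into a binary DocValues field is deliberately left out.

```java
import java.nio.ByteBuffer;

public class FeaturePackingSketch {

    // Packs the feature vector into a byte array, 4 bytes per float, big-endian.
    public static byte[] pack(float[] features) {
        ByteBuffer buf = ByteBuffer.allocate(features.length * Float.BYTES);
        for (float f : features) {
            buf.putFloat(f);
        }
        return buf.array();
    }

    // Reads the floats back in the order they were written.
    public static float[] unpack(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        float[] features = new float[bytes.length / Float.BYTES];
        for (int i = 0; i < features.length; i++) {
            features[i] = buf.getFloat();
        }
        return features;
    }

    public static void main(String[] args) {
        byte[] packed = pack(new float[] {0.5f, -1.25f, 3.0f});
        System.out.println(packed.length + " bytes for " + unpack(packed).length + " features");
    }
}
```

With this layout a document carries one binary value regardless of the number of features, at the cost of decoding the array for every collected document.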

This implementation is designed to support a large sample set. With a large 
number of shards, sample sets in the billions may be possible.

  was:
This ticket is to add a new query type called a LogisticRegressionQuery (LRQ).

The LRQ extends AnalyticsQuery 
(http://joelsolr.blogspot.com/2015/12/understanding-solrs-analyticsquery.html) 
and returns a DelegatingCollector that implements a Stochastic Gradient Descent 
(SGD) optimizer for Logistic Regression.

This ticket also adds the LogitStream which leverages Streaming Expressions to 
provide iteration over the shards. Each call to LogitStream.read() calls down 
to shards and executes the LogisticRegressionQuery. The model data is collected 
from the shards and the weights are averaged and sent back to the shards for 
the next iteration. Each call to read() returns a Tuple with the averaged 
weights and error from the shards. With this approach the LogitStream streams 
the changing model back to the client after each iteration.

The LogitStream will return the EOF Tuple when it reaches the defined 
maxIterations. When sent as a Streaming Expression to the Stream handler this 
provides parallel iterative behavior. This same approach can be used to 
implement other parallel iterative algorithms.

The initial patch has a test which simply tests the mechanics of the 
iteration. More work will need to be done to ensure the SGD is properly 
implemented. The distributed approach of the SGD will also need to be reviewed. 
 

This implementation is designed for use cases with a small number of features 
because each feature is its own discrete field.

An implementation which supports a higher number of features would be possible 
by packing features into a byte array and storing as binary DocValues.

This implementation is designed to support a large sample set. With a large 
number of shards, a sample set into the billions may be possible.


> Add LogisticRegressionQuery and LogitStream
> -------------------------------------------
>
>                 Key: SOLR-8492
>                 URL: https://issues.apache.org/jira/browse/SOLR-8492
>             Project: Solr
>          Issue Type: New Feature
>            Reporter: Joel Bernstein
>         Attachments: SOLR-8492.patch
>
>
> This ticket is to add a new query type called a LogisticRegressionQuery (LRQ).
> The LRQ extends AnalyticsQuery 
> (http://joelsolr.blogspot.com/2015/12/understanding-solrs-analyticsquery.html)
>  and returns a DelegatingCollector that implements a Stochastic Gradient 
> Descent (SGD) optimizer for Logistic Regression.
> This ticket also adds the LogitStream which leverages Streaming Expressions 
> to provide iteration over the shards. Each call to LogitStream.read() calls 
> down to shards and executes the LogisticRegressionQuery. The model data is 
> collected from the shards and the weights are averaged and sent back to the 
> shards with the next iteration. Each call to read() returns a Tuple with the 
> averaged weights and error from the shards. With this approach the 
> LogitStream streams the changing model back to the client after each 
> iteration.
> The LogitStream will return the EOF Tuple when it reaches the defined 
> maxIterations. When sent as a Streaming Expression to the Stream handler this 
> provides parallel iterative behavior. This same approach can be used to 
> implement other parallel iterative algorithms.
> The initial patch has a test that only exercises the mechanics of the 
> iteration. More work is needed to verify that the SGD is implemented 
> correctly. The distributed approach to the SGD also needs review.
> This implementation is designed for use cases with a small number of features 
> because each feature is its own discrete field.
> An implementation which supports a higher number of features would be 
> possible by packing features into a byte array and storing as binary 
> DocValues.
> This implementation is designed to support a large sample set. With a large 
> number of shards, sample sets in the billions may be possible.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
