[ https://issues.apache.org/jira/browse/SOLR-8492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cao Manh Dat updated SOLR-8492:
-------------------------------
    Attachment: SOLR-8492.patch

Add positiveLabel to LogitStream (forgot to include it in the previous patch). 
[~joel.bernstein] I'm thinking about a MultiLogitStream, where each worker 
trains a model for one positiveLabel. For example:
{quote}
Assume that we have 5 labels: 0,1,2,3,4.
Worker 1 will train a model for label 0
Worker 2 will train a model for label 1
.....
{quote} 

I think we can't use ParallelStream here, because ParallelStream merges the 
streams from the workers into one long stream:
{quote}
tuple11 - tuple12 \
tuple21 - tuple22 | -----> tuple11 - tuple21 - tuple31 - tuple12 - ... - EOF
tuple31 - tuple32 /
{quote}

But MultiLogitStream would merge the tuples from all workers' streams into a 
single tuple per iteration:
{quote}
t11 - t12 \
t21 - t22 | -----> merge(t11,t21,t31) - merge(t12,t22,t32) - EOF
t31 - t32 /
{quote}

Should we call it {{ParallelReducerStream}}?
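The reducing merge above could be sketched roughly like this (a minimal sketch in plain Java with no Solr classes; the map-based tuple representation, the class name, and the assumption that workers emit disjoint fields are mine, not part of the patch):

{code:java}
import java.util.*;

public class MergeSketch {
    // A tuple is modeled here as a map of field -> value.
    // Merging the i-th tuple from every worker into one tuple assumes
    // the workers emit disjoint fields (e.g. model_0, model_1, ...).
    static Map<String, Object> merge(List<Map<String, Object>> tuples) {
        Map<String, Object> merged = new HashMap<>();
        for (Map<String, Object> t : tuples) {
            merged.putAll(t);
        }
        return merged;
    }

    public static void main(String[] args) {
        // Two workers, two iterations each: worker 1 trains label 0,
        // worker 2 trains label 1.
        List<List<Map<String, Object>>> workers = List.of(
            List.of(Map.of("model_0", "w1-iter1"), Map.of("model_0", "w1-iter2")),
            List.of(Map.of("model_1", "w2-iter1"), Map.of("model_1", "w2-iter2")));

        // Read the streams in lock step: merge(t11, t21), merge(t12, t22), EOF.
        for (int i = 0; i < 2; i++) {
            List<Map<String, Object>> ith = new ArrayList<>();
            for (List<Map<String, Object>> w : workers) {
                ith.add(w.get(i));
            }
            System.out.println(merge(ith));
        }
    }
}
{code}

The key difference from ParallelStream is the lock-step read: one output tuple is produced per iteration across all workers, rather than concatenating each worker's tuples into one long stream.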

> Add LogisticRegressionQuery and LogitStream
> -------------------------------------------
>
>                 Key: SOLR-8492
>                 URL: https://issues.apache.org/jira/browse/SOLR-8492
>             Project: Solr
>          Issue Type: New Feature
>            Reporter: Joel Bernstein
>         Attachments: SOLR-8492.patch, SOLR-8492.patch, SOLR-8492.patch, 
> SOLR-8492.patch, SOLR-8492.patch, SOLR-8492.patch
>
>
> This ticket is to add a new query called a LogisticRegressionQuery (LRQ).
> The LRQ extends AnalyticsQuery 
> (http://joelsolr.blogspot.com/2015/12/understanding-solrs-analyticsquery.html)
>  and returns a DelegatingCollector that implements a Stochastic Gradient 
> Descent (SGD) optimizer for Logistic Regression.
> This ticket also adds the LogitStream which leverages Streaming Expressions 
> to provide iteration over the shards. Each call to LogitStream.read() calls 
> down to the shards and executes the LogisticRegressionQuery. The model data 
> is collected from the shards and the weights are averaged and sent back to 
> the shards with the next iteration. Each call to read() returns a Tuple with 
> the averaged weights and error from the shards. With this approach the 
> LogitStream streams the changing model back to the client after each 
> iteration.
> The LogitStream will return the EOF Tuple when it reaches the defined 
> maxIterations. When sent as a Streaming Expression to the Stream handler this 
> provides parallel iterative behavior. This same approach can be used to 
> implement other parallel iterative algorithms.
> The initial patch has a test which simply tests the mechanics of the 
> iteration. More work will need to be done to ensure the SGD is properly 
> implemented. The distributed approach of the SGD will also need to be 
> reviewed.  
> This implementation is designed for use cases with a small number of features 
> because each feature is its own discrete field.
> An implementation which supports a higher number of features would be 
> possible by packing features into a byte array and storing as binary 
> DocValues.
> This implementation is designed to support a large sample set. With a large 
> number of shards, a sample set into the billions may be possible.
> sample Streaming Expression Syntax:
> {code}
> logit(collection1, features="a,b,c,d,e,f", outcome="x", maxIterations="80")
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
