[ https://issues.apache.org/jira/browse/SOLR-8492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Cao Manh Dat updated SOLR-8492:
-------------------------------
    Attachment: SOLR-8492.patch

Added positiveLabel to LogitStream (I forgot to do that in the previous patch).

[~joel.bernstein] I'm thinking about a MultiLogitStream, in which each worker trains a model for one positiveLabel. For example:
{quote}
Assume that we have 5 labels: 0,1,2,3,4.
Worker 1 will train a model for label 0
Worker 2 will train a model for label 1
.....
{quote}

I think we can't use ParallelStream here, because ParallelStream merges the streams from the workers into one long stream.
{quote}
tuple11 - tuple12 \
tuple21 - tuple22 | -----> tuple11 - tuple21 - tuple31 - tuple12 - ... - EOF
tuple31 - tuple32 /
{quote}

But MultiLogitStream merges the tuples from all the workers' streams into a single tuple:
{quote}
t11 - t12 \
t21 - t22 | -----> merge(t11,t21,t31) - merge(t12,t22,t32) - EOF
t31 - t32 /
{quote}

Should we call it
{code}
ParallelReducerStream
{code}

> Add LogisticRegressionQuery and LogitStream
> -------------------------------------------
>
>                 Key: SOLR-8492
>                 URL: https://issues.apache.org/jira/browse/SOLR-8492
>             Project: Solr
>          Issue Type: New Feature
>            Reporter: Joel Bernstein
>         Attachments: SOLR-8492.patch, SOLR-8492.patch, SOLR-8492.patch,
> SOLR-8492.patch, SOLR-8492.patch, SOLR-8492.patch
>
>
> This ticket is to add a new query called a LogisticRegressionQuery (LRQ).
> The LRQ extends AnalyticsQuery
> (http://joelsolr.blogspot.com/2015/12/understanding-solrs-analyticsquery.html)
> and returns a DelegatingCollector that implements a Stochastic Gradient
> Descent (SGD) optimizer for Logistic Regression.
> This ticket also adds the LogitStream which leverages Streaming Expressions
> to provide iteration over the shards. Each call to LogitStream.read() calls
> down to the shards and executes the LogisticRegressionQuery. The model data
> is collected from the shards and the weights are averaged and sent back to
> the shards with the next iteration.
> Each call to read() returns a Tuple with
> the averaged weights and error from the shards. With this approach the
> LogitStream streams the changing model back to the client after each
> iteration.
> The LogitStream will return the EOF Tuple when it reaches the defined
> maxIterations. When sent as a Streaming Expression to the Stream handler this
> provides parallel iterative behavior. This same approach can be used to
> implement other parallel iterative algorithms.
> The initial patch has a test which simply tests the mechanics of the
> iteration. More work will need to be done to ensure the SGD is properly
> implemented. The distributed approach of the SGD will also need to be
> reviewed.
> This implementation is designed for use cases with a small number of features
> because each feature is its own discrete field.
> An implementation which supports a higher number of features would be
> possible by packing features into a byte array and storing as binary
> DocValues.
> This implementation is designed to support a large sample set. With a large
> number of shards, a sample set in the billions may be possible.
> Sample Streaming Expression syntax:
> {code}
> logit(collection1, features="a,b,c,d,e,f", outcome="x", maxIterations="80")
> {code}
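
The two merge behaviours contrasted in the comment (one long stream versus one merged tuple per position) can be illustrated with a small stand-alone sketch. This is not Solr code: the class `MergeSketch` and its methods `interleave` and `zipMerge` are hypothetical names, tuples are modelled as plain strings, and the round-robin order only mirrors the diagram in the comment rather than the actual ordering logic of ParallelStream.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; none of these names exist in Solr.
public class MergeSketch {

    // Behaviour drawn for ParallelStream in the comment: every worker tuple
    // is kept and all of them end up in one long stream (round-robin here,
    // which is an assumption made to match the diagram).
    static List<String> interleave(List<List<String>> workers) {
        int longest = 0;
        for (List<String> w : workers) {
            longest = Math.max(longest, w.size());
        }
        List<String> out = new ArrayList<>();
        for (int i = 0; i < longest; i++) {
            for (List<String> w : workers) {
                if (i < w.size()) {
                    out.add(w.get(i));
                }
            }
        }
        return out;
    }

    // Behaviour proposed for MultiLogitStream: the i-th tuple of every worker
    // is combined into a single output tuple. The stream ends as soon as any
    // worker runs out of tuples (an assumption of this sketch).
    static List<String> zipMerge(List<List<String>> workers) {
        List<String> out = new ArrayList<>();
        if (workers.isEmpty()) {
            return out;
        }
        int shortest = Integer.MAX_VALUE;
        for (List<String> w : workers) {
            shortest = Math.min(shortest, w.size());
        }
        for (int i = 0; i < shortest; i++) {
            List<String> parts = new ArrayList<>();
            for (List<String> w : workers) {
                parts.add(w.get(i));
            }
            out.add("merge(" + String.join(",", parts) + ")");
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<String>> workers = Arrays.asList(
            Arrays.asList("t11", "t12"),
            Arrays.asList("t21", "t22"),
            Arrays.asList("t31", "t32"));
        System.out.println(interleave(workers));
        System.out.println(zipMerge(workers));
    }
}
```

With three workers emitting two tuples each, `interleave` yields six tuples while `zipMerge` yields two merged tuples, which is the distinction behind the proposed ParallelReducerStream name.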