[ https://issues.apache.org/jira/browse/SOLR-8492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15089149#comment-15089149 ]
Joel Bernstein commented on SOLR-8492:
--------------------------------------

I should have a chance to review the latest patch on this ticket over the next couple of days.

> Add LogisticRegressionQuery and LogitStream
> -------------------------------------------
>
>                 Key: SOLR-8492
>                 URL: https://issues.apache.org/jira/browse/SOLR-8492
>             Project: Solr
>          Issue Type: New Feature
>            Reporter: Joel Bernstein
>         Attachments: SOLR-8492.patch, SOLR-8492.patch, SOLR-8492.patch
>
>
> This ticket is to add a new query called a LogisticRegressionQuery (LRQ).
> The LRQ extends AnalyticsQuery
> (http://joelsolr.blogspot.com/2015/12/understanding-solrs-analyticsquery.html)
> and returns a DelegatingCollector that implements a Stochastic Gradient
> Descent (SGD) optimizer for Logistic Regression.
> This ticket also adds the LogitStream, which leverages Streaming Expressions
> to provide iteration over the shards. Each call to LogitStream.read() calls
> down to the shards and executes the LogisticRegressionQuery. The model data
> is collected from the shards, and the weights are averaged and sent back to
> the shards with the next iteration. Each call to read() returns a Tuple with
> the averaged weights and error from the shards. With this approach the
> LogitStream streams the changing model back to the client after each
> iteration.
> The LogitStream will return the EOF Tuple when it reaches the defined
> maxIterations. When sent as a Streaming Expression to the Stream handler this
> provides parallel iterative behavior. This same approach can be used to
> implement other parallel iterative algorithms.
> The initial patch has a test which simply tests the mechanics of the
> iteration. More work will need to be done to ensure the SGD is properly
> implemented. The distributed approach of the SGD will also need to be
> reviewed.
> This implementation is designed for use cases with a small number of features
> because each feature is its own discrete field.
> An implementation which supports a higher number of features would be
> possible by packing features into a byte array and storing it as binary
> DocValues.
> This implementation is designed to support a large sample set. With a large
> number of shards, a sample set into the billions may be possible.
> Sample Streaming Expression syntax:
> {code}
> logit(collection1, features="a,b,c,d,e,f" outcome="x" maxIterations="80")
> {code}
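
For readers following the description above, the iterate-and-average loop can be sketched roughly as follows. This is only an illustrative Python simulation, not the code in the attached patch: the names shard_sgd_pass and logit_iterations, the learning rate, the unweighted averaging of shard weights, and the use of summed absolute residuals as the "error" are all assumptions made for the example. Each in-memory list stands in for a shard, and each yielded dict stands in for a Tuple returned by read().

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def shard_sgd_pass(weights, rows, learning_rate=0.1):
    """Run one SGD pass over a shard's rows, starting from the given weights.

    Hypothetical stand-in for what a shard does per iteration.
    Returns (updated_weights, summed_absolute_residual).
    """
    w = list(weights)
    error = 0.0
    for features, outcome in rows:
        x = [1.0] + list(features)  # prepend a bias term
        predicted = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
        residual = outcome - predicted
        error += abs(residual)
        # Gradient step for the logistic log-likelihood on this one sample.
        for i, xi in enumerate(x):
            w[i] += learning_rate * residual * xi
    return w, error


def logit_iterations(shards, num_features, max_iterations):
    """Yield one dict per iteration, then an EOF marker.

    Every iteration sends the current weights to each shard, collects the
    shard-local weights, and averages them (unweighted: an assumption).
    """
    weights = [0.0] * (num_features + 1)
    for iteration in range(1, max_iterations + 1):
        results = [shard_sgd_pass(weights, rows) for rows in shards]
        weights = [
            sum(shard_w[i] for shard_w, _ in results) / len(results)
            for i in range(len(weights))
        ]
        error = sum(shard_err for _, shard_err in results)
        yield {"iteration": iteration, "weights": weights, "error": error}
    yield {"EOF": True}


if __name__ == "__main__":
    # Two tiny "shards" with one feature each; outcome is 1 when the feature is positive.
    demo_shards = [
        [([2.0], 1), ([-2.0], 0), ([1.5], 1), ([-1.0], 0)],
        [([1.0], 1), ([-1.5], 0), ([3.0], 1), ([-3.0], 0)],
    ]
    for tup in logit_iterations(demo_shards, num_features=1, max_iterations=5):
        print(tup)
```

Because the caller consumes a generator, it sees the changing model after every iteration, which mirrors the streaming behavior described in the ticket. Whether plain averaging of shard weights is the right distributed strategy is exactly the open question the ticket says still needs review.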