[ https://issues.apache.org/jira/browse/SOLR-8492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15085146#comment-15085146 ]

Cao Manh Dat commented on SOLR-8492:
------------------------------------

What a wonderful patch. I'm very excited about implementing ML algorithms using streaming. A couple of comments on this patch:

{code}
//wi = alpha(outcome - sigmoid)*wi + xi
double sig = sigmoid(sum(multiply(vals, weights)));
error = outcome - sig;

workingWeights = sum(vals, multiply(error * alpha, weights));

for(int i=0; i<workingWeights.length; i++) {
  weights[i] = workingWeights[i];
}
{code}

I don't think this formula is correct. Should it be:

{code}
// wi = wi - alpha*(sigmoid-outcome) * xi
double sig = sigmoid(sum(multiply(vals, weights)));
error = sig - outcome;

workingWeights = multiply(error * alpha, vals);

for(int i=0; i<workingWeights.length; i++) {
  weights[i] -= workingWeights[i];
}
{code}

This is the implementation of stochastic gradient descent (which updates the weights using a single example). Should we just move the update part to collect(int doc)?

{code}
public void collect(int doc) {
  // do the update here
}
{code}

> Add LogisticRegressionQuery and LogitStream
> -------------------------------------------
>
> Key: SOLR-8492
> URL: https://issues.apache.org/jira/browse/SOLR-8492
> Project: Solr
> Issue Type: New Feature
> Reporter: Joel Bernstein
> Attachments: SOLR-8492.patch
>
>
> This ticket is to add a new query called a LogisticRegressionQuery (LRQ).
> The LRQ extends AnalyticsQuery
> (http://joelsolr.blogspot.com/2015/12/understanding-solrs-analyticsquery.html)
> and returns a DelegatingCollector that implements a Stochastic Gradient
> Descent (SGD) optimizer for Logistic Regression.
> This ticket also adds the LogitStream which leverages Streaming Expressions
> to provide iteration over the shards. Each call to LogitStream.read() calls
> down to the shards and executes the LogisticRegressionQuery. The model data
> is collected from the shards and the weights are averaged and sent back to
> the shards with the next iteration.
> Each call to read() returns a Tuple with
> the averaged weights and error from the shards. With this approach the
> LogitStream streams the changing model back to the client after each
> iteration.
> The LogitStream will return the EOF Tuple when it reaches the defined
> maxIterations. When sent as a Streaming Expression to the Stream handler this
> provides parallel iterative behavior. This same approach can be used to
> implement other parallel iterative algorithms.
> The initial patch has a test which simply tests the mechanics of the
> iteration. More work will need to be done to ensure the SGD is properly
> implemented. The distributed approach of the SGD will also need to be
> reviewed.
> This implementation is designed for use cases with a small number of features
> because each feature is its own discrete field.
> An implementation which supports a higher number of features would be
> possible by packing features into a byte array and storing as binary
> DocValues.
> This implementation is designed to support a large sample set. With a large
> number of shards, a sample set into the billions may be possible.
> Sample Streaming Expression Syntax:
> {code}
> logit(collection1, features="a,b,c,d,e,f" outcome="x" maxIterations="80")
> {code}
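
For reference, the single-example update proposed in the comment (wi = wi - alpha*(sigmoid-outcome) * xi) can be written as a small standalone Java sketch. The class and method names (SgdLogisticSketch, sigmoid, dot, update) and the toy data are assumptions made for illustration; this is not code from the SOLR-8492 patch, and it ignores the collector and shard-averaging mechanics entirely.

```java
import java.util.Arrays;

public class SgdLogisticSketch {

    // Logistic (sigmoid) function: maps any real value into (0, 1).
    public static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    // Dot product of a feature vector and the weight vector.
    public static double dot(double[] x, double[] w) {
        double s = 0.0;
        for (int i = 0; i < x.length; i++) {
            s += x[i] * w[i];
        }
        return s;
    }

    // One SGD step on a single example, updating the weights in place:
    //   w_i = w_i - alpha * (sigmoid(w . x) - outcome) * x_i
    // Returns the signed error (prediction minus outcome) for this example.
    public static double update(double[] weights, double[] x, double outcome, double alpha) {
        double error = sigmoid(dot(x, weights)) - outcome;
        for (int i = 0; i < weights.length; i++) {
            weights[i] -= alpha * error * x[i];
        }
        return error;
    }

    public static void main(String[] args) {
        // Toy data (assumed): the first feature is a constant bias term and
        // the outcome is 1 when the second feature is positive.
        double[][] xs = {{1, 2}, {1, 1}, {1, -1}, {1, -2}};
        double[] ys = {1, 1, 0, 0};
        double[] weights = new double[2];
        double alpha = 0.1;

        // Several passes over the data, one weight update per example.
        for (int iter = 0; iter < 200; iter++) {
            for (int d = 0; d < xs.length; d++) {
                update(weights, xs[d], ys[d], alpha);
            }
        }

        // Print the fitted probability for each training example.
        for (int d = 0; d < xs.length; d++) {
            System.out.printf("x=%s y=%.0f p=%.3f%n",
                Arrays.toString(xs[d]), ys[d], sigmoid(dot(xs[d], weights)));
        }
    }
}
```

In a collector-based design, the body of update would be what runs once per document inside collect(int doc), with the feature values and outcome read for that document.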