[ https://issues.apache.org/jira/browse/SOLR-12197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joel Bernstein updated SOLR-12197:
----------------------------------
    Description:
Currently the *train* Streaming Expression trains a logistic regression model by iterating over the entire distributed training set on each training iteration. Each training iteration involves building a matrix on each shard with the number of rows equal to the size of the training set contained on the shard. The number of columns will be the number of features. This scenario can create very large matrices when working with large training sets and feature sets.

This ticket will add a *sample* parameter which will limit the size of the training set on each iteration to a random sample of the training set. This will allow for much larger training sets.

  was:
Currently the *train* Streaming Expression trains a logistic regression model by iterating over the entire distributed training set on each pass. Each iteration involves building a matrix on each shard with the number of rows equal to the size of the training set contained on the shard. The number of columns will be the number of features. This scenario can create very large matrices when working with large training sets and feature sets.

This ticket will add a *sample* parameter which will limit the size of the training set on each iteration to a random sample of the training set. This will allow for much larger training sets.

> Implement sampling for logistic regression classifier
> -----------------------------------------------------
>
>                 Key: SOLR-12197
>                 URL: https://issues.apache.org/jira/browse/SOLR-12197
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public (Default Security Level. Issues are Public)
>          Components: streaming expressions
>            Reporter: Joel Bernstein
>            Assignee: Joel Bernstein
>            Priority: Major
>             Fix For: 7.4
>
>
> Currently the *train* Streaming Expression trains a logistic regression model
> by iterating over the entire distributed training set on each training
> iteration. Each training iteration involves building a matrix on each shard
> with the number of rows equal to the size of the training set contained on
> the shard. The number of columns will be the number of features. This
> scenario can create very large matrices when working with large training sets
> and feature sets.
>
> This ticket will add a *sample* parameter which will limit the size of the
> training set on each iteration to a random sample of the training set. This
> will allow for much larger training sets.
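
The effect of per-iteration sampling on matrix size can be sketched outside Solr. The Python below is only an illustration of the idea in the description, not the Solr implementation: the names train_sampled, predict, sample_size, iterations, and learning_rate are hypothetical, and it assumes plain gradient descent on a single shard's data held in memory. Each iteration builds a matrix with sample_size rows instead of one row per training document.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def train_sampled(rows, labels, sample_size, iterations=200,
                  learning_rate=0.5, seed=0):
    """Logistic regression by gradient descent on a fresh random sample
    of the training set in every iteration (hypothetical helper).

    rows: list of feature vectors, labels: list of 0/1 outcomes.
    Returns (weights, largest number of matrix rows built in any iteration).
    """
    rng = random.Random(seed)
    n_features = len(rows[0])
    weights = [0.0] * n_features
    largest = 0
    k = min(sample_size, len(rows))
    for _ in range(iterations):
        # Draw k distinct row indices; only these rows enter the matrix.
        picked = rng.sample(range(len(rows)), k)
        matrix = [rows[i] for i in picked]  # k rows x n_features columns
        largest = max(largest, len(matrix))
        gradient = [0.0] * n_features
        for x, i in zip(matrix, picked):
            score = sum(w * v for w, v in zip(weights, x))
            error = sigmoid(score) - labels[i]
            for j, v in enumerate(x):
                gradient[j] += error * v
        # Average the gradient over the sample and take one descent step.
        weights = [w - learning_rate * g / k
                   for w, g in zip(weights, gradient)]
    return weights, largest


def predict(weights, x):
    # Classify as 1 when the modelled probability is at least 0.5.
    score = sum(w * v for w, v in zip(weights, x))
    return 1 if sigmoid(score) >= 0.5 else 0
```

With 1,000 training rows and sample_size=50, every iteration works on a 50-row matrix, so memory per iteration depends on the sample size and the feature count rather than on the size of the training set.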