[ 
https://issues.apache.org/jira/browse/SOLR-9252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15415096#comment-15415096
 ] 

Cao Manh Dat commented on SOLR-9252:
------------------------------------

I mean we should skip those documents inside the training for loop.

So the loop changes from
{code}
for (Map.Entry<Integer, double[]> entry : docVectors.entrySet()) {
  ...
}
{code}
to
{code}
for (Map.Entry<Integer, double[]> entry : docVectors.entrySet()) {
  if (isZeros(entry.getValue())) continue; // skip all-zero doc vectors
  ...
}
{code}

Because otherwise we will have identical zero vectors that carry different 
labels (both positive and negative).
I will submit a patch soon that includes this change and regularization.
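
To make the proposed check concrete, here is a minimal sketch of what the skip could look like. It assumes isZeros simply tests every component against 0.0; the class name, the trainableDocs method and the sample vectors are hypothetical and are not taken from the patch.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class ZeroVectorSkipSketch {

  // Hypothetical helper: true when every component of the vector is 0.0.
  static boolean isZeros(double[] vector) {
    for (double v : vector) {
      if (v != 0.0) {
        return false;
      }
    }
    return true;
  }

  // Returns the ids of the documents that would still take part in training
  // once all-zero vectors are skipped, in the map's iteration order.
  static List<Integer> trainableDocs(Map<Integer, double[]> docVectors) {
    List<Integer> kept = new ArrayList<>();
    for (Map.Entry<Integer, double[]> entry : docVectors.entrySet()) {
      if (isZeros(entry.getValue())) continue; // carries no feature signal
      kept.add(entry.getKey());
    }
    return kept;
  }

  public static void main(String[] args) {
    // Made-up doc vectors: docs 1 and 3 contain none of the selected terms.
    Map<Integer, double[]> docVectors = new LinkedHashMap<>();
    docVectors.put(1, new double[] {0.0, 0.0, 0.0});
    docVectors.put(2, new double[] {0.0, 1.7, 0.0});
    docVectors.put(3, new double[] {0.0, 0.0, 0.0});
    System.out.println(trainableDocs(docVectors));
  }
}
```

In the real collector the check would sit at the top of the existing loop body rather than in a separate method; the separate method here only makes the effect easy to inspect.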


> Feature selection and logistic regression on text
> -------------------------------------------------
>
>                 Key: SOLR-9252
>                 URL: https://issues.apache.org/jira/browse/SOLR-9252
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: search, SolrCloud, SolrJ
>            Reporter: Cao Manh Dat
>            Assignee: Joel Bernstein
>              Labels: Streaming
>             Fix For: 6.2
>
>         Attachments: SOLR-9252.patch, SOLR-9252.patch, SOLR-9252.patch, 
> SOLR-9252.patch, SOLR-9252.patch, SOLR-9252.patch, SOLR-9252.patch, 
> SOLR-9252.patch, SOLR-9252.patch, SOLR-9252.patch
>
>
> This ticket adds two new streaming expressions: *features* and *train*.
> These two functions work together to train a logistic regression model on 
> text, from a training set stored in a SolrCloud collection.
> The syntax is as follows:
> {code}
> train(collection1, q="*:*",
>       features(collection1, 
>                q="*:*",  
>                field="body", 
>                outcome="out_i", 
>                positiveLabel=1, 
>                numTerms=100),
>       field="body",
>       outcome="out_i",
>       maxIterations=100)
> {code}
> The *features* function extracts the feature terms from a training set using 
> *information gain* to score the terms. 
> http://www.jiliang.xyz/publication/feature_selection_for_classification.pdf
> The *train* function uses the extracted features to train a logistic 
> regression model on a text field in the training set.
> For both *features* and *train* the training set is defined by a query. The 
> doc vectors in the *train* function use tf-idf to represent the terms in the 
> document. The idf is calculated for the specific training set, allowing 
> multiple training sets to be stored in the same collection without polluting 
> the idf. 
> In the *train* function a batch gradient descent approach is used to 
> iteratively train the model.
> Both the *features* and the *train* functions are embedded in Solr using the 
> AnalyticsQuery framework, so only the model is transported across the network 
> with each iteration.
> Both the features and the models can be stored in a SolrCloud collection. 
> Using this approach, Solr can hold millions of models which can be selectively 
> deployed. For example, a model could be trained for each user to personalize 
> ranking and recommendations.
> Below is the final iteration of a model trained on the Enron Ham/Spam 
> dataset. The model includes the terms and their idfs and weights as well as a 
> classification evaluation describing the accuracy of the model on the training 
> set. 
> {code}
> {
>                       "idfs_ds": [1.2627703388716238, 1.2043595767152093, 
> 1.3886172425360304, 1.5488587854881268, 1.6127302558747882, 
> 2.1359177807201526, 1.514866246141212, 1.7375701403808523, 
> 1.6166175299631897, 1.756428159015249, 1.7929202354640175, 
> 1.2834893120635762, 1.899442866302021, 1.8639061320252337, 
> 1.7631697575821685, 1.6820002892260415, 1.4411352768194767, 
> 2.103708877350535, 1.2225773869965861, 2.208893321170597, 1.878981794430681, 
> 2.043737027506736, 2.2819184561854864, 2.3264563106163885, 
> 1.9336117619172708, 2.0467265663551024, 1.7386696457142692, 
> 2.468795829515302, 2.069437610615317, 2.6294363202479327, 3.7388303845193307, 
> 2.5446615802900157, 1.7430797961918219, 3.0787440662202736, 
> 1.9579702057493114, 2.289523055570706, 1.5362003886162032, 
> 2.7549569891263763, 3.955894889757158, 2.587435396273302, 3.945844553903657, 
> 1.003513057076781, 3.0416264032637708, 2.248395764146843, 4.018415246738492, 
> 2.2876164773001246, 3.3636289340509933, 1.2438124251270097, 
> 2.733903579928544, 3.439026951535205, 0.6709665389201712, 0.9546224358275518, 
> 2.8080115520822657, 2.477970205791343, 2.2631561797299637, 
> 3.2378087608499606, 0.36177021415584676, 4.1083634834014315, 
> 4.120197941048435, 2.471081544796158, 2.4241455557775633, 2.923393626201111, 
> 2.9269972337044097, 3.2987413118451183, 2.383498249003407, 4.168988105217867, 
> 2.877691472720256, 4.233526626355437, 3.8505343740993316, 2.3264563106163885, 
> 2.6429318017228174, 4.260555298743357, 3.0058372954121855, 
> 3.8688835127675283, 3.021585652380325, 3.0295538220295017, 
> 1.9620882623582288, 3.469610374907285, 3.945844553903657, 3.4821105376715167, 
> 4.3169082352944885, 2.520329479630485, 3.609372317282444, 3.070375816549757, 
> 4.220281399605417, 3.9866665484239117, 3.6165408067610563, 3.788840805093992, 
> 4.392131656532076, 4.392131656532076, 2.837281934382379, 3.698984475972131, 
> 4.331507034715641, 2.360699334038601, 2.7368842080666815, 3.730733174286711, 
> 3.1991566064156816, 4.4238803548466565, 2.4665153268165767, 
> 3.175736332207583, 3.2378087608499606, 4.376627469996111, 3.3525177086259226, 
> 3.28315658082842, 4.156565585219309, 1.6462639699299098, 2.673278958112109, 
> 4.331507034715641, 3.955894889757158, 2.7764631943473397, 3.0497565293470212, 
> 1.79060004880832, 3.6237610547345436, 1.6244377066690232, 2.948895919012047, 
> 3.175736332207583, 2.850571166501062, 4.073677925413541, 2.725014632511298, 
> 3.1573871935393867, 4.562030693327474, 3.5403794457954922, 4.580722826339627, 
> 4.580722826339627, 3.189722574182323, 3.1665196771026594, 3.3306589148134234, 
> 1.9745451708435238, 3.3306589148134234, 2.795272526304836, 
> 3.3415285870503273, 4.407880013500216, 4.4238803548466565, 
> 2.6902285164258823, 3.668212817305377, 4.543681554659277, 2.559550192783766, 
> 1.5452257206382456, 2.2631561797299637, 4.659194441781121, 
> 3.2678110111537597, 3.878185905429842, 3.3525177086259226, 3.374865007317919, 
> 3.780330115426083, 4.376627469996111, 3.433020927474993, 3.6758174166905966, 
> 4.288334862850433, 3.2378087608499606, 4.490571729345329, 2.9269972337044097, 
> 4.029226162842708, 3.0538465145985465, 4.440140875718437, 3.533734903076824, 
> 4.659194441781121, 4.659194441781121, 4.525663049156599, 3.706827653433157, 
> 3.1172927363375087, 4.490571729345329, 2.552078177945065, 2.087985282971078, 
> 4.83744267318744, 4.562030693327474, 4.09666744363824, 4.659194441781121, 
> 1.802255192400069, 4.599771021310321, 3.788840805093992, 4.8621352857778115, 
> 4.6798137289838575, 4.376627469996111, 3.272900080661231, 3.8970543897342247, 
> 4.638991734463602, 4.638991734463602, 4.813345121608379, 4.813345121608379, 
> 4.8621352857778115, 4.83744267318744, 3.588170109631841, 4.13217413209515, 
> 4.599771021310321, 4.331507034715641, 3.134914337687328, 4.525663049156599, 
> 4.722373343402653, 3.955894889757158, 4.967495801435638, 4.580722826339627, 
> 4.967495801435638, 4.9134285801653625, 4.887453093762102, 4.407880013500216, 
> 4.246949646687578, 2.198385343572182, 1.5963758750107606, 4.007719957621744],
>                       "alpha_d": 7.150861416624748E-4,
>                       "terms_ss": ["enron", "2000", "cc", "hpl", "daren", 
> "http", "gas", "forwarded", "pm", "ect", "hou", "thanks", "meter", "2001", 
> "attached", "deal", "am", "farmer", "your", "nom", "corp", "more", "mmbtu", 
> "xls", "here", "j", "let", "volumes", "questions", "www", "2004", "sitara", 
> "no", "money", "01", "volume", "know", "best", "meds", "bob", "prescription", 
> "please", "online", "file", "viagra", "02", "stop", "me", "nomination", "v", 
> "on", "i", "click", "texas", "03", "prices", "for", "paliourg", "php", "09", 
> "contract", "fyi", "actuals", "u", "04", "pain", "713", "drugs", "microsoft", 
> "email", "robert", "cialis", "melissa", "investment", "teco", "pat", "11", 
> "save", "professional", "world", "biz", "flow", "dollars", "noms", "2005", 
> "act", "remove", "results", "soft", "xp", "mary", "80", "spam", "following", 
> "06", "software", "n", "dealer", "08", "ena", "offer", "sex", "products", 
> "special", "compliance", "see", "free", "cheap", "html", "07", "gary", "000", 
> "low", "our", "houston", "many", "april", "size", "r", "tap", "lots", 
> "product", "pills", "xanax", "vance", "ami", "chokshi", "12", "clynes", 
> "ticket", "counterparty", "super", "thousand", "daily", "offers", "weight", 
> "05", "all", "call", "photoshop", "julie", "stock", "lisa", "steve", 
> "million", "health", "site", "quality", "stocks", "link", "featured", "net", 
> "international", "most", "investing", "works", "readers", "uncertainties", 
> "differ", "news", "david", "seek", "31", "only", "1933", "creative", 
> "windows", "subscribers", "should", "adobe", "security", "1934", "valium", 
> "brand", "visit", "action", "canon", "pharmacy", "sexual", "inherent", 
> "construed", "assumptions", "internet", "mobile", "risks", "wide", "smith", 
> "ex", "pill", "states", "projections", "medications", "predictions", 
> "anticipates", "deciding", "events", "advice", "now", "com", "browser"],
>                       "iteration_i": 100,
>                       "weights_ds": [0.9524452699893067, -2.9257423290160225, 
> -2.122240862520573, -0.40259380863176036, -1.242508927269482, 
> -2.1933952666745924, 0.9119553386109202, -1.3359582128074137, 
> -1.1717690853817335, -0.9029380383621088, -1.970576222154978, 
> -0.9180539343040344, -2.031736167842155, -1.382820037232718, 
> -1.4296530557007743, -1.5015080966872794, -0.852373483913152, 
> -0.2883706803921614, -0.2366741375717678, 0.2966401203916763, 
> -0.6792566685980972, -0.18912751254722837, 0.10265566994945839, 
> -1.0065678789783332, -0.8967357570889625, 0.041722607774742765, 
> -0.2832721589409925, -0.400560390908784, -0.6945385025086017, 
> -0.8488391208665993, -0.31851465800191403, 1.570768257518063, 
> -1.5144615060332418, 0.9411280928801138, 0.738478999511349, 
> -0.6875177906594712, -0.47841730767672286, -0.20502227184813, 
> 0.4858041557455349, 1.389551367014946, -0.8886199496843126, 
> 0.8029699876855549, -0.7760217032166719, 0.40175437931353053, 
> -0.6231018791954438, 1.0261571991645586, -0.44254206613371744, 
> 0.31955072203529183, -0.24171600421157927, -0.632533557090375, 
> 0.774533771979748, -1.1164595912116915, -0.2954704188664946, 
> 0.27653823698423186, -1.157867306631878, -5.49332153268076E-5, 
> 0.6916900118076985, -1.305726586870522, 1.370623007467874, 
> 1.1100575515185573, 0.40953153124448194, -0.4273267120664356, 
> -0.5536271317082946, -0.03575915648164506, 0.20475308352558616, 
> -0.2919021960690356, 1.1094392826383312, -1.24904822249928, 
> 1.038764158800864, 0.10525284214114823, 0.1973739189626828, 
> -0.33283870614700184, 1.0555375704790861, 0.25856879498650104, 
> 0.921918816504445, -0.15711181528461088, -0.3594966291171786, 
> -0.6659758614594922, -0.3342439009175488, 0.3592708173532555, 
> 0.12872616265365205, 1.362140022970902, -0.2699930594417464, 
> 0.7449118829650243, -0.12665949567352622, 1.1289376146405283, 
> 0.1653713075673579, 0.7008424353370497, 0.47095485852014707, 
> 1.021689093687625, 1.0049928692400525, -0.18114402652386635, 
> 0.4403400905532737, 1.0570966104647033, -1.167541821576636, 
> -0.4428853975686944, 0.20694894484760668, 0.15472835818468766, 
> 1.0009582999260647, 0.013730849275970687, -0.3882888402977611, 
> 0.14102499499877702, 1.1560852477692065, -0.822855520787489, 
> -0.1468595831916683, 0.9069870716505091, -0.18884872126960675, 
> -0.19213990843838719, -0.0032534107278622496, 0.2715800337813452, 
> 0.0888346122807297, -0.37031213468904256, -0.07224227291981163, 
> 0.08850381657180348, 0.20501283264716516, -0.5852130122059844, 
> 0.11807896760332989, -1.3196626232666966, 0.5324969558412787, 
> 0.7667504164777665, 0.11805357030082002, 1.0020954114301253, 
> -0.10885082229805468, 1.003094962524753, 1.0000914796917044, 
> 0.0094959191513861, -0.5127276009526891, 0.059129413669497796, 
> -0.49311249434449955, 0.34652229330274653, -0.7618731785587705, 
> -0.3514318991274448, 0.7742232232987654, 0.7575763908124484, 
> -0.25192129997930635, -0.24220187762559128, 1.0014232005812307, 
> -0.3453736248293833, -0.1121687186012911, -0.15547543099631278, 
> 1.0840890597241875, -0.2879034857435273, -0.227656977034567, 
> -0.3716602841157388, 0.18007113168986144, 0.8297688092273079, 
> 1.405797209837956, 0.3921445898278919, 1.079363745455813, 
> -0.6253022693091732, 0.33155358331572704, 0.9644709831096733, 
> -0.19686285814583682, 1.1069098903214452, -0.19597970694899214, 
> -0.29329229099344734, -0.037185151648282316, 1.0010206696926418, 
> 1.0096586146138415, 0.9523090849946898, 0.34253175617551923, 
> -0.41826608329006, 0.7213729935258942, -0.47416007242000024, 
> 0.3210039942978008, 1.0, 0.9772041721907345, 0.2533596337281238, 
> 0.9839657417973666, -0.7583308570783015, 0.9476391050914625, 
> 0.2534925274818649, 1.0, 1.0001125385832383, 0.37796474985487505, 
> 0.3839828352290301, 0.44224405246124543, 1.046072941713049, 
> 1.1205405856642119, 0.9165436674154628, 0.9586701268580604, 
> 1.0000000000000968, 0.9860828147022696, -0.32499900116244823, 
> 1.1624049652694368, 0.4966278258894532, -0.14840111822378488, 
> 0.15131204240736265, 1.114787005544689, 1.1782663102351227, 
> 0.21291210471466848, 1.0000000000385034, 0.9564718923455356, 
> 1.0110628413440756, 1.000156375636503, 0.9763045864950046, 
> 0.2630059727829917, 0.24199402427272665, 0.2736018381908099, 
> -0.7673296746900424, -0.1899398724099395],
>                       "field_s": "body",
>                       "trueNegative_i": 3570,
>                       "falseNegative_i": 35,
>                       "falsePositive_i": 75,
>                       "error_d": 176.8112932306374,
>                       "truePositive_i": 1381,
>                       "id": "model_100"
>               }
> {code}
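
The four counts in the model tuple above are enough to derive the usual evaluation figures. The following is a minimal sketch, assuming trueNegative_i, falseNegative_i, falsePositive_i and truePositive_i are confusion-matrix counts over the training set; the class and method names are hypothetical and are not part of the patch.

```java
class ConfusionMetricsSketch {

  // Fraction of all documents that were classified correctly.
  static double accuracy(long tp, long tn, long fp, long fn) {
    return (double) (tp + tn) / (tp + tn + fp + fn);
  }

  // Fraction of documents predicted positive that really are positive.
  static double precision(long tp, long fp) {
    return (double) tp / (tp + fp);
  }

  // Fraction of truly positive documents that were predicted positive.
  static double recall(long tp, long fn) {
    return (double) tp / (tp + fn);
  }

  public static void main(String[] args) {
    // Counts copied from the model_100 tuple in the issue description.
    long tp = 1381, tn = 3570, fp = 75, fn = 35;
    System.out.printf("accuracy=%.4f precision=%.4f recall=%.4f%n",
        accuracy(tp, tn, fp, fn), precision(tp, fp), recall(tp, fn));
  }
}
```

Because these counts come from the training set itself, the resulting figures describe fit rather than performance on unseen data.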


