[
https://issues.apache.org/jira/browse/SOLR-9252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Cao Manh Dat updated SOLR-9252:
-------------------------------
Attachment: SOLR-9299-1.patch
A minor patch:
- In the training step, we now ignore documents that don't contain any of the given features.
- Add regularization for logit (http://www.holehouse.org/mlclass/07_Regularization.html)
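To make the two changes concrete, here is a minimal sketch in Python of batch gradient descent for L2-regularized logistic regression that also skips documents with no selected features. It is not the code in the patch: the function names, the learning rate `alpha`, the regularization strength `lam`, and the convention that index 0 holds an unregularized bias term are all assumptions made for illustration.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train_logit(docs, labels, alpha=0.1, lam=0.01, max_iterations=100):
    """Batch gradient descent for L2-regularized logistic regression.

    docs: list of feature vectors (lists of floats); index 0 is the bias input.
    Documents whose non-bias features are all zero are skipped (assumption:
    this mirrors "ignore documents that don't contain any given features").
    """
    data = [(x, y) for x, y in zip(docs, labels)
            if any(v != 0.0 for v in x[1:])]
    if not data:
        raise ValueError("no documents contain any of the given features")
    m = len(data)
    n = len(data[0][0])
    w = [0.0] * n
    for _ in range(max_iterations):
        # Accumulate the gradient over the whole batch before updating.
        grad = [0.0] * n
        for x, y in data:
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, x))) - y
            for j in range(n):
                grad[j] += err * x[j]
        for j in range(n):
            # By convention the bias weight (j == 0) is not regularized.
            penalty = lam * w[j] if j > 0 else 0.0
            w[j] -= alpha * (grad[j] + penalty) / m
    return w

# Toy data: [bias, feature_a, feature_b]; the third document has no features.
docs = [[1.0, 2.0, 0.0], [1.0, 0.0, 2.0], [1.0, 0.0, 0.0],
        [1.0, 1.5, 0.0], [1.0, 0.0, 1.0]]
labels = [1, 0, 1, 1, 0]
weights = train_logit(docs, labels, lam=0.01)
print(weights)
```

A larger `lam` pulls the non-bias weights toward zero, which is the usual motivation for adding the penalty when some terms are rare in the training set.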
> Feature selection and logistic regression on text
> -------------------------------------------------
>
> Key: SOLR-9252
> URL: https://issues.apache.org/jira/browse/SOLR-9252
> Project: Solr
> Issue Type: Improvement
> Security Level: Public(Default Security Level. Issues are Public)
> Components: search, SolrCloud, SolrJ
> Reporter: Cao Manh Dat
> Assignee: Joel Bernstein
> Labels: Streaming
> Fix For: 6.2
>
> Attachments: SOLR-9252.patch, SOLR-9252.patch, SOLR-9252.patch,
> SOLR-9252.patch, SOLR-9252.patch, SOLR-9252.patch, SOLR-9252.patch,
> SOLR-9252.patch, SOLR-9252.patch, SOLR-9252.patch, SOLR-9299-1.patch
>
>
> This ticket adds two new streaming expressions: *features* and *train*.
> These two functions work together to train a logistic regression model on
> text, from a training set stored in a SolrCloud collection.
> The syntax is as follows:
> {code}
> train(collection1, q="*:*",
>       features(collection1,
>                q="*:*",
>                field="body",
>                outcome="out_i",
>                positiveLabel=1,
>                numTerms=100),
>       field="body",
>       outcome="out_i",
>       maxIterations=100)
> {code}
> The *features* function extracts the feature terms from a training set using
> *information gain* to score the terms.
> http://www.jiliang.xyz/publication/feature_selection_for_classification.pdf
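As a rough illustration of scoring a term by information gain, the sketch below uses the textbook definition for a binary outcome: the entropy of the class label minus its expected entropy after splitting the documents on whether they contain the term. The ticket does not spell out the exact formula used by *features*, so treat this as an assumption; the function names and the four-count signature are hypothetical.

```python
import math

def entropy(pos, neg):
    # Shannon entropy (in bits) of a binary class distribution given counts.
    total = pos + neg
    if total == 0:
        return 0.0
    h = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            h -= p * math.log2(p)
    return h

def information_gain(pos_with, neg_with, pos_without, neg_without):
    """Information gain of one term with respect to a binary outcome.

    pos_with / neg_with: positive / negative documents containing the term.
    pos_without / neg_without: the same counts for documents lacking it.
    """
    n_with = pos_with + neg_with
    n_without = pos_without + neg_without
    total = n_with + n_without
    if total == 0:
        return 0.0
    prior = entropy(pos_with + pos_without, neg_with + neg_without)
    conditional = ((n_with / total) * entropy(pos_with, neg_with)
                   + (n_without / total) * entropy(pos_without, neg_without))
    return prior - conditional

def top_terms(term_counts, num_terms):
    # term_counts maps term -> (pos_with, neg_with, pos_without, neg_without).
    scored = sorted(term_counts.items(),
                    key=lambda kv: information_gain(*kv[1]),
                    reverse=True)
    return [term for term, _ in scored[:num_terms]]
```

A term that appears only in positive documents and in all of them scores the full class entropy, while a term distributed identically across both classes scores zero.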
> The *train* function uses the extracted features to train a logistic
> regression model on a text field in the training set.
> For both *features* and *train* the training set is defined by a query. The
> doc vectors in the *train* function use tf-idf to represent the terms in the
> document. The idf is calculated for the specific training set, allowing
> multiple training sets to be stored in the same collection without polluting
> the idf.
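The point about computing idf per training set can be sketched as follows. The idf formula here, log(N / df), is an assumption; the ticket does not state which variant the *train* function uses, and the helper names are hypothetical. What matters for the illustration is that N and df are counted only over the documents matched by the training query, not over the whole collection.

```python
import math
from collections import Counter

def training_set_idfs(training_docs, terms):
    """Compute idf for each feature term over the training set only.

    training_docs: list of token lists, already restricted to the documents
    matched by the training query. Assumed formula: log(N / df).
    """
    n = len(training_docs)
    df = Counter()
    for tokens in training_docs:
        present = set(tokens)
        for term in terms:
            if term in present:
                df[term] += 1
    # Terms absent from the training set get idf 0.0 to avoid division by zero.
    return {t: (math.log(n / df[t]) if df[t] else 0.0) for t in terms}

def tfidf_vector(tokens, terms, idfs):
    # One tf-idf value per feature term, in the order given by `terms`.
    tf = Counter(tokens)
    return [tf[t] * idfs[t] for t in terms]

training_docs = [["gas", "meter"], ["gas", "viagra"],
                 ["viagra", "viagra", "click"], ["gas"]]
terms = ["gas", "viagra"]
idfs = training_set_idfs(training_docs, terms)
vector = tfidf_vector(training_docs[2], terms, idfs)
print(idfs, vector)
```

Because the counts come from the query result, two training sets stored in the same collection each get their own idf values.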
> In the *train* function a batch gradient descent approach is used to
> iteratively train the model.
> Both the *features* and *train* functions are embedded in Solr using the
> AnalyticsQuery framework, so only the model is transported across the
> network with each iteration.
> Both the features and the models can be stored in a SolrCloud collection.
> Using this approach, Solr can hold millions of models, which can be
> selectively deployed. For example, a model could be trained for each user to
> personalize ranking and recommendations.
> Below is the final iteration of a model trained on the Enron Ham/Spam
> dataset. The model includes the terms and their idfs and weights, as well as
> a classification evaluation describing the accuracy of the model on the
> training set.
> {code}
> {
> "idfs_ds": [1.2627703388716238, 1.2043595767152093,
> 1.3886172425360304, 1.5488587854881268, 1.6127302558747882,
> 2.1359177807201526, 1.514866246141212, 1.7375701403808523,
> 1.6166175299631897, 1.756428159015249, 1.7929202354640175,
> 1.2834893120635762, 1.899442866302021, 1.8639061320252337,
> 1.7631697575821685, 1.6820002892260415, 1.4411352768194767,
> 2.103708877350535, 1.2225773869965861, 2.208893321170597, 1.878981794430681,
> 2.043737027506736, 2.2819184561854864, 2.3264563106163885,
> 1.9336117619172708, 2.0467265663551024, 1.7386696457142692,
> 2.468795829515302, 2.069437610615317, 2.6294363202479327, 3.7388303845193307,
> 2.5446615802900157, 1.7430797961918219, 3.0787440662202736,
> 1.9579702057493114, 2.289523055570706, 1.5362003886162032,
> 2.7549569891263763, 3.955894889757158, 2.587435396273302, 3.945844553903657,
> 1.003513057076781, 3.0416264032637708, 2.248395764146843, 4.018415246738492,
> 2.2876164773001246, 3.3636289340509933, 1.2438124251270097,
> 2.733903579928544, 3.439026951535205, 0.6709665389201712, 0.9546224358275518,
> 2.8080115520822657, 2.477970205791343, 2.2631561797299637,
> 3.2378087608499606, 0.36177021415584676, 4.1083634834014315,
> 4.120197941048435, 2.471081544796158, 2.4241455557775633, 2.923393626201111,
> 2.9269972337044097, 3.2987413118451183, 2.383498249003407, 4.168988105217867,
> 2.877691472720256, 4.233526626355437, 3.8505343740993316, 2.3264563106163885,
> 2.6429318017228174, 4.260555298743357, 3.0058372954121855,
> 3.8688835127675283, 3.021585652380325, 3.0295538220295017,
> 1.9620882623582288, 3.469610374907285, 3.945844553903657, 3.4821105376715167,
> 4.3169082352944885, 2.520329479630485, 3.609372317282444, 3.070375816549757,
> 4.220281399605417, 3.9866665484239117, 3.6165408067610563, 3.788840805093992,
> 4.392131656532076, 4.392131656532076, 2.837281934382379, 3.698984475972131,
> 4.331507034715641, 2.360699334038601, 2.7368842080666815, 3.730733174286711,
> 3.1991566064156816, 4.4238803548466565, 2.4665153268165767,
> 3.175736332207583, 3.2378087608499606, 4.376627469996111, 3.3525177086259226,
> 3.28315658082842, 4.156565585219309, 1.6462639699299098, 2.673278958112109,
> 4.331507034715641, 3.955894889757158, 2.7764631943473397, 3.0497565293470212,
> 1.79060004880832, 3.6237610547345436, 1.6244377066690232, 2.948895919012047,
> 3.175736332207583, 2.850571166501062, 4.073677925413541, 2.725014632511298,
> 3.1573871935393867, 4.562030693327474, 3.5403794457954922, 4.580722826339627,
> 4.580722826339627, 3.189722574182323, 3.1665196771026594, 3.3306589148134234,
> 1.9745451708435238, 3.3306589148134234, 2.795272526304836,
> 3.3415285870503273, 4.407880013500216, 4.4238803548466565,
> 2.6902285164258823, 3.668212817305377, 4.543681554659277, 2.559550192783766,
> 1.5452257206382456, 2.2631561797299637, 4.659194441781121,
> 3.2678110111537597, 3.878185905429842, 3.3525177086259226, 3.374865007317919,
> 3.780330115426083, 4.376627469996111, 3.433020927474993, 3.6758174166905966,
> 4.288334862850433, 3.2378087608499606, 4.490571729345329, 2.9269972337044097,
> 4.029226162842708, 3.0538465145985465, 4.440140875718437, 3.533734903076824,
> 4.659194441781121, 4.659194441781121, 4.525663049156599, 3.706827653433157,
> 3.1172927363375087, 4.490571729345329, 2.552078177945065, 2.087985282971078,
> 4.83744267318744, 4.562030693327474, 4.09666744363824, 4.659194441781121,
> 1.802255192400069, 4.599771021310321, 3.788840805093992, 4.8621352857778115,
> 4.6798137289838575, 4.376627469996111, 3.272900080661231, 3.8970543897342247,
> 4.638991734463602, 4.638991734463602, 4.813345121608379, 4.813345121608379,
> 4.8621352857778115, 4.83744267318744, 3.588170109631841, 4.13217413209515,
> 4.599771021310321, 4.331507034715641, 3.134914337687328, 4.525663049156599,
> 4.722373343402653, 3.955894889757158, 4.967495801435638, 4.580722826339627,
> 4.967495801435638, 4.9134285801653625, 4.887453093762102, 4.407880013500216,
> 4.246949646687578, 2.198385343572182, 1.5963758750107606, 4.007719957621744],
> "alpha_d": 7.150861416624748E-4,
> "terms_ss": ["enron", "2000", "cc", "hpl", "daren",
> "http", "gas", "forwarded", "pm", "ect", "hou", "thanks", "meter", "2001",
> "attached", "deal", "am", "farmer", "your", "nom", "corp", "more", "mmbtu",
> "xls", "here", "j", "let", "volumes", "questions", "www", "2004", "sitara",
> "no", "money", "01", "volume", "know", "best", "meds", "bob", "prescription",
> "please", "online", "file", "viagra", "02", "stop", "me", "nomination", "v",
> "on", "i", "click", "texas", "03", "prices", "for", "paliourg", "php", "09",
> "contract", "fyi", "actuals", "u", "04", "pain", "713", "drugs", "microsoft",
> "email", "robert", "cialis", "melissa", "investment", "teco", "pat", "11",
> "save", "professional", "world", "biz", "flow", "dollars", "noms", "2005",
> "act", "remove", "results", "soft", "xp", "mary", "80", "spam", "following",
> "06", "software", "n", "dealer", "08", "ena", "offer", "sex", "products",
> "special", "compliance", "see", "free", "cheap", "html", "07", "gary", "000",
> "low", "our", "houston", "many", "april", "size", "r", "tap", "lots",
> "product", "pills", "xanax", "vance", "ami", "chokshi", "12", "clynes",
> "ticket", "counterparty", "super", "thousand", "daily", "offers", "weight",
> "05", "all", "call", "photoshop", "julie", "stock", "lisa", "steve",
> "million", "health", "site", "quality", "stocks", "link", "featured", "net",
> "international", "most", "investing", "works", "readers", "uncertainties",
> "differ", "news", "david", "seek", "31", "only", "1933", "creative",
> "windows", "subscribers", "should", "adobe", "security", "1934", "valium",
> "brand", "visit", "action", "canon", "pharmacy", "sexual", "inherent",
> "construed", "assumptions", "internet", "mobile", "risks", "wide", "smith",
> "ex", "pill", "states", "projections", "medications", "predictions",
> "anticipates", "deciding", "events", "advice", "now", "com", "browser"],
> "iteration_i": 100,
> "weights_ds": [0.9524452699893067, -2.9257423290160225,
> -2.122240862520573, -0.40259380863176036, -1.242508927269482,
> -2.1933952666745924, 0.9119553386109202, -1.3359582128074137,
> -1.1717690853817335, -0.9029380383621088, -1.970576222154978,
> -0.9180539343040344, -2.031736167842155, -1.382820037232718,
> -1.4296530557007743, -1.5015080966872794, -0.852373483913152,
> -0.2883706803921614, -0.2366741375717678, 0.2966401203916763,
> -0.6792566685980972, -0.18912751254722837, 0.10265566994945839,
> -1.0065678789783332, -0.8967357570889625, 0.041722607774742765,
> -0.2832721589409925, -0.400560390908784, -0.6945385025086017,
> -0.8488391208665993, -0.31851465800191403, 1.570768257518063,
> -1.5144615060332418, 0.9411280928801138, 0.738478999511349,
> -0.6875177906594712, -0.47841730767672286, -0.20502227184813,
> 0.4858041557455349, 1.389551367014946, -0.8886199496843126,
> 0.8029699876855549, -0.7760217032166719, 0.40175437931353053,
> -0.6231018791954438, 1.0261571991645586, -0.44254206613371744,
> 0.31955072203529183, -0.24171600421157927, -0.632533557090375,
> 0.774533771979748, -1.1164595912116915, -0.2954704188664946,
> 0.27653823698423186, -1.157867306631878, -5.49332153268076E-5,
> 0.6916900118076985, -1.305726586870522, 1.370623007467874,
> 1.1100575515185573, 0.40953153124448194, -0.4273267120664356,
> -0.5536271317082946, -0.03575915648164506, 0.20475308352558616,
> -0.2919021960690356, 1.1094392826383312, -1.24904822249928,
> 1.038764158800864, 0.10525284214114823, 0.1973739189626828,
> -0.33283870614700184, 1.0555375704790861, 0.25856879498650104,
> 0.921918816504445, -0.15711181528461088, -0.3594966291171786,
> -0.6659758614594922, -0.3342439009175488, 0.3592708173532555,
> 0.12872616265365205, 1.362140022970902, -0.2699930594417464,
> 0.7449118829650243, -0.12665949567352622, 1.1289376146405283,
> 0.1653713075673579, 0.7008424353370497, 0.47095485852014707,
> 1.021689093687625, 1.0049928692400525, -0.18114402652386635,
> 0.4403400905532737, 1.0570966104647033, -1.167541821576636,
> -0.4428853975686944, 0.20694894484760668, 0.15472835818468766,
> 1.0009582999260647, 0.013730849275970687, -0.3882888402977611,
> 0.14102499499877702, 1.1560852477692065, -0.822855520787489,
> -0.1468595831916683, 0.9069870716505091, -0.18884872126960675,
> -0.19213990843838719, -0.0032534107278622496, 0.2715800337813452,
> 0.0888346122807297, -0.37031213468904256, -0.07224227291981163,
> 0.08850381657180348, 0.20501283264716516, -0.5852130122059844,
> 0.11807896760332989, -1.3196626232666966, 0.5324969558412787,
> 0.7667504164777665, 0.11805357030082002, 1.0020954114301253,
> -0.10885082229805468, 1.003094962524753, 1.0000914796917044,
> 0.0094959191513861, -0.5127276009526891, 0.059129413669497796,
> -0.49311249434449955, 0.34652229330274653, -0.7618731785587705,
> -0.3514318991274448, 0.7742232232987654, 0.7575763908124484,
> -0.25192129997930635, -0.24220187762559128, 1.0014232005812307,
> -0.3453736248293833, -0.1121687186012911, -0.15547543099631278,
> 1.0840890597241875, -0.2879034857435273, -0.227656977034567,
> -0.3716602841157388, 0.18007113168986144, 0.8297688092273079,
> 1.405797209837956, 0.3921445898278919, 1.079363745455813,
> -0.6253022693091732, 0.33155358331572704, 0.9644709831096733,
> -0.19686285814583682, 1.1069098903214452, -0.19597970694899214,
> -0.29329229099344734, -0.037185151648282316, 1.0010206696926418,
> 1.0096586146138415, 0.9523090849946898, 0.34253175617551923,
> -0.41826608329006, 0.7213729935258942, -0.47416007242000024,
> 0.3210039942978008, 1.0, 0.9772041721907345, 0.2533596337281238,
> 0.9839657417973666, -0.7583308570783015, 0.9476391050914625,
> 0.2534925274818649, 1.0, 1.0001125385832383, 0.37796474985487505,
> 0.3839828352290301, 0.44224405246124543, 1.046072941713049,
> 1.1205405856642119, 0.9165436674154628, 0.9586701268580604,
> 1.0000000000000968, 0.9860828147022696, -0.32499900116244823,
> 1.1624049652694368, 0.4966278258894532, -0.14840111822378488,
> 0.15131204240736265, 1.114787005544689, 1.1782663102351227,
> 0.21291210471466848, 1.0000000000385034, 0.9564718923455356,
> 1.0110628413440756, 1.000156375636503, 0.9763045864950046,
> 0.2630059727829917, 0.24199402427272665, 0.2736018381908099,
> -0.7673296746900424, -0.1899398724099395],
> "field_s": "body",
> "trueNegative_i": 3570,
> "falseNegative_i": 35,
> "falsePositive_i": 75,
> "error_d": 176.8112932306374,
> "truePositive_i": 1381,
> "id": "model_100"
> }
> {code}
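The evaluation counts in the model document above can be turned into the usual summary metrics. The sketch below uses the standard definitions of accuracy, precision, recall and F1; the function name is hypothetical, and only the four `*_i` counts are copied from the model.

```python
def classification_metrics(tp, tn, fp, fn):
    """Standard binary classification metrics from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Counts copied from the "model_100" document above.
model = {"truePositive_i": 1381, "trueNegative_i": 3570,
         "falsePositive_i": 75, "falseNegative_i": 35}
metrics = classification_metrics(model["truePositive_i"],
                                 model["trueNegative_i"],
                                 model["falsePositive_i"],
                                 model["falseNegative_i"])
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Note that these figures describe performance on the training set itself, so they are likely to be optimistic compared with held-out data.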