Sorry, I had a copy & paste error: I meant "LogisticRegression(..., multi_class='multinomial')" and not "LogisticRegression(..., multi_class='ovr')".
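For completeness, a minimal sketch of the multinomial (softmax) variant on synthetic data; note that recent scikit-learn versions fit the multinomial formulation by default for multiclass problems, and the multi_class parameter itself has been deprecated, so the sketch relies on the default:

```python
# Minimal sketch: softmax (multinomial) logistic regression on synthetic data.
# Recent scikit-learn releases use the multinomial formulation by default for
# multiclass problems, so multi_class='multinomial' is implied here (the
# parameter is deprecated in newer versions).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X = rng.randn(300, 5)            # 300 documents-as-vectors, 5 features each
y = rng.randint(0, 3, size=300)  # 3 mutually exclusive classes

clf = LogisticRegression(max_iter=1000).fit(X, y)

# predict_proba returns one probability per class and each row sums to 1 --
# the "scores sum to 1" behavior, but only because softmax assumes the
# classes are mutually exclusive.
proba = clf.predict_proba(X[:2])
print(proba.sum(axis=1))         # each entry is 1.0
```

The summing-to-1 property is exactly what softmax enforces, which is why it is a poor fit when an article can belong to several categories at once.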
> On Jun 3, 2018, at 5:19 PM, Sebastian Raschka <m...@sebastianraschka.com> wrote:
>
> Hi,
>
>> I quickly read about multinomial regression; is it something you recommend
>> I use? Maybe you are thinking of something else?
>
> Multinomial regression (or softmax regression) should give you results
> somewhat similar to a linear SVC (or logistic regression with OvO or OvR).
> The theoretical difference is that softmax regression assumes that the
> classes are mutually exclusive, which is probably not the case in your
> setting, since e.g. an article could be both "Art" and "Science" to some
> extent. Here is a quick summary of softmax regression, if useful:
> https://sebastianraschka.com/faq/docs/softmax_regression.html. In
> scikit-learn, you can use it via LogisticRegression(..., multi_class='ovr').
>
> However, spontaneously, I would say that Latent Dirichlet Allocation could
> be a better choice in your case. I.e., fit the model on the corpus for a
> specified number of topics (e.g. 10, but it depends on your dataset; I
> would experiment a bit here), look at the top words in each topic, and then
> assign a label to each topic. Then, for a given article, you can assign
> e.g. the top X labeled topics.
>
> Best,
> Sebastian
>
>> On Jun 3, 2018, at 5:03 PM, Amirouche Boubekki
>> <amirouche.boube...@gmail.com> wrote:
>>
>> Hello,
>>
>> I started a natural language processing project a few weeks ago called
>> wikimark (the code is all in wikimark.py).
>>
>> Given a text, it returns a dictionary scoring the input against the vital
>> article categories, e.g.:
>>
>> out = wikimark("""Peter Hintjens wrote about the relation between
>> technology and culture. Without using the scientific tone of a
>> state-of-the-art review of anthropocene anthropology, he gives a fair
>> amount of food for thought. According to Hintjens, technology is doomed
>> to become cheap. As a matter of fact, intelligence tools will become more
>> and more accessible, which will trigger a revolution to rebalance forces
>> in society.""")
>>
>> for category, score in out:
>>     print('{} ~ {}'.format(category, score))
>>
>> The above program would output something like this:
>>
>> Art ~ 0.1
>> Science ~ 0.5
>> Society ~ 0.4
>>
>> Except not everything went as planned. Mind the fact that in the above
>> example the total is equal to 1, but I could not achieve that at all.
>>
>> I am using gensim to compute vectors of paragraphs (doc2vec) and then
>> submit those vectors to svm.SVR in a one-vs-all strategy, i.e. a document
>> is scored 1 if it's in that subcategory and zero otherwise. At prediction
>> time, a document goes through the same doc2vec pipeline. The program
>> scores each paragraph against the SVR models of the Wikipedia vital
>> article subcategories and gets a value between 0 and 1 for each
>> paragraph. I compute the sum grouped by subcategory, and then I have a
>> score per category for the input document.
>>
>> It somewhat works. I put a web UI online; you can find it at
>> https://sensimark.com where you can test it. You can directly access the
>> full API, e.g.:
>> https://sensimark.com/api/v0?url=http://www.multivax.com/last_question.html&all=1
>>
>> The output JSON document is a list of category dictionaries where the
>> prediction key is associated with the average of the "prediction" of the
>> subcategories. If you replace &all=1 by &top=5 you might get something
>> else as top categories, e.g.:
>> https://sensimark.com/api/v0?url=http://www.multivax.com/last_question.html&top=10
>>
>> or
>>
>> https://sensimark.com/api/v0?url=http://www.multivax.com/last_question.html&top=5
>>
>> I wrote "prediction" with double quotes because the value you see is the
>> result of some formula. Since the predictions I get are rather small,
>> between 0 and 0.015, I apply the following formula:
>>
>> value = math.exp(prediction)
>> magic = ((value * 100) - 110) * 100
>>
>> in order to have the values spread between -200 and 200. Maybe this is a
>> symptom that my model doesn't work at all.
>>
>> Still, the top 10 results are almost always near each other (try with BBC
>> articles on https://sensimark.com). It is only when a regression model is
>> disqualified with a score of 0 that the results are simple to understand.
>> Sadly, I don't have an example at hand to support that claim. You have to
>> believe me.
>>
>> I just figured, looking at the machine learning map, that my problem
>> might be a classification problem, except I don't really want to know the
>> class of a new document; I want to know what the different subjects dealt
>> with in the document are, based on a hierarchical corpus.
>> I don't want to guess a hierarchy! I want to know how the document
>> content spreads over the different categories or subcategories.
>>
>> I quickly read about multinomial regression; is it something you
>> recommend I use? Maybe you are thinking of something else?
>>
>> Also, it seems I should benchmark / evaluate my model against LDA.
>>
>> I am rather a noob in terms of data science and my math skills are not so
>> fresh. I am more likely looking for ideas on what algorithm, fine tuning,
>> and data science practice I should follow that doesn't involve writing my
>> own algorithm.
>>
>> Thanks in advance!
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn@python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn