I made a rendering of the result online https://sensimark.com/
On Sun, Jun 3, 2018 at 11:22 PM, Sebastian Raschka <m...@sebastianraschka.com> wrote:

> sorry, I had a copy & paste error, I meant "LogisticRegression(...,
> multi_class='multinomial')" and not "LogisticRegression(..., multi_class='ovr')"
>
> > On Jun 3, 2018, at 5:19 PM, Sebastian Raschka <m...@sebastianraschka.com> wrote:
> >
> > Hi,
> >
> >> I quickly read about multinomial regression; is it something you recommend
> >> I use? Maybe you are thinking of something else?
> >
> > Multinomial regression (or softmax regression) should give you results
> > somewhat similar to a linear SVC (or logistic regression with OvO or OvR).
> > The theoretical difference is that softmax regression assumes that the
> > classes are mutually exclusive, which is probably not the case in your
> > setting, since e.g. an article could be both "Art" and "Science" to some
> > extent. Here is a quick summary of softmax regression, if useful:
> > https://sebastianraschka.com/faq/docs/softmax_regression.html. In
> > scikit-learn, you can use it via LogisticRegression(..., multi_class='ovr').
> >
> > However, spontaneously, I would say that Latent Dirichlet Allocation could
> > be a better choice in your case. I.e., fit the model on the corpus for a
> > specified number of topics (e.g. 10, but it depends on your dataset; I
> > would experiment a bit here), look at the top words in each topic, and then
> > assign a topic label to each topic. Then, for a given article, you can
> > assign e.g. the top X labeled topics.
> >
> > Best,
> > Sebastian
> >
> >> On Jun 3, 2018, at 5:03 PM, Amirouche Boubekki <amirouche.boube...@gmail.com> wrote:
> >>
> >> Hello,
> >>
> >> I started a natural language processing project a few weeks ago called
> >> wikimark (the code is all in wikimark.py).
> >>
> >> Given a text, it should return a dictionary scoring the input against
> >> vital-articles categories, e.g.:
> >>
> >> out = wikimark("""Peter Hintjens wrote about the relation between
> >> technology and culture. Without using the scientific tone of a
> >> state-of-the-art review of Anthropocene anthropology, he gives a fair
> >> amount of food for thought. According to Hintjens, technology is doomed
> >> to become cheap. As a matter of fact, intelligence tools will become more
> >> and more accessible, which will trigger a revolution to rebalance forces
> >> in society.""")
> >>
> >> for category, score in out:
> >>     print('{} ~ {}'.format(category, score))
> >>
> >> The above program would output something like this:
> >>
> >> Art ~ 0.1
> >> Science ~ 0.5
> >> Society ~ 0.4
> >>
> >> Except not everything went as planned. Mind that in the above example the
> >> total equals 1, but I could not achieve that at all.
> >>
> >> I am using gensim to compute vectors of paragraphs (doc2vec) and then
> >> submit those vectors to svm.SVR in a one-vs-all strategy, i.e. a document
> >> is scored 1 if it's in that subcategory and zero otherwise. At prediction
> >> time, it goes through the same doc2vec pipeline. The computer scores each
> >> paragraph against the SVR models of the Wikipedia vital-article
> >> subcategories and gets a value between 0 and 1 per paragraph. I compute
> >> the sum, group by subcategory, and then I have a score per category for
> >> the input document.
> >>
> >> It somewhat works. I put a web UI online at https://sensimark.com where
> >> you can test it. You can directly access the full API, e.g.
> >> https://sensimark.com/api/v0?url=http://www.multivax.com/last_question.html&all=1
> >>
> >> The output JSON document is a list of category dictionaries where the
> >> prediction key is associated with the average of the "prediction" of the
> >> subcategories. If you replace &all=1 by &top=5 you might get something
> >> else as top categories, e.g.
> >> https://sensimark.com/api/v0?url=http://www.multivax.com/last_question.html&top=10
> >>
> >> or
> >>
> >> https://sensimark.com/api/v0?url=http://www.multivax.com/last_question.html&top=5
> >>
> >> I wrote "prediction" with double quotes because the value you see is the
> >> result of some formula. Since the predictions I get are rather small,
> >> between 0 and 0.015, I apply the following formula:
> >>
> >> value = math.exp(prediction)
> >> magic = ((value * 100) - 110) * 100
> >>
> >> in order to spread the values between -200 and 200. Maybe this is a
> >> symptom that my model doesn't work at all.
> >>
> >> Still, the top 10 results are almost always near each other (try with BBC
> >> articles on https://sensimark.com). It is only when a regression model is
> >> disqualified with a score of 0 that the results are simple to understand.
> >> Sadly, I don't have an example at hand to support that claim; you have to
> >> believe me.
> >>
> >> I just figured, looking at the machine learning map, that my problem
> >> might be a classification problem, except I don't really want to know
> >> what the class of a new document is; I want to know what different
> >> subjects are dealt with in the document, based on a hierarchical corpus.
> >> I don't want to guess a hierarchy! I want to know how the document
> >> content spreads over the different categories or subcategories.
> >>
> >> I quickly read about multinomial regression; is it something you
> >> recommend I use? Maybe you are thinking of something else?
> >>
> >> Also, it seems I should benchmark / evaluate my model against LDA.
> >>
> >> I am rather a noob in terms of data science and my math skills are not so
> >> fresh. I am mostly looking for ideas on which algorithms, fine tuning,
> >> and data-science practice to follow that don't involve writing my own
> >> algorithm.
> >>
> >> Thanks in advance!
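[Editor's note] A minimal sketch of the multinomial (softmax) logistic regression Sebastian refers to. The 2-D feature vectors and labels here are invented for illustration; wikimark would feed in doc2vec vectors instead. With the lbfgs solver, recent scikit-learn fits the multinomial formulation by default (older versions needed multi_class='multinomial' explicitly):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy 2-D vectors standing in for document embeddings (invented data).
X = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9],
              [0.2, 0.8], [0.5, 0.5], [0.4, 0.6]])
y = np.array(["Art", "Art", "Science", "Science", "Society", "Society"])

# With solver='lbfgs', scikit-learn uses the multinomial (softmax) loss,
# i.e. the model Sebastian's correction points at. A single softmax over
# all classes makes each predict_proba row sum to exactly 1 -- the
# "total equals 1" property the original post could not achieve.
clf = LogisticRegression(solver='lbfgs')
clf.fit(X, y)

proba = clf.predict_proba([[0.6, 0.4]])[0]
for category, score in zip(clf.classes_, proba):
    print('{} ~ {:.2f}'.format(category, score))
```

Note that this assumes mutually exclusive classes, which, as pointed out above, may not match a corpus where an article is both "Art" and "Science".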
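[Editor's note] A rough sketch of the LDA route Sebastian suggests, using scikit-learn's LatentDirichletAllocation. The six-document corpus and the choice of 3 topics are invented for illustration; in practice one would fit on the vital-articles corpus and experiment with the topic count. A nice property for this use case: the per-document topic mixture returned by transform() sums to 1:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Tiny invented corpus; real use would fit on the vital-articles corpus.
corpus = [
    "painting sculpture museum artist color",
    "physics experiment theory particle energy",
    "society culture politics community people",
    "artist museum exhibition painting",
    "energy particle physics quantum",
    "politics people community law society",
]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(corpus)

lda = LatentDirichletAllocation(n_components=3, random_state=0)
doc_topics = lda.fit_transform(counts)  # rows are topic mixtures

# Top words per topic, to hand-label each topic as suggested above.
try:
    vocab = vectorizer.get_feature_names_out()   # scikit-learn >= 1.0
except AttributeError:
    vocab = vectorizer.get_feature_names()       # older versions
for k, topic in enumerate(lda.components_):
    top = [vocab[i] for i in topic.argsort()[-3:][::-1]]
    print('topic {}: {}'.format(k, ', '.join(top)))

# A new document's topic mixture is a proper distribution (sums to 1).
mixture = lda.transform(vectorizer.transform(["quantum particle experiment"]))[0]
print(mixture)
```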
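[Editor's note] The one-vs-all SVR scoring described in the original post can be sketched as below. The random 20-d vectors stand in for the doc2vec paragraph embeddings (wikimark's actual pipeline is not reproduced here), and the final softmax normalization is one possible replacement, not part of the original code, for the exp-based "magic" formula: it makes the per-category scores directly comparable and sum to 1:

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.RandomState(0)
categories = ["Art", "Science", "Society"]

# Random vectors standing in for doc2vec paragraph embeddings (invented).
X_train = rng.randn(60, 20)

# One SVR per category, trained one-vs-all: target is 1 for paragraphs
# inside the category and 0 otherwise, as described in the post.
models = {}
for i, category in enumerate(categories):
    y = (np.arange(60) % 3 == i).astype(float)
    models[category] = SVR().fit(X_train, y)

# Score a new "paragraph" against every category model...
paragraph = rng.randn(1, 20)
raw = np.array([models[c].predict(paragraph)[0] for c in categories])

# ...then softmax-normalize instead of the ((exp(p)*100)-110)*100 formula,
# so the scores form a distribution over categories.
scores = np.exp(raw) / np.exp(raw).sum()
for category, score in zip(categories, scores):
    print('{} ~ {:.2f}'.format(category, score))
```

Whether softmax is the right normalization depends on whether the raw SVR outputs are comparable across category models; calibrating each model first may be needed.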
_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn