Sorry, I had a copy & paste error: I meant "LogisticRegression(..., 
multi_class='multinomial')" and not "LogisticRegression(..., 
multi_class='ovr')".
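
In case it helps, here is a minimal sketch of using it that way (the toy 
data and variable names below are placeholders, not anything from the 
actual project):

import numpy as np
from sklearn.linear_model import LogisticRegression

# toy stand-in: 6 documents already embedded as 5-dimensional vectors,
# each labeled with one of three categories
X = np.random.RandomState(0).randn(6, 5)
y = ['Art', 'Science', 'Society', 'Art', 'Science', 'Society']

# softmax (multinomial) logistic regression: for each input document the
# predicted probabilities over all classes sum to 1
clf = LogisticRegression(multi_class='multinomial', solver='lbfgs')
clf.fit(X, y)

proba = clf.predict_proba(X[:1])   # shape (1, 3), the row sums to 1
print(dict(zip(clf.classes_, proba[0])))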

> On Jun 3, 2018, at 5:19 PM, Sebastian Raschka <m...@sebastianraschka.com> 
> wrote:
> 
> Hi,
> 
>> I quickly read about multinomial regression; is it something you would 
>> recommend I use? Or do you have something else in mind? 
> 
> Multinomial regression (or Softmax Regression) should give you results 
> somewhat similar to a linear SVC (or logistic regression with OvO or OvR). 
> The theoretical difference is that softmax regression assumes that the 
> classes are mutually exclusive, which is probably not the case in your 
> setting, since e.g. an article could be both "Art" and "Science" to some 
> extent. Here is a quick summary of softmax regression, if useful: 
> https://sebastianraschka.com/faq/docs/softmax_regression.html. In 
> scikit-learn, you can use it via LogisticRegression(..., multi_class='ovr').
> 
> However, spontaneously, I would say that Latent Dirichlet Allocation (LDA) 
> could be a better choice in your case. I.e., fit the model on the corpus for 
> a specified number of topics (e.g., 10, but this depends on your dataset, so 
> I would experiment a bit here), look at the top words in each topic, and 
> assign a topic label to each topic. Then, for a given article, you can 
> report, e.g., the top X labeled topics.
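> 
> A rough sketch of what I mean, e.g. with scikit-learn's 
> LatentDirichletAllocation (the tiny corpus and the number of topics below 
> are placeholders; you would use your own corpus and experiment with 
> n_components):
> 
> from sklearn.feature_extraction.text import CountVectorizer
> from sklearn.decomposition import LatentDirichletAllocation
> 
> corpus = ["first wikipedia article text ...", "second article text ..."]
> 
> vectorizer = CountVectorizer(stop_words='english')
> X = vectorizer.fit_transform(corpus)
> 
> lda = LatentDirichletAllocation(n_components=10, random_state=0)
> lda.fit(X)
> 
> # look at the top words per topic to decide on a human-readable label
> words = vectorizer.get_feature_names()
> for topic_idx, topic in enumerate(lda.components_):
>     print(topic_idx, [words[i] for i in topic.argsort()[-10:][::-1]])
> 
> # for a new article, get its distribution over the (now labeled) topics;
> # each row of the result sums to 1
> doc_topics = lda.transform(vectorizer.transform(["new article text ..."]))
> print(doc_topics)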
> 
> Best,
> Sebastian
> 
> 
> 
> 
>> On Jun 3, 2018, at 5:03 PM, Amirouche Boubekki 
>> <amirouche.boube...@gmail.com> wrote:
>> 
>> Héllo,
>> 
>> I started a natural language processing project a few weeks ago called 
>> wikimark (the code is all in wikimark.py).
>> 
>> Given a text, it returns a dictionary scoring the input against the 
>> Wikipedia vital articles categories, e.g.:
>> 
>> out = wikimark("""Peter Hintjens wrote about the relation between technology 
>> and culture. Without using the scientific tone of a state-of-the-art review 
>> of Anthropocene anthropology, he gives a fair amount of food for thought. 
>> According to Hintjens, technology is doomed to become cheap. As a matter of 
>> fact, intelligence tools will become more and more accessible, which will 
>> trigger a revolution to rebalance forces in society.""") 
>> 
>> for category, score in out.items(): 
>>     print('{} ~ {}'.format(category, score))
>> 
>> The above program would output something like this:
>> 
>> Art ~ 0.1 
>> Science ~ 0.5 
>> Society ~ 0.4
>> 
>> Except that not everything went as planned. Note that in the above example 
>> the scores sum to 1, but I could not achieve that at all.
>> 
>> I am using gensim to compute vectors of paragraphs (doc2vec) and then submit 
>> those vectors to svm.SVR in a one-vs-all strategy, i.e. a document is scored 
>> 1 if it is in that subcategory and 0 otherwise. At prediction time, a new 
>> document goes through the same doc2vec pipeline. The program scores each 
>> paragraph against the SVR models of the Wikipedia vital article 
>> subcategories and gets a value between 0 and 1 for each paragraph. I compute 
>> the sum, group by subcategory, and then I have a score per category for the 
>> input document.
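>> 
>> Roughly, the training part looks like this (a simplified sketch; the tiny 
>> corpus, the subcategory names and the hyperparameters are placeholders, 
>> not the real wikimark code):
>> 
>> from gensim.models.doc2vec import Doc2Vec, TaggedDocument
>> from sklearn.svm import SVR
>> 
>> # placeholder corpus: (tokenized paragraph, subcategory) pairs
>> paragraphs = [
>>     (['technology', 'is', 'doomed', 'to', 'become', 'cheap'], 'Society'),
>>     (['a', 'state', 'of', 'the', 'art', 'review'], 'Science'),
>> ]
>> 
>> tagged = [TaggedDocument(words, [i]) for i, (words, _) in enumerate(paragraphs)]
>> d2v = Doc2Vec(tagged, vector_size=100, min_count=1, epochs=20)
>> vectors = [d2v.infer_vector(words) for words, _ in paragraphs]
>> 
>> # one SVR per subcategory: the target is 1 if the paragraph belongs to
>> # that subcategory, 0 otherwise
>> models = {}
>> for sub in ('Society', 'Science'):
>>     y = [1.0 if label == sub else 0.0 for _, label in paragraphs]
>>     models[sub] = SVR().fit(vectors, y)
>> 
>> # prediction: the new paragraph goes through the same doc2vec pipeline
>> new_vec = d2v.infer_vector(['intelligence', 'tools', 'will', 'become', 'accessible'])
>> scores = {sub: model.predict([new_vec])[0] for sub, model in models.items()}
>> print(scores)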
>> 
>> It somewhat works. I made a web UI; you can find it at 
>> https://sensimark.com and test it there. You can also directly access the 
>> full API, e.g.: 
>> https://sensimark.com/api/v0?url=http://www.multivax.com/last_question.html&all=1
>> 
>> The output JSON document is a list of category dictionaries where the 
>> prediction key is associated with the average of the "predictions" of the 
>> subcategories. If you replace &all=1 with e.g. &top=5, you might get 
>> different top categories, e.g.: 
>> https://sensimark.com/api/v0?url=http://www.multivax.com/last_question.html&top=10
>> 
>> or 
>> 
>> https://sensimark.com/api/v0?url=http://www.multivax.com/last_question.html&top=5
>> 
>> I wrote "prediction" with double quotes because the value you see is the 
>> result of a formula. Since the predictions I get are rather small (between 
>> 0 and 0.015), I apply the following:
>> 
>> import math
>> 
>> value = math.exp(prediction)
>> # rescale so the values spread over a wider range
>> magic = ((value * 100) - 110) * 100
>> 
>> in order to spread the values between -200 and 200. Maybe this is a symptom 
>> that my model doesn't work at all. 
>> 
>> Still, the top 10 results are almost always close to each other (try with 
>> BBC articles on https://sensimark.com). It is only when a regression model 
>> is disqualified with a score of 0 that the results are easy to interpret. 
>> Sadly, I don't have an example at hand to support that claim, so you will 
>> have to take my word for it.
>> 
>> Looking at the machine learning map, I just figured that my problem might be 
>> a classification problem, except that I don't really want to know the class 
>> of new documents; I want to know which subjects are dealt with in the 
>> document, based on a hierarchical corpus. I don't want to guess a hierarchy! 
>> I want to know how the document's content is spread over the different 
>> categories or subcategories.
>> 
>> I quickly read about multinomial regression; is it something you would 
>> recommend I use? Or do you have something else in mind? 
>> 
>> Also, it seems I should benchmark / evaluate my model against LDA.
>> 
>> I am rather a noob in terms of data science and my math skills are not so 
>> fresh. I am mostly looking for ideas about which algorithms, what fine 
>> tuning, and which data science practices to follow, without having to write 
>> my own algorithm.
>> 
>> Thanks in advance!
> 

_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn
