Héllo,

I started a natural language processing project a few weeks ago called wikimark <https://github.com/amirouche/wikimark/> (the code is all in wikimark.py <https://github.com/amirouche/wikimark/blob/master/wikimark.py#L1>).

Given a text, it is meant to return a dictionary scoring the input against the vital articles categories <https://en.wikipedia.org/api/rest_v1/page/html/Wikipedia%3AVital_articles%2FLevel%2F5>, e.g.:

    out = wikimark("""Peter Hintjens wrote about the relation between
    technology and culture. Without using a scientifical tone of
    state-of-the-art review of the anthroposcene antropology, he gives a
    fair amount of food for thought. According to Hintjens, technology is
    doomed to become cheap. As matter of fact, intelligence tools will
    become more and more accessible which will trigger a revolution to
    rebalance forces in society.""")

    # out is a dictionary, so iterate over its (category, score) pairs
    for category, score in out.items():
        print('{} ~ {}'.format(category, score))

The above program would output something like this:

    Art ~ 0.1
    Science ~ 0.5
    Society ~ 0.4

Except not everything went as planned. Note that in the example above the scores sum to 1, but I could not achieve that at all.

I am using gensim to compute paragraph vectors (doc2vec) and then feed those vectors to svm.SVR in a one-vs-all strategy, i.e. a document is scored 1 if it is in that subcategory and 0 otherwise.

At prediction time, the input goes through the same doc2vec pipeline. *Each paragraph* is scored against the SVR models of the Wikipedia vital article subcategories, which gives a value between 0 and 1 for *each paragraph*. I sum those values, group them by subcategory, and end up with a score per category for the input document.

It somewhat works. I made a web UI that you can test at https://sensimark.com. You can also access the full API directly, e.g.:

https://sensimark.com/api/v0?url=http://www.multivax.com/last_question.html&all=1

The output JSON document is a list of category dictionaries where the prediction key is associated with the average of the "prediction" of the subcategories. If you replace &all=1 by &top=5 you might get something else as top categories, e.g.

https://sensimark.com/api/v0?url=http://www.multivax.com/last_question.html&top=10 <https://sensimark.com/api/v0?url=http://www.multivax.com/last_question.html&all=1>

or

https://sensimark.com/api/v0?url=http://www.multivax.com/last_question.html&top=5

I wrote "prediction" in double quotes because the value you see is the result of a formula. Since the predictions I get are rather small, between 0 and 0.015, I apply the following formula:

    value = math.exp(prediction)
    magic = ((value * 100) - 110) * 100

in order to spread the values between -200 and 200.

Maybe this is a symptom that my model doesn't work at all. Still, the top 10 results are almost always close to each other (try with BBC <http://www.bbc.com/> articles on https://sensimark.com). It is only when a regression model is disqualified with a score of 0 that the results are simple to understand. Sadly, I don't have an example at hand to support that claim. You have to believe me.

I just figured, looking at the machine learning map <http://scikit-learn.org/stable/tutorial/machine_learning_map/>, that my problem might be a classification problem, except I don't really want to know what *the* class of a new document is. I want to know which subjects are dealt with in the document, based on a hierarchical corpus. I don't want to guess a hierarchy! I want to know how the document content spreads over the different categories or subcategories.

I quickly read about multinomial regression. Is it something you would recommend I use? Maybe you have something else in mind?

Also, it seems I should benchmark / evaluate my model against LDA.

I am rather a noob in terms of data science and my math skills are not so fresh. I am mostly looking for ideas on which algorithm, fine-tuning and data science practices I should follow that don't involve writing my own algorithm.

Thanks in advance!

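P.S. To make the aggregation step concrete, here is a minimal sketch of summing the per-paragraph scores, grouping them by category, and then dividing by the grand total so that the result adds up to 1. It is not the code from wikimark.py. The function name aggregate_scores, the input layout (one dict of category to score per paragraph), the sample numbers, and the choice to clip negative scores to 0 are all assumptions made for illustration.

```python
from collections import defaultdict


def aggregate_scores(paragraph_scores):
    """Sum per-paragraph scores by category and normalize them to sum to 1.

    paragraph_scores is a list with one dict per paragraph, each mapping a
    category name to the raw score a regressor gave that paragraph
    (hypothetical layout, not taken from wikimark.py).
    """
    totals = defaultdict(float)
    for scores in paragraph_scores:
        for category, score in scores.items():
            # Assumption: a negative raw score counts as "no evidence".
            totals[category] += max(score, 0.0)

    grand_total = sum(totals.values())
    if grand_total == 0:
        # No category received any positive score; avoid dividing by zero.
        return {category: 0.0 for category in totals}
    return {category: total / grand_total for category, total in totals.items()}


# Made-up scores for a two-paragraph document.
paragraphs = [
    {"Science": 0.6, "Art": 0.2},
    {"Science": 0.4, "Society": 0.8},
]
result = aggregate_scores(paragraphs)

for category, score in result.items():
    print('{} ~ {:.1f}'.format(category, score))
```

Dividing by the grand total only rescales the scores. It makes them comparable across documents, but it does not turn them into calibrated probabilities.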
_______________________________________________ scikit-learn mailing list scikit-learn@python.org https://mail.python.org/mailman/listinfo/scikit-learn