+1 to merge this once it implements the Document Categorizer; then we can also use those tools to train and evaluate it.
Jörn

On Wed, Jul 5, 2017 at 9:28 AM, Rodrigo Agerri <[email protected]> wrote:
> Hello again,
>
> @Thamme, out of curiosity, do you have evaluation numbers on the
> Stanford Large Movie Review dataset?
>
> Best,
>
> Rodrigo
>
> On Wed, Jul 5, 2017 at 9:25 AM, Rodrigo Agerri <[email protected]> wrote:
>> +1 to Tommaso's comment. This would be very nice to have in the project.
>>
>> R
>>
>> On Wed, Jul 5, 2017 at 9:19 AM, Tommaso Teofili
>> <[email protected]> wrote:
>>> Thanks, Thamme, for bringing this to the list!
>>>
>>> On Wed, Jul 5, 2017 at 03:49, Thamme Gowda <[email protected]> wrote:
>>>
>>>> Hello OpenNLP Devs,
>>>>
>>>> I am working on text classification using word embeddings like
>>>> GloVe/Word2Vec and LSTM networks.
>>>> It will be interesting to see if we can use it as a document categorizer,
>>>> especially for sentiment analysis in OpenNLP.
>>>>
>>>> I have already raised a PR to the sandbox repo -
>>>> https://github.com/apache/opennlp-sandbox/pull/3
>>>>
>>>> This is the first version, and I expect to receive feedback from the Dev
>>>> community to make it work for everyone.
>>>>
>>>> Here are the design choices I have made for the initial version:
>>>>
>>>> - Using pre-trained GloVe vectors - I felt the GloVe vector format is
>>>> clean and easily customizable in terms of dimensions and vocabulary size
>>>> (also, I have been reading a lot about them from the Stanford NLP group).
>>>> - Training GloVe vectors isn't hard either; we can do it using the
>>>> original C library as well as by using DL4J.
>>>> - Using DL4J's multi-layer networks with LSTM instead of reinventing
>>>> this stuff again on the JVM for OpenNLP
>>>>
>>>>
>>>> Please share your feedback here or on the github page
>>>> https://github.com/apache/opennlp-sandbox/pull/3 .
>>>>
>>>>
>>> I think the approach outlined here sounds good; we could
>>> incorporate the PR as soon as it implements the Doccat API.
>>> Then we may see whether and how it makes sense to adjust it to use other
>>> types of embeddings (e.g. paragraph vectors) and / or different network
>>> setups (e.g. more hidden layers, bidirectional LSTM, etc.).
>>>
>>> Looking forward to seeing this move forward,
>>> Regards,
>>> Tommaso
>>>
>>>
>>>>
>>>> Thanks,
>>>> TG
>>>>
>>>>
>>>> --
>>>> *Thamme Gowda *
>>>> @thammegowda <https://twitter.com/thammegowda> |
>>>> http://scf.usc.edu/~tnarayan/
>>>> ~Sent via somebody's Webmail server
>>>>
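
For readers unfamiliar with the GloVe vector format mentioned in the design choices above: the pre-trained files are plain text, one word per line followed by its vector components. The following is only a rough sketch of how such a file could be read and how a tokenized document could be turned into the ordered sequence of vectors that a recurrent network such as an LSTM would consume. It is not the code from the PR and does not use DL4J; the class name GloveSketch, the methods parse and toSequence, and the three-dimensional toy vectors are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not taken from the opennlp-sandbox PR.
public class GloveSketch {

    // Parses GloVe's plain-text layout: each non-empty line holds a word
    // followed by its whitespace-separated vector components.
    public static Map<String, float[]> parse(String text) {
        Map<String, float[]> vectors = new HashMap<>();
        for (String line : text.split("\n")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            String[] parts = trimmed.split("\\s+");
            float[] vector = new float[parts.length - 1];
            for (int i = 1; i < parts.length; i++) {
                vector[i - 1] = Float.parseFloat(parts[i]);
            }
            vectors.put(parts[0], vector);
        }
        return vectors;
    }

    // Maps tokens to their vectors in document order. Tokens without a
    // vector are skipped here; that is an assumption, other strategies
    // (e.g. a dedicated unknown-word vector) are equally possible.
    public static List<float[]> toSequence(String[] tokens, Map<String, float[]> vectors) {
        List<float[]> sequence = new ArrayList<>();
        for (String token : tokens) {
            float[] vector = vectors.get(token);
            if (vector != null) {
                sequence.add(vector);
            }
        }
        return sequence;
    }

    public static void main(String[] args) {
        // Toy vectors invented for this example, not real GloVe values.
        String toy = "good 0.5 0.1 -0.2\nmovie 0.0 0.3 0.4\n";
        Map<String, float[]> vectors = parse(toy);
        List<float[]> sequence = toSequence("a good movie".split(" "), vectors);
        // Prints "2 vectors of dimension 3" ("a" has no vector in the toy data).
        System.out.println(sequence.size() + " vectors of dimension " + sequence.get(0).length);
    }
}
```

Changing the dimension or vocabulary size only changes the number of columns or lines in the file, which is presumably what makes the format easy to customize.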
