Skip, just in case you are interested in a very, very small part of NLP implemented in J, you may take a look at my latest article/paper in the Journal of J:

http://www.journalofj.com/index.php/vol-5-no-1-august-2017

It's about word vectorization, and the implementation of the Term-Frequency Inverse-Document-Frequency (TF-IDF) model to calculate the cosine similarity between two documents.
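
To give a rough idea of the shape of that computation in J, here is a toy sketch (the three-document corpus and the verb names are made up for illustration; it is not the code from the paper):

   d1 =: ;: 'the cat sat on the mat'
   d2 =: ;: 'the dog sat on the log'
   d3 =: ;: 'cats nap on the mat'
   docs  =: (<d1) , (<d2) , <d3              NB. corpus of three boxed word lists
   vocab =: ~. ; docs                        NB. vocabulary: all distinct words

   tf    =: 3 : '((i. # vocab) +/"1@:(=/) vocab i. > y) % # > y'   NB. relative term frequency of one boxed document
   df    =: +/ (3 : 'vocab e. > y')"0 docs   NB. how many documents contain each word
   idf   =: ^. (# docs) % df                 NB. inverse document frequency (natural log)
   tfidf =: 3 : 'idf * tf y'                 NB. TF-IDF vector for one boxed document

   cos   =: 4 : '(+/ x * y) % (%: +/ *: x) * %: +/ *: y'   NB. cosine similarity of two vectors
   (tfidf < d1) cos tfidf < d2               NB. similarity of the first two toy documents

Words that occur in every document get an IDF of zero, so only the more distinctive terms contribute to the similarity.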

Best,
Martin

> Date: Wed, 8 Nov 2017 02:41:05 -0600
> From: "'Skip Cave' via Programming" <[email protected]>
> To: "[email protected]" <[email protected]>
> Subject: [Jprogramming] Natural Language Processing
> Message-ID: <caj8lg_fs1jfrnlr70nab1cq5c3u6yaf6iwervcxzuxyuhhs...@mail.gmail.com>
> Content-Type: text/plain; charset="UTF-8"
>
> Natural Language Processing is one of the hottest fields in programming today. Recent machine learning and neural network advances have made significant improvements in all aspects of NLP. Speech Recognition, Speech Synthesis, Knowledge Extraction, and Natural Language Understanding have all improved dramatically, just within the last few years.
>
> Conversational AI devices like Amazon's Echo (Alexa) and Google Home are showing up in homes everywhere. Conversational software applications such as Google's Assistant (Android), Microsoft's Cortana (Windows), and Apple's Siri (iOS) are on every phone and PC.
>
> There are lots of open-source NLP toolkits available to help one build these conversational apps. They are written in various languages:
>
> - Natural Language Toolkit (Python) - http://www.nltk.org/ and https://github.com/nltk
> - The Stanford NLP Group (Java) - https://nlp.stanford.edu/software/ and https://stanfordnlp.github.io/CoreNLP/
> - Apache OpenNLP - http://opennlp.apache.org/
> - CRAN NLP (in R) - https://cran.r-project.org/web/packages/NLP/index.html
>
> Two of the newest algorithms used for extracting meaning from text are word2vec and doc2vec (doc2vec is also called Paragraph Vectors). Both of these algorithms use a technique called "word embeddings" to encode individual words. This is particularly interesting because the algorithms are able to extract significant information from unstructured text by simply analyzing word sequences and the probabilistic relationships between neighboring words in any text.
>
> NLP processes are by nature highly parallel, array-oriented processes, dealing with strings and arrays of words. Word2vec and doc2vec are typical in this regard. Both of these algorithms encode words (word2vec) and sentences (doc2vec) in a multi-dimensional space (usually 100-200 dimensions), where machine learning techniques will then cause similar words and concepts to gravitate into clusters in that space.
>
> Here's an overview of how word2vec extracts meaning from text:
> The amazing power of word vectors | the morning paper
> https://blog.acolyer.org/2016/04/21/the-amazing-power-of-word-vectors/
>
> Python seems to be the popular language for coding these algorithms, though it is not particularly noted for its array-handling properties. It would seem that array-oriented languages such as J would be more suited to implementing word2vec and doc2vec.
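
A quick aside on the clustering point above: once a set of word vectors exists, exploring the neighborhood of a word is a natural fit for J. The sketch below is illustrative only; the vectors are random stand-ins so the lines run, the names (emb, nrm, cosmat) are made up, and training real vectors is of course the hard part of word2vec.

   words  =: ;: 'cat dog mat king queen'
   emb    =: <: +: ? 5 100 $ 0                 NB. stand-in for 5 words embedded in 100 dimensions
   nrm    =: %: +/"1 *: emb                    NB. Euclidean norm of each word vector
   cosmat =: (emb +/ .* |: emb) % nrm */ nrm   NB. cosine similarity of every pair of words
   words \: cosmat {~ words i. <'cat'          NB. vocabulary ranked by similarity to 'cat'

With vectors from a trained model in place of the random ones, the last line would list the words that sit closest to 'cat' in the embedding space.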

> Here are implementations of both algorithms in Python:
>
> - Word2vec (Python) - https://radimrehurek.com/gensim/models/word2vec.html and https://github.com/danielfrg/word2vec
> - Doc2Vec (Paragraph Vectors) (Python) - https://github.com/jhlau/doc2vec
>
> How hard would it be to implement these two algorithms in J? I don't know Python, so I can't judge the complexity.
>
> Skip

----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm
