Natural Language Processing is one of the hottest fields in programming today. Recent advances in machine learning and neural networks have brought significant improvements to all aspects of NLP. Speech recognition, speech synthesis, knowledge extraction, and natural language understanding have all improved dramatically, just within the last few years.

Conversational AI devices like Amazon's Echo (Alexa) and Google Home are showing up in homes everywhere, and conversational software applications such as Google's Assistant (Android), Microsoft's Cortana (Windows), and Apple's Siri (iOS) are on every phone and PC. There are many open-source NLP toolkits available to help build these conversational apps, written in various languages:

- Natural Language Toolkit (Python) - http://www.nltk.org/ and https://github.com/nltk
- The Stanford NLP Group (Java) - https://nlp.stanford.edu/software/ and https://stanfordnlp.github.io/CoreNLP/
- Apache OpenNLP (Java) - http://opennlp.apache.org/
- CRAN NLP (R) - https://cran.r-project.org/web/packages/NLP/index.html

Two of the newest algorithms for extracting meaning from text are word2vec and doc2vec (doc2vec is also called Paragraph Vectors). Both use a technique called "word embeddings" to encode individual words. This is particularly interesting because the algorithms can extract significant information from unstructured text simply by analyzing word sequences and the probabilistic relationships between neighboring words.

NLP processes are by nature highly parallel, array-oriented computations, dealing with strings and arrays of words, and word2vec and doc2vec are typical in this regard. Both algorithms embed words (word2vec) or sentences and documents (doc2vec) in a multi-dimensional space (usually 100-200 dimensions), where training causes similar words and concepts to gravitate into clusters. Here's an overview of how word2vec extracts meaning from text:

The amazing power of word vectors | the morning paper
https://blog.acolyer.org/2016/04/21/the-amazing-power-of-word-vectors/
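To make that concrete, here is a minimal sketch in Python using the gensim library (one of the implementations linked below). The toy corpus and hyperparameters are mine, purely for illustration, and the calls follow the current gensim API:

    from gensim.models import Word2Vec
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    # Toy corpus: each document is a pre-tokenized list of words.
    corpus = [
        "the king rules the kingdom".split(),
        "the queen rules the kingdom".split(),
        "the man walks the dog".split(),
        "the woman walks the dog".split(),
    ]

    # word2vec: embed each word as a point in a multi-dimensional space
    # (100 dimensions here, in line with the 100-200 mentioned above).
    w2v = Word2Vec(sentences=corpus, vector_size=100, window=2,
                   min_count=1, epochs=200)

    # Words that occur in similar contexts end up near each other,
    # so nearest-neighbor queries become meaningful.
    print(w2v.wv.most_similar("king", topn=3))

    # doc2vec (Paragraph Vectors): embed whole documents the same way.
    tagged = [TaggedDocument(words=doc, tags=[i])
              for i, doc in enumerate(corpus)]
    d2v = Doc2Vec(tagged, vector_size=100, min_count=1, epochs=200)
    vec = d2v.infer_vector("the queen walks the dog".split())
    print(d2v.dv.most_similar([vec], topn=2))

On a four-sentence toy corpus the similarities are noise, of course; the clustering behavior only emerges after training on a large corpus.

As for what the training itself looks like, here is a rough sketch of a single skip-gram/negative-sampling update, the computational core of word2vec, written with NumPy. The names and dimensions are hypothetical, but it shows that the algorithm reduces almost entirely to array arithmetic:

    import numpy as np

    rng = np.random.default_rng(0)
    V, D = 1000, 100                     # vocabulary size, embedding dimension
    W_in = rng.normal(0, 0.1, (V, D))    # "input" word vectors
    W_out = rng.normal(0, 0.1, (V, D))   # "output" context vectors

    def sgns_step(center, context, negatives, lr=0.025):
        """Nudge vectors so the (center, context) pair scores high
        and the (center, negative) pairs score low."""
        v = W_in[center]                           # (D,)
        rows = np.concatenate(([context], negatives))
        labels = np.zeros(len(rows))
        labels[0] = 1.0                            # first row is the true context
        u = W_out[rows]                            # (k+1, D)
        scores = 1 / (1 + np.exp(-(u @ v)))        # sigmoid of dot products
        g = scores - labels                        # per-pair gradient factor
        W_in[center] -= lr * (g @ u)               # update the center vector
        W_out[rows] -= lr * np.outer(g, v)         # update context/negatives

    sgns_step(center=3, context=17, negatives=rng.integers(0, V, size=5))

Everything in that update is a dot product, an outer product, or an indexed array update, which is exactly the kind of computation an array language expresses naturally.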
Python seems to be the popular language for coding these algorithms, though it is not particularly noted for its array handling. It would seem that an array-oriented language such as J would be better suited to implementing word2vec and doc2vec. Here are implementations of both algorithms in Python:

- Word2vec (Python) - https://radimrehurek.com/gensim/models/word2vec.html and https://github.com/danielfrg/word2vec
- Doc2Vec (Paragraph Vectors) (Python) - https://github.com/jhlau/doc2vec

How hard would it be to implement these two algorithms in J? I don't know Python, so I can't judge the complexity.

Skip
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm