Martin,

Your article in the Journal of J was exactly what I was looking for! It
provides a great introduction to Natural Language Processing in J. The
scope of machine learning is quite broad, ranging from face recognition to
extracting patterns from stock prices. My current focus, however, is the
analysis of human language (primarily English) in order to extract meaning
from text.

The algorithm you used, Term Frequency-Inverse Document Frequency
<https://deeplearning4j.org/bagofwords-tf-idf#term-frequency-inverse-document-frequency-tf-idf>
or TF-IDF, is an improvement over the basic Bag of Words count approach to
finding similarities between documents. TF-IDF takes the inverse of a
word's document frequency into account: the more documents a word appears
in, the less useful that word is for finding similar documents. Thus words
like "the" and "and" are heavily discounted as possible indicators of
matching documents.
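To make the weighting concrete, here is a minimal sketch of TF-IDF in
Python (standard library only; the toy corpus and function name are my own
illustration, not taken from your article):

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute TF-IDF weights for a list of tokenized documents.
    TF is the raw count of a term in a document; IDF is log(N / df),
    where df is the number of documents containing the term."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # each document counts once per term
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: c * math.log(n / df[t]) for t, c in tf.items()})
    return weights

docs = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]
w = tf_idf(docs)
# "the" appears in every document, so its weight is log(3/3) = 0
```

Note how the ubiquitous "the" gets weight zero, while "dog", which appears
in only one document, gets the largest weight.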

Since TF-IDF was proposed by Karen Spärck Jones
<https://en.wikipedia.org/wiki/Karen_Sp%C3%A4rck_Jones> in 1972, it has
been used in all kinds of applications. It is often used as a weighting
factor <https://en.wikipedia.org/wiki/Weighting_factor> in searches of
information retrieval, text mining
<https://en.wikipedia.org/wiki/Text_mining>, and user modeling
<https://en.wikipedia.org/wiki/User_modeling>. The tf-idf value increases
proportionally <https://en.wikipedia.org/wiki/Proportionality_(mathematics)>
 to the number of times a word appears in the document, but is often offset
by the frequency of the word in the corpus, which helps to adjust for the
fact that some words appear more frequently in general. Nowadays, tf-idf is
one of the most popular term-weighting schemes. For instance, 83% of
text-based recommender systems in the domain of digital libraries use
tf-idf.[2] <https://en.wikipedia.org/wiki/Tf%E2%80%93idf#cite_note-2>

Variations of the tf–idf weighting scheme are often used by search engines
<https://en.wikipedia.org/wiki/Search_engine> as a central tool in scoring
and ranking a document's relevance
<https://en.wikipedia.org/wiki/Relevance_(information_retrieval)> given a
user query <https://en.wikipedia.org/wiki/Information_retrieval>. tf–idf
can be successfully used for stop-words
<https://en.wikipedia.org/wiki/Stop-words> filtering in various subject
fields, including text summarization
<https://en.wikipedia.org/wiki/Automatic_summarization> and classification.
(Wikipedia <https://goo.gl/TuUeD2>)

Subsequently, TF-IDF was improved upon by applying a dimensionality
reduction to the extremely long TF-IDF word vectors, an approach named
Latent Semantic Analysis, or LSA <https://goo.gl/LnZoyJ>.
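As a sketch of that reduction step (assuming NumPy is available; the tiny
matrix and its values are purely illustrative):

```python
import numpy as np

# Toy term-document matrix of TF-IDF weights (4 terms x 3 documents).
A = np.array([[0.0, 1.1, 0.0],
              [1.4, 0.0, 1.4],
              [0.9, 0.9, 0.0],
              [0.0, 1.4, 1.4]])

# LSA: a truncated SVD keeps only the k strongest latent "topics",
# collapsing the long sparse term-space vectors into short dense ones.
k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
doc_vectors = (np.diag(s[:k]) @ Vt[:k]).T  # one k-dim vector per document
print(doc_vectors.shape)
```

Documents are then compared in the reduced k-dimensional space instead of
the full vocabulary space.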

Recently, a major advance in extracting meaning from words was developed by
Mikolov et al. in 2013. Mikolov took into account the occurrence of the
neighboring words that surround each target word in a document. Here's a
description of the scheme, called word2vec, with links to the pertinent set
of Mikolov's papers, as well as a good overview of the technique:
https://goo.gl/LnZoyJ
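The core idea, pairing each target word with its nearby neighbors, can be
sketched in a few lines of Python (a toy illustration of the skip-gram
windowing step only, not Mikolov's actual training code):

```python
def skipgram_pairs(tokens, window=2):
    """Yield (target, context) pairs: each word paired with every
    neighbor within `window` positions, as in skip-gram word2vec."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

pairs = skipgram_pairs(["the", "cat", "sat", "on", "the", "mat"], window=1)
# word2vec then trains a shallow network on such pairs, so that words
# appearing in similar contexts end up with similar vectors
```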

In 2014, Mikolov introduced "Paragraph Vectors <https://goo.gl/8mCviH>",
which looks at word sequences in whole sentences, paragraphs, or entire
documents, rather than just single-word neighbors. This scheme turned out
to work even better than word2vec when trying to find similarity between
multiple documents. The name Paragraph Vectors was soon changed to
"doc2vec", as that name captures the essential similarity and difference
between the two techniques. Currently, doc2vec seems to be the best
unsupervised method for finding similarities between blocks of text. To
improve on doc2vec matching, one must build hand-tweaked ontologies of the
documents, which is generally a complex and tedious manual task.
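Whatever scheme produces the document vectors (TF-IDF, LSA, or doc2vec),
similarity between two documents is typically measured as the cosine of the
angle between their vectors. A standard-library sketch, with made-up vector
values:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors:
    dot(a, b) / (|a| * |b|). 1.0 means same direction, 0.0 orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

doc1 = [0.2, 0.7, 0.1]  # illustrative document vectors
doc2 = [0.2, 0.7, 0.1]
doc3 = [0.9, 0.0, 0.0]
print(cosine_similarity(doc1, doc2))  # identical vectors -> 1.0
```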

There have been several Python implementations of both word2vec and
doc2vec. Here's an open-source version of word2vec
<https://goo.gl/UgXY31> and one of doc2vec <https://goo.gl/fNxYmF>.

I'm hoping to get a J version of doc2vec running at some point.

Skip

Skip Cave
Cave Consulting LLC

On Thu, Nov 9, 2017 at 2:37 PM, Martin Saurer <[email protected]>
wrote:

> Skip,
>
> Just in case you are interested in a very very small part
> of NLP implemented in J, you may take a look at my latest
> article/paper in the Journal of J.
>
> http://www.journalofj.com/index.php/vol-5-no-1-august-2017
>
> It's about word vectorization, and the implementation of
> the Term-Frequency Inverse-Document-Frequency model to
> calculate the cosine similarity between two documents.
>
> Best,
>
> Martin
>
>
> > Date: Wed, 8 Nov 2017 02:41:05 -0600
> > From: "'Skip Cave' via Programming" <[email protected]>
> > To: "[email protected]" <[email protected]>
> > Subject: [Jprogramming] Natural Language Processing
> > Message-ID:
> >   <caj8lg_fs1jfrnlr70nab1cq5c3u6yaf6iwervcxzuxyuhhs...@mail.gmail.com>
> > Content-Type: text/plain; charset="UTF-8"
> >
> > Natural Language Processing is one of the hottest fields in
> > programming today. Recent machine learning and neural network advances
> have
> > made significant improvements in all aspects of NLP. Speech Recognition,
> > Speech Synthesis, Knowledge Extraction, and Natural Language
> Understanding
> > have all improved dramatically, just within the last few years.
> >
> > Conversational AI devices like Amazon's Echo (Alexa) and Google Home are
> > showing up in homes everywhere. Conversational software applications such
> > as Google's Assistant (Android), Microsoft's Cortana (Windows), and
> Apple's
> > Siri (iOS) are on every phone and PC.
> >
> > There are lots of open-source NLP toolkits available to help one build
> > these conversational apps. They are written in various languages:
> >
> >  - Natural Language Toolkit (Python) - http://www.nltk.org/ and
> >    https://github.com/nltk
> >  - The Stanford NLP Group (Java) - https://nlp.stanford.edu/software/ and
> >    https://stanfordnlp.github.io/CoreNLP/
> >  - Apache Open NLP - http://opennlp.apache.org/
> >  - CRAN NLP (in R) - https://cran.r-project.org/web/packages/NLP/index.html
> >
> >
> > Two of the newest algorithms used for extracting meaning from text are
> > word2vec & doc2vec (doc2vec is also called Paragraph Vectors). Both of
> > these algorithms use a technique called "word embeddings" to encode
> > individual words. This is particularly interesting because the algorithms
> > are able to extract significant information from unstructured text by
> > simply analyzing word sequences and the probabilistic relationships
> between
> > neighboring words in any text.
> >
> > NLP processes are by nature highly parallel array oriented processes,
> > dealing with strings and arrays of words. Word 2vec and doc2vec are
> typical
> > in this regard. Both of these algorithms encode words (word2vec) and
> > sentences (doc2vec) in a multi-dimensional space (usually 100-200
> > dimensions) where machine learning techniques will then cause similar
> words
> > and concepts to gravitate into clusters in the multi-dimensional space.
> >
> > Here's an overview of how Word2vec extracts meaning from text:
> > The amazing power of word vectors | the morning paper
> > <https://blog.acolyer.org/2016/04/21/the-amazing-power-of-word-vectors/>
> >
> > Python seems to be the popular language for coding these algorithms,
> though
> > it is not particularly noted for its array-handling properties.  It would
> > seem that array oriented languages such as J would be more suited to
> > implementing word2vec and doc2vec.
> >
> > Here are implementations of both algorithms in Python:
> >
> >  - Word2vec (Python) - https://radimrehurek.com/gensim/models/word2vec.html
> >    and https://github.com/danielfrg/word2vec
> >  - Doc2Vec (Paragraph Vectors) (Python) - https://github.com/jhlau/doc2vec
> >
> >
> > How hard would it be to implement these two algorithms in J? I don't know
> > Python, so I can't judge the complexity.
> >
> > Skip
>
> ----------------------------------------------------------------------
> For information about J forums see http://www.jsoftware.com/forums.htm
>