Github user myui commented on a diff in the pull request:
https://github.com/apache/incubator-hivemall/pull/116#discussion_r141544983
--- Diff: docs/gitbook/embedding/word2vec.md ---
@@ -0,0 +1,399 @@
+<!--
+ Licensed to the Apache Software Foundation (ASF) under one
+ or more contributor license agreements. See the NOTICE file
+ distributed with this work for additional information
+ regarding copyright ownership. The ASF licenses this file
+ to you under the Apache License, Version 2.0 (the
+ "License"); you may not use this file except in compliance
+ with the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing,
+ software distributed under the License is distributed on an
+ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ KIND, either express or implied. See the License for the
+ specific language governing permissions and limitations
+ under the License.
+-->
+
+Word embedding is a powerful tool for many tasks,
+e.g., finding similar words, building feature vectors for supervised machine learning tasks,
+and solving word analogies such as `king - man + woman =~ queen`.
+In word embedding, each word is represented as a low-dimensional dense vector.
+**Skip-Gram** and **Continuous Bag-of-words** (CBoW) are the most popular algorithms to obtain good word embeddings (a.k.a. word2vec).
+
+The papers that introduce these methods are as follows:
+
+- T. Mikolov, et al., [Distributed Representations of Words and Phrases and Their Compositionality](http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf). NIPS, 2013.
+- T. Mikolov, et al., [Efficient Estimation of Word Representations in Vector Space](https://arxiv.org/abs/1301.3781). ICLR, 2013.
+
+Hivemall provides two types of algorithms: Skip-gram and CBoW with negative sampling.
+Hivemall enables you to train word2vec on your sequence data such as, but not limited to, documents.
+This article gives usage instructions for the feature.
+
+<!-- toc -->
+
+> #### Note
+> This feature is supported from Hivemall v0.5-rc.? or later.
+
+# Prepare document data
+
+Assume that you already have a `docs` table which contains many documents in string format, each with a unique index:
+
+```sql
+select * FROM docs;
+```
+
+| docId | doc |
+|:----: |:----|
+| 0 | "Alice was beginning to get very tired of sitting by her sister
on the bank ..." |
+| ... | ... |
+
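+If you do not have such a table yet, one way to create it is to load tab-separated text files. The following is a minimal sketch; the directory path and file layout are assumptions for illustration, not part of Hivemall:
+
+```sql
+-- a minimal sketch: each file under the (hypothetical) directory contains lines of "docid<TAB>document text"
+create external table docs (
+  docid int,
+  doc string
+)
+row format delimited fields terminated by '\t'
+stored as textfile
+location '/tmp/docs/';
+```
+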
+First, each document is split into words by a tokenization function such as [`tokenize`](../misc/tokenizer.html).
+
+```sql
+drop table docs_words;
+create table docs_words as
+ select
+ docid,
+ tokenize(doc, true) as words
+ FROM
+ docs
+;
+```
+
+This table shows the tokenized documents.
+
+| docId | words |
+|:----: |:----|
+| 0 | ["alice", "was", "beginning", "to", "get", "very", "tired", "of", "sitting", "by", "her", "sister", "on", "the", "bank", ...] |
+| ... | ... |
+
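+You can check the number of tokens per document, which becomes relevant later when long documents are split into sub-documents. For example:
+
+```sql
+-- inspect document lengths after tokenization
+select
+  docid,
+  size(words) as num_tokens
+from
+  docs_words
+order by num_tokens desc
+limit 10;
+```
+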
+Then, count the frequency of each word and remove low-frequency words from the vocabulary.
+Removing low-frequency words is an optional preprocessing step, but it is effective for training word vectors quickly.
+
+```sql
+set hivevar:mincount=5;
+
+drop table freq;
+create table freq as
+select
+ row_number() over () - 1 as wordid,
+ word,
+ freq
+from (
+ select
+ word,
+ COUNT(*) as freq
+ from
+ docs_words
+ LATERAL VIEW explode(words) lTable as word
+ group by
+ word
+) t
+where freq >= ${mincount}
+;
+```
+
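+For example, you can check the resulting vocabulary size and the total number of training words as follows:
+
+```sql
+-- vocabulary size and corpus size after removing low frequency words
+select
+  count(1) as vocab_size,
+  sum(freq) as num_train_words
+from
+  freq;
+```
+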
+Hivemall's word2vec supports two word types: string and int.
+The string type tends to use a huge amount of memory during training, while the int type uses less.
+If you train on a small dataset, we recommend the string type because the memory usage is negligible and the HiveQL is simpler.
+If you train on a large dataset, we recommend the int type because it saves memory during training.
+
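+Note that the `freq` table created above already provides the string-to-int mapping, so you can look up the int id of a word when needed (the example words below are arbitrary):
+
+```sql
+-- look up the int ids assigned to particular words
+select
+  wordid,
+  word
+from
+  freq
+where word in ('alice', 'rabbit');
+```
+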
+# Create sub-sampling table
+
+The sub-sampling table stores a sub-sampling probability per word.
+
+The sub-sampling probability of word $$w_i$$ is computed by the following equation:
+
+$$
+\begin{aligned}
+f(w_i) = \sqrt{\frac{\mathrm{sample}}{freq(w_i)/\sum freq(w)}} + \frac{\mathrm{sample}}{freq(w_i)/\sum freq(w)}
+\end{aligned}
+$$
+
+During word2vec training, words that are not sub-sampled are ignored.
+This speeds up training and mitigates the imbalance between rare and frequent words by down-sampling frequent words.
+The smaller the `sample` value, the fewer words are used during training.
+
+```sql
+set hivevar:sample=1e-4;
+
+drop table subsampling_table;
+create table subsampling_table as
+with stats as (
+ select
+ sum(freq) as numTrainWords
+ FROM
+ freq
+)
+select
+ l.wordid,
+ l.word,
+  sqrt(${sample}/(l.freq/r.numTrainWords)) + ${sample}/(l.freq/r.numTrainWords) as p
+from
+ freq l
+cross join
+ stats r
+;
+```
+
+```sql
+select * FROM subsampling_table order by p;
+```
+
+| wordid | word | p |
+|:----: | :----: |:----:|
+| 48645 | the | 0.04013665|
+| 11245 | of | 0.052463654|
+| 16368 | and | 0.06555538|
+| 61938 | 00 | 0.068162076|
+| 19977 | in | 0.071441144|
+| 83599 | 0 | 0.07528994|
+| 95017 | a | 0.07559573|
+| 1225 | to | 0.07953133|
+| 37062 | 0000 | 0.08779001|
+| 58246 | is | 0.09049763|
+| ... | ... |... |
+
+The first row shows that only about 4% of the occurrences of `the` in the documents are used during training.
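+
+To see how much sub-sampling shrinks the corpus, you can estimate the expected number of surviving word occurrences. Probabilities larger than 1.0 are capped, since such words are always kept:
+
+```sql
+-- expected number of word occurrences surviving sub-sampling
+select
+  sum(l.freq * least(r.p, 1.0)) as expected_train_words
+from
+  freq l
+  join subsampling_table r on (l.wordid = r.wordid);
+```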
+
+# Delete low frequency words and high frequency words from `docs_words`
+
+To remove useless words from the corpus, low-frequency and high-frequency words are deleted.
+In addition, to avoid loading a long document into memory, each document is split into several sub-documents.
+
+```sql
+set hivevar:maxlength=1500;
+set hivevar:seed=31;
+
+drop table train_docs;
+create table train_docs as
+ with docs_exploded as (
+ select
+ docid,
+ word,
+ pos % ${maxlength} as pos,
+ pos div ${maxlength} as splitid,
+ rand(${seed}) as rnd
+ from
+ docs_words LATERAL VIEW posexplode(words) t as pos, word
+ )
+select
+ l.docid,
+ -- to_ordered_list(l.word, l.pos) as words
+  to_ordered_list(r2.wordid, l.pos) as words
+from
+ docs_exploded l
+ LEFT SEMI join freq r on (l.word = r.word)
+ join subsampling_table r2 on (l.word = r2.word)
+where
+ r2.p > l.rnd
+group by
+ l.docid, l.splitid
+;
+```
+
+If you want to store string words in the `train_docs` table, replace `to_ordered_list(r2.wordid, l.pos) as words` with `to_ordered_list(l.word, l.pos) as words`.
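+
+You can verify the result, for example by checking the number of sub-documents and their average length:
+
+```sql
+-- sanity check: number and average length of the generated sub-documents
+select
+  count(1) as num_subdocs,
+  avg(size(words)) as avg_length
+from
+  train_docs;
+```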
+
+# Create negative sampling table
+
+Negative sampling is an approximation of the [softmax function](https://en.wikipedia.org/wiki/Softmax_function).
+Here, the `negative_table` stores the word sampling probabilities used for negative sampling.
+`z` is a hyperparameter of the noise distribution for negative sampling.
+During word2vec training,
--- End diff --
This line break is not needed. Line breaks after `,` are unreasonable (elsewhere as well).
---