Fixed term vector space tutorial

Project: http://git-wip-us.apache.org/repos/asf/incubator-hivemall/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-hivemall/commit/62a97798
Tree: http://git-wip-us.apache.org/repos/asf/incubator-hivemall/tree/62a97798
Diff: http://git-wip-us.apache.org/repos/asf/incubator-hivemall/diff/62a97798

Branch: refs/heads/master
Commit: 62a97798bbab688d0f24f5126c755c67209f31af
Parents: b97af4f
Author: Makoto Yui <[email protected]>
Authored: Sat Nov 3 16:38:47 2018 +0900
Committer: Makoto Yui <[email protected]>
Committed: Sat Nov 3 16:38:47 2018 +0900

----------------------------------------------------------------------
 docs/gitbook/SUMMARY.md                    |  5 +++--
 docs/gitbook/ft_engineering/bm25.md        | 24 +++++++++++++++++++++++-
 docs/gitbook/ft_engineering/term_vector.md |  3 +++
 3 files changed, 29 insertions(+), 3 deletions(-)
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/62a97798/docs/gitbook/SUMMARY.md
----------------------------------------------------------------------
diff --git a/docs/gitbook/SUMMARY.md b/docs/gitbook/SUMMARY.md
index 3484bfb..31a0311 100644
--- a/docs/gitbook/SUMMARY.md
+++ b/docs/gitbook/SUMMARY.md
@@ -65,8 +65,9 @@
 * [Feature Transformation](ft_engineering/ft_trans.md)
 * [Feature vectorization](ft_engineering/vectorization.md)
 * [Quantify non-number features](ft_engineering/quantify.md)
-* [TF-IDF Calculation](ft_engineering/tfidf.md)
-* [BM25](ft_engineering/bm25.md)
+* [Term Vector Model](ft_engineering/term_vector.md)
+    * [TF-IDF Term Weighting](ft_engineering/tfidf.md)
+    * [Okapi BM25 Term Weighting](ft_engineering/bm25.md)
 
 ## Part IV - Evaluation
 

http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/62a97798/docs/gitbook/ft_engineering/bm25.md
----------------------------------------------------------------------
diff --git a/docs/gitbook/ft_engineering/bm25.md b/docs/gitbook/ft_engineering/bm25.md
index 4ca029f..b70ecfe 100644
--- a/docs/gitbook/ft_engineering/bm25.md
+++ b/docs/gitbook/ft_engineering/bm25.md
@@ -139,7 +139,29 @@
 from
 ;
 ```
-## Show important terms
+## Hyperparameters
+
+`bm25()`'s function signature and hyperparameters are as follows:
+
+```sql
+hive> select bm25();
+FAILED: SemanticException Line 1:7 Wrong arguments 'bm25':
+
+#arguments must be greater than or equal to 5: 0
+
+usage: bm25(double termFrequency, int docLength, double avgDocLength, int
+       numDocs, int numDocsWithTerm [, const string options]) - Return an
+       Okapi BM25 score in double [-b <arg>] [-d <arg>] [-k1 <arg>]
+       [-min_idf <arg>]
+ -b <arg>                   Hyperparameter with type double in range 0.0
+                            and 1.0 [default: 0.75]
+ -d,--delta <arg>           Hyperparameter delta of BM25+ [default: 0.0]
+ -k1 <arg>                  Hyperparameter with type double, usually in
+                            range 1.2 and 2.0 [default: 1.2]
+ -min_idf,--epsilon <arg>   Hyperparameter delta of BM25+ [default: 1e-8]
+```
+
+## Show important terms for each document
 
 ```sql
 select

http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/62a97798/docs/gitbook/ft_engineering/term_vector.md
----------------------------------------------------------------------
diff --git a/docs/gitbook/ft_engineering/term_vector.md b/docs/gitbook/ft_engineering/term_vector.md
new file mode 100644
index 0000000..ff8c61f
--- /dev/null
+++ b/docs/gitbook/ft_engineering/term_vector.md
@@ -0,0 +1,3 @@
+Term vector model or [Vector space model](https://en.wikipedia.org/wiki/Vector_space_model) is an algebraic model for representing text documents (and any objects, in general) as vectors of identifiers.
+
+It is used in information filtering, information retrieval, relevancy rankings, and machine learning.
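
For readers of the new Hyperparameters section, the following is a minimal Python sketch of how the five positional arguments and the `-k1`, `-b`, `-d` and `-min_idf` options could combine into a score. It uses the textbook Okapi BM25 / BM25+ formulation and is not taken from Hivemall's implementation. The function name `bm25_score`, the choice of the classic IDF, and the reading of `-min_idf` as a lower bound on the IDF are assumptions made for illustration.

```python
import math


def bm25_score(tf, doc_len, avg_doc_len, num_docs, num_docs_with_term,
               k1=1.2, b=0.75, delta=0.0, min_idf=1e-8):
    """Textbook Okapi BM25 (BM25+ when delta > 0) score for one term in one document.

    Argument names mirror the usage text in the diff; the formula itself is
    the standard one and is only assumed to resemble what bm25() computes.
    """
    # Classic IDF. It goes negative when a term occurs in more than half of
    # the documents, so it is floored at min_idf (assumed role of -min_idf).
    idf = math.log((num_docs - num_docs_with_term + 0.5)
                   / (num_docs_with_term + 0.5))
    idf = max(idf, min_idf)

    # Document-length normalization: b = 0 disables it, b = 1 applies it fully.
    length_norm = 1.0 - b + b * (doc_len / avg_doc_len)

    # Saturating term-frequency component controlled by k1.
    tf_component = tf * (k1 + 1.0) / (tf + k1 * length_norm)

    # BM25+ adds delta as a lower bound on the term-frequency component.
    return idf * (tf_component + delta)


if __name__ == "__main__":
    # One occurrence of a rare term in a document of average length.
    print(bm25_score(tf=1, doc_len=100, avg_doc_len=100,
                     num_docs=10, num_docs_with_term=1))
```

Under these assumptions, a term that appears in most documents gets an IDF clamped to `min_idf`, so its score stays slightly positive rather than turning negative.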
