http://git-wip-us.apache.org/repos/asf/incubator-hivemall-site/blob/68241a08/userguide/getting_started/input-format.html ---------------------------------------------------------------------- diff --git a/userguide/getting_started/input-format.html b/userguide/getting_started/input-format.html index 8e7e876..d1236c7 100644 --- a/userguide/getting_started/input-format.html +++ b/userguide/getting_started/input-format.html @@ -999,6 +999,21 @@ </li> + <li class="chapter " data-level="5.6" data-path="../binaryclass/titanic_rf.html"> + + <a href="../binaryclass/titanic_rf.html"> + + + <b>5.6.</b> + + Kaggle Titanic Tutorial + + </a> + + + + </li> + @@ -1678,11 +1693,11 @@ Here, we use <a href="http://en.wikipedia.org/wiki/Extended_Backus%E2%80%93Naur_ </div><!-- tocstop --> <h1 id="input-format-for-classification">Input Format for Classification</h1> -<p>The classifiers of Hivemall takes 2 (or 3) arguments: <em>features</em>, <em>label</em>, and <em>options</em> (a.k.a. <a href="http://en.wikipedia.org/wiki/Hyperparameter" target="_blank">hyperparameters</a>). The first two arguments of training functions (e.g., <a href="https://github.com/myui/hivemall/wiki/a9a-binary-classification-(logistic-regression" target="_blank">logress</a>) and <a href="https://github.com/myui/hivemall/wiki/news20-binary-classification-%232-(CW,-AROW,-SCW" target="_blank">train_scw</a>)) represents training examples. </p> +<p>The classifiers of Hivemall take 2 (or 3) arguments: <em>features</em>, <em>label</em>, and <em>options</em> (a.k.a. <a href="http://en.wikipedia.org/wiki/Hyperparameter" target="_blank">hyperparameters</a>). The first two arguments of training functions represent training examples. 
</p> <p>In Statistics, <em>features</em> and <em>label</em> are called <a href="http://www.oswego.edu/~srp/stats/variable_types.htm" target="_blank">Explanatory variable and Response Variable</a>, respectively.</p> <h1 id="features-format-for-classification-and-regression">Features format (for classification and regression)</h1> <p>The format of <em>features</em> is common between (binary and multi-class) classification and regression. -Hivemall accepts ARRAY<INT|BIGINT|TEXT> for the type of <em>features</em> column.</p> +Hivemall accepts <code>ARRAY&lt;INT|BIGINT|TEXT></code> for the type of <em>features</em> column.</p> <p>Hivemall uses a <em>sparse</em> data format (cf. <a href="http://netlib.org/linalg/html_templates/node91.html" target="_blank">Compressed Row Storage</a>) which is similar to <a href="http://stackoverflow.com/questions/12112558/read-write-data-in-libsvm-format" target="_blank">LIBSVM</a> and <a href="https://github.com/JohnLangford/vowpal_wabbit/wiki/Input-format" target="_blank">Vowpal Wabbit</a>.</p> <p>The format of each feature in an array is as follows:</p> <pre><code>feature ::= <index>:<weight> or <index> @@ -1692,7 +1707,7 @@ weight ::= <FLOAT> </code></pre><p>The <em>index</em> are usually a number (INT or BIGINT) starting from 1. Here is an instance of a features.</p> <pre><code>10:3.4 123:0.5 34567:0.231 -</code></pre><p><em>Note:</em> As mentioned later, <em>index</em> "0" is reserved for a <a href="https://github.com/myui/hivemall/wiki/Using-explicit-addBias(" target="_blank">Bias/Dummy variable</a>-for-a-better-prediction).</p> +</code></pre><p><em>Note:</em> As mentioned later, <em>index</em> "0" is reserved for a <a href="../tips/addbias.html">Bias/Dummy variable</a>.</p> <p>In addition to numbers, you can use a TEXT value for an index. 
For example, you can use array("height:1.5", "length:2.0") for the features.</p> <pre><code>"height:1.5" "length:2.0" </code></pre><h2 id="quantitative-and-categorical-variables">Quantitative and Categorical variables</h2> @@ -1708,11 +1723,11 @@ Here is an instance of a features.</p> </code></pre><p>Note 1.0 is used for the weight when omitting <em>weight</em>. </p> <h2 id="biasdummy-variable-in-features">Bias/Dummy Variable in features</h2> <p>Note that "0" is reserved for a Bias variable (called dummy variable in Statistics). </p> -<p>The <a href="https://github.com/myui/hivemall/wiki/Using-explicit-addBias(" target="_blank">addBias</a>-for-a-better-prediction) function is Hivemall appends "0:1.0" as an element of array in <em>features</em>.</p> +<p>The <a href="../tips/addbias.html">addBias</a> function of Hivemall appends "0:1.0" as an element of the array in <em>features</em>.</p> <h2 id="feature-hashing">Feature hashing</h2> -<p>Hivemall supports <a href="http://en.wikipedia.org/wiki/Feature_hashing" target="_blank">feature hashing/hashing trick</a> through <a href="https://github.com/myui/hivemall/wiki/KDDCup-2012-track-2-CTR-prediction-dataset#converting-feature-representation-by-feature-hashing" target="_blank">mhash function</a>.</p> +<p>Hivemall supports <a href="http://en.wikipedia.org/wiki/Feature_hashing" target="_blank">feature hashing/hashing trick</a> through the <a href="../ft_engineering/hashing.html#mhash-function">mhash function</a>.</p> <p>The mhash function takes a feature (i.e., <em>index</em>) of TEXT format and generates a hash number of a range from 1 to 2^24 (=16777216) by the default setting.</p> -<p>Feature hashing is useful where the dimension of feature vector (i.e., the number of elements in <em>features</em>) is so large. 
Consider applying <a href="(https:/github.com/myui/hivemall/wiki/KDDCup-2012-track-2-CTR-prediction-dataset#converting-feature-representation-by-feature-hashing">mhash function</a>) when a prediction model does not fit in memory and OutOfMemory exception happens.</p> +<p>Feature hashing is useful when the dimension of the feature vector (i.e., the number of elements in <em>features</em>) is very large. Consider applying the <a href="../ft_engineering/hashing.html#mhash-function">mhash function</a> when a prediction model does not fit in memory and an OutOfMemory exception occurs.</p> <p>In general, you don't need to use mhash when the dimension of feature vector is less than 16777216. If feature <em>index</em> is very long TEXT (e.g., "xxxxxxx-yyyyyy-weight:55.3") and uses huge memory spaces, consider using mhash as follows:</p> <pre><code class="lang-sql"><span class="hljs-comment">-- feature is v0.3.2 or before</span> @@ -1725,7 +1740,7 @@ feature(mhash(extract_feature("xxxxxxx-yyyyyy-weight:55.3")), extract_ <p>43352:55.3</p> </blockquote> <h2 id="feature-normalization">Feature Normalization</h2> -<p>Feature (weight) normalization is important in machine learning. Please refer <a href="https://github.com/myui/hivemall/wiki/Feature-scaling" target="_blank">https://github.com/myui/hivemall/wiki/Feature-scaling</a> for detail.</p> +<p>Feature (weight) normalization is important in machine learning. 
Please refer to <a href="../ft_engineering/scaling.html">this article</a> for details.</p> <hr> <h1 id="label-format-in-binary-classification">Label format in Binary Classification</h1> <p>The <em>label</em> must be an <em>INT</em> typed column and the values are positive (+1) or negative (-1) as follows:</p> @@ -1815,7 +1830,25 @@ feature(mhash(extract_feature("xxxxxxx-yyyyyy-weight:55.3")), extract_ <span class="hljs-keyword">from</span> <span class="hljs-keyword">table</span>; </code></pre> -<p><div id="page-footer"><hr><p><sub><font color="gray"> +<p><div id="page-footer"><hr><!-- + Licensed to the Apache Software Foundation (ASF) under one + or more contributor license agreements. See the NOTICE file + distributed with this work for additional information + regarding copyright ownership. The ASF licenses this file + to you under the Apache License, Version 2.0 (the + "License"); you may not use this file except in compliance + with the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, + software distributed under the License is distributed on an + "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + KIND, either express or implied. See the License for the + specific language governing permissions and limitations + under the License. +--> +<p><sub><font color="gray"> Apache Hivemall is an effort undergoing incubation at The Apache Software Foundation (ASF), sponsored by the Apache Incubator. 
</font></sub></p> </div></p> @@ -1852,7 +1885,7 @@ Apache Hivemall is an effort undergoing incubation at The Apache Software Founda <script> var gitbook = gitbook || []; gitbook.push(function() { - gitbook.page.hasChanged({"page":{"title":"Input Format","level":"1.2.3","depth":2,"next":{"title":"Tips for Effective Hivemall","level":"1.3","depth":1,"path":"tips/README.md","ref":"tips/README.md","articles":[{"title":"Explicit addBias() for better prediction","level":"1.3.1","depth":2,"path":"tips/addbias.md","ref":"tips/addbias.md","articles":[]},{"title":"Use rand_amplify() to better prediction results","level":"1.3.2","depth":2,"path":"tips/rand_amplify.md","ref":"tips/rand_amplify.md","articles":[]},{"title":"Real-time Prediction on RDBMS","level":"1.3.3","depth":2,"path":"tips/rt_prediction.md","ref":"tips/rt_prediction.md","articles":[]},{"title":"Ensemble learning for stable prediction","level":"1.3.4","depth":2,"path":"tips/ensemble_learning.md","ref":"tips/ensemble_learning.md","articles":[]},{"title":"Mixing models for a better prediction convergence (MIX server)","level":"1.3.5","depth":2,"path":"tips/mixserver.md","ref":"tips/mixserver.md","articles":[]},{ "title":"Run Hivemall on Amazon Elastic MapReduce","level":"1.3.6","depth":2,"path":"tips/emr.md","ref":"tips/emr.md","articles":[]}]},"previous":{"title":"Install as permanent 
functions","level":"1.2.2","depth":2,"path":"getting_started/permanent-functions.md","ref":"getting_started/permanent-functions.md","articles":[]},"dir":"ltr"},"config":{"plugins":["theme-api","edit-link","github","splitter","sitemap","etoc","callouts","toggle-chapters","anchorjs","codeblock-filename","expandable-chapters","multipart","codeblock-filename","katex","emphasize","localized-footer"],"styles":{"website":"styles/website.css","pdf":"styles/pdf.css","epub":"styles/epub.css","mobi":"styles/mobi.css","ebook":"styles/ebook.css","print":"styles/print.css"},"pluginsConfig":{"emphasize":{},"callouts":{},"etoc":{"maxdepth":3,"mindepth":1,"notoc":true},"github":{"url":"https://github.com/apache/incubator-hivemall/"},"splitter":{},"search":{},"downloadpdf":{"base":"https://github.com/apache/incubator-hivemall/ docs/gitbook","label":"PDF","multilingual":false},"multipart":{},"localized-footer":{"filename":"FOOTER.md"},"lunr":{"maxIndexSize":1000000,"ignoreSpecialCharacters":false},"katex":{},"fontsettings":{"theme":"white","family":"sans","size":2,"font":"sans"},"highlight":{},"codeblock-filename":{},"sitemap":{"hostname":"http://hivemall.incubator.apache.org/"},"theme-api":{"languages":[],"split":false,"theme":"dark"},"sharing":{"facebook":true,"twitter":true,"google":false,"weibo":false,"instapaper":false,"vk":false,"all":["facebook","google","twitter","weibo","instapaper"]},"edit-link":{"label":"Edit","base":"https://github.com/apache/incubator-hivemall/docs/gitbook"},"theme-default":{"styles":{"website":"styles/website.css","pdf":"styles/pdf.css","epub":"styles/epub.css","mobi":"styles/mobi.css","ebook":"styles/ebook.css","print":"styles/print.css"},"showLevel":true},"anchorjs":{"selector":"h1,h2,h3,*:not(.callout) > h4,h5"},"toggle-chapters":{},"expandable-chapters":{}},"theme":"defau 
lt","pdf":{"pageNumbers":true,"fontSize":12,"fontFamily":"Arial","paperSize":"a4","chapterMark":"pagebreak","pageBreaksBefore":"/","margin":{"right":62,"left":62,"top":56,"bottom":56}},"structure":{"langs":"LANGS.md","readme":"README.md","glossary":"GLOSSARY.md","summary":"SUMMARY.md"},"variables":{},"title":"Hivemall User Manual","links":{"sidebar":{"<i class=\"fa fa-home\"></i> Home":"http://hivemall.incubator.apache.org/"}},"gitbook":"3.x.x","description":"User Manual for Apache Hivemall"},"file":{"path":"getting_started/input-format.md","mtime":"2016-11-12T07:18:00.000Z","type":"markdown"},"gitbook":{"version":"3.2.2","time":"2016-11-14T10:40:22.987Z"},"basePath":"..","book":{"language":""}}); + gitbook.page.hasChanged({"page":{"title":"Input Format","level":"1.2.3","depth":2,"next":{"title":"Tips for Effective Hivemall","level":"1.3","depth":1,"path":"tips/README.md","ref":"tips/README.md","articles":[{"title":"Explicit addBias() for better prediction","level":"1.3.1","depth":2,"path":"tips/addbias.md","ref":"tips/addbias.md","articles":[]},{"title":"Use rand_amplify() to better prediction results","level":"1.3.2","depth":2,"path":"tips/rand_amplify.md","ref":"tips/rand_amplify.md","articles":[]},{"title":"Real-time Prediction on RDBMS","level":"1.3.3","depth":2,"path":"tips/rt_prediction.md","ref":"tips/rt_prediction.md","articles":[]},{"title":"Ensemble learning for stable prediction","level":"1.3.4","depth":2,"path":"tips/ensemble_learning.md","ref":"tips/ensemble_learning.md","articles":[]},{"title":"Mixing models for a better prediction convergence (MIX server)","level":"1.3.5","depth":2,"path":"tips/mixserver.md","ref":"tips/mixserver.md","articles":[]},{ "title":"Run Hivemall on Amazon Elastic MapReduce","level":"1.3.6","depth":2,"path":"tips/emr.md","ref":"tips/emr.md","articles":[]}]},"previous":{"title":"Install as permanent 
functions","level":"1.2.2","depth":2,"path":"getting_started/permanent-functions.md","ref":"getting_started/permanent-functions.md","articles":[]},"dir":"ltr"},"config":{"plugins":["theme-api","edit-link","github","splitter","sitemap","etoc","callouts","toggle-chapters","anchorjs","codeblock-filename","expandable-chapters","multipart","codeblock-filename","katex","emphasize","localized-footer"],"styles":{"website":"styles/website.css","pdf":"styles/pdf.css","epub":"styles/epub.css","mobi":"styles/mobi.css","ebook":"styles/ebook.css","print":"styles/print.css"},"pluginsConfig":{"emphasize":{},"callouts":{},"etoc":{"maxdepth":3,"mindepth":1,"notoc":true},"github":{"url":"https://github.com/apache/incubator-hivemall/"},"splitter":{},"search":{},"downloadpdf":{"base":"https://github.com/apache/incubator-hivemall/ docs/gitbook","label":"PDF","multilingual":false},"multipart":{},"localized-footer":{"filename":"FOOTER.md"},"lunr":{"maxIndexSize":1000000,"ignoreSpecialCharacters":false},"katex":{},"fontsettings":{"theme":"white","family":"sans","size":2,"font":"sans"},"highlight":{},"codeblock-filename":{},"sitemap":{"hostname":"http://hivemall.incubator.apache.org/"},"theme-api":{"languages":[],"split":false,"theme":"dark"},"sharing":{"facebook":true,"twitter":true,"google":false,"weibo":false,"instapaper":false,"vk":false,"all":["facebook","google","twitter","weibo","instapaper"]},"edit-link":{"label":"Edit","base":"https://github.com/apache/incubator-hivemall/docs/gitbook"},"theme-default":{"styles":{"website":"styles/website.css","pdf":"styles/pdf.css","epub":"styles/epub.css","mobi":"styles/mobi.css","ebook":"styles/ebook.css","print":"styles/print.css"},"showLevel":true},"anchorjs":{"selector":"h1,h2,h3,*:not(.callout) > h4,h5"},"toggle-chapters":{},"expandable-chapters":{}},"theme":"defau 
lt","pdf":{"pageNumbers":true,"fontSize":12,"fontFamily":"Arial","paperSize":"a4","chapterMark":"pagebreak","pageBreaksBefore":"/","margin":{"right":62,"left":62,"top":56,"bottom":56}},"structure":{"langs":"LANGS.md","readme":"README.md","glossary":"GLOSSARY.md","summary":"SUMMARY.md"},"variables":{},"title":"Hivemall User Manual","links":{"sidebar":{"<i class=\"fa fa-home\"></i> Home":"http://hivemall.incubator.apache.org/"}},"gitbook":"3.x.x","description":"User Manual for Apache Hivemall"},"file":{"path":"getting_started/input-format.md","mtime":"2016-11-17T10:42:51.000Z","type":"markdown"},"gitbook":{"version":"3.2.2","time":"2016-11-17T12:16:14.647Z"},"basePath":"..","book":{"language":""}}); }); </script> </div>
http://git-wip-us.apache.org/repos/asf/incubator-hivemall-site/blob/68241a08/userguide/getting_started/installation.html ---------------------------------------------------------------------- diff --git a/userguide/getting_started/installation.html b/userguide/getting_started/installation.html index f223bf0..6908e93 100644 --- a/userguide/getting_started/installation.html +++ b/userguide/getting_started/installation.html @@ -999,6 +999,21 @@ </li> + <li class="chapter " data-level="5.6" data-path="../binaryclass/titanic_rf.html"> + + <a href="../binaryclass/titanic_rf.html"> + + + <b>5.6.</b> + + Kaggle Titanic Tutorial + + </a> + + + + </li> + @@ -1664,7 +1679,25 @@ source /home/myui/tmp/define-all.hive; <pre><code>$ hive add jar /tmp/hivemall-core-xxx-with-dependencies.jar; source /tmp/define-all.hive; -</code></pre><p><div id="page-footer"><hr><p><sub><font color="gray"> +</code></pre><p><div id="page-footer"><hr><!-- + Licensed to the Apache Software Foundation (ASF) under one + or more contributor license agreements. See the NOTICE file + distributed with this work for additional information + regarding copyright ownership. The ASF licenses this file + to you under the Apache License, Version 2.0 (the + "License"); you may not use this file except in compliance + with the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, + software distributed under the License is distributed on an + "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + KIND, either express or implied. See the License for the + specific language governing permissions and limitations + under the License. +--> +<p><sub><font color="gray"> Apache Hivemall is an effort undergoing incubation at The Apache Software Foundation (ASF), sponsored by the Apache Incubator. 
</font></sub></p> </div></p> @@ -1701,7 +1734,7 @@ Apache Hivemall is an effort undergoing incubation at The Apache Software Founda <script> var gitbook = gitbook || []; gitbook.push(function() { - gitbook.page.hasChanged({"page":{"title":"Installation","level":"1.2.1","depth":2,"next":{"title":"Install as permanent functions","level":"1.2.2","depth":2,"path":"getting_started/permanent-functions.md","ref":"getting_started/permanent-functions.md","articles":[]},"previous":{"title":"Getting Started","level":"1.2","depth":1,"path":"getting_started/README.md","ref":"getting_started/README.md","articles":[{"title":"Installation","level":"1.2.1","depth":2,"path":"getting_started/installation.md","ref":"getting_started/installation.md","articles":[]},{"title":"Install as permanent functions","level":"1.2.2","depth":2,"path":"getting_started/permanent-functions.md","ref":"getting_started/permanent-functions.md","articles":[]},{"title":"Input Format","level":"1.2.3","depth":2,"path":"getting_started/input-format.md","ref":"getting_started/input-format.md","articles":[]}]},"dir":"ltr"},"config":{"plugins":["theme-api","edit-link","github","splitter","sitemap","etoc","callout 
s","toggle-chapters","anchorjs","codeblock-filename","expandable-chapters","multipart","codeblock-filename","katex","emphasize","localized-footer"],"styles":{"website":"styles/website.css","pdf":"styles/pdf.css","epub":"styles/epub.css","mobi":"styles/mobi.css","ebook":"styles/ebook.css","print":"styles/print.css"},"pluginsConfig":{"emphasize":{},"callouts":{},"etoc":{"maxdepth":3,"mindepth":1,"notoc":true},"github":{"url":"https://github.com/apache/incubator-hivemall/"},"splitter":{},"search":{},"downloadpdf":{"base":"https://github.com/apache/incubator-hivemall/docs/gitbook","label":"PDF","multilingual":false},"multipart":{},"localized-footer":{"filename":"FOOTER.md"},"lunr":{"maxIndexSize":1000000,"ignoreSpecialCharacters":false},"katex":{},"fontsettings":{"theme":"white","family":"sans","size":2,"font":"sans"},"highlight":{},"codeblock-filename":{},"sitemap":{"hostname":"http://hivemall.incubator.apache.org/"},"theme-api":{"languages":[],"split":false,"theme":"dark"},"sharing":{ "facebook":true,"twitter":true,"google":false,"weibo":false,"instapaper":false,"vk":false,"all":["facebook","google","twitter","weibo","instapaper"]},"edit-link":{"label":"Edit","base":"https://github.com/apache/incubator-hivemall/docs/gitbook"},"theme-default":{"styles":{"website":"styles/website.css","pdf":"styles/pdf.css","epub":"styles/epub.css","mobi":"styles/mobi.css","ebook":"styles/ebook.css","print":"styles/print.css"},"showLevel":true},"anchorjs":{"selector":"h1,h2,h3,*:not(.callout) > h4,h5"},"toggle-chapters":{},"expandable-chapters":{}},"theme":"default","pdf":{"pageNumbers":true,"fontSize":12,"fontFamily":"Arial","paperSize":"a4","chapterMark":"pagebreak","pageBreaksBefore":"/","margin":{"right":62,"left":62,"top":56,"bottom":56}},"structure":{"langs":"LANGS.md","readme":"README.md","glossary":"GLOSSARY.md","summary":"SUMMARY.md"},"variables":{},"title":"Hivemall User Manual","links":{"sidebar":{"<i class=\"fa fa-home\"></i> Home":"http://hivemall.incubator.apache.org/ 
"}},"gitbook":"3.x.x","description":"User Manual for Apache Hivemall"},"file":{"path":"getting_started/installation.md","mtime":"2016-11-12T07:18:00.000Z","type":"markdown"},"gitbook":{"version":"3.2.2","time":"2016-11-14T10:40:22.987Z"},"basePath":"..","book":{"language":""}}); + gitbook.page.hasChanged({"page":{"title":"Installation","level":"1.2.1","depth":2,"next":{"title":"Install as permanent functions","level":"1.2.2","depth":2,"path":"getting_started/permanent-functions.md","ref":"getting_started/permanent-functions.md","articles":[]},"previous":{"title":"Getting Started","level":"1.2","depth":1,"path":"getting_started/README.md","ref":"getting_started/README.md","articles":[{"title":"Installation","level":"1.2.1","depth":2,"path":"getting_started/installation.md","ref":"getting_started/installation.md","articles":[]},{"title":"Install as permanent functions","level":"1.2.2","depth":2,"path":"getting_started/permanent-functions.md","ref":"getting_started/permanent-functions.md","articles":[]},{"title":"Input Format","level":"1.2.3","depth":2,"path":"getting_started/input-format.md","ref":"getting_started/input-format.md","articles":[]}]},"dir":"ltr"},"config":{"plugins":["theme-api","edit-link","github","splitter","sitemap","etoc","callout 
s","toggle-chapters","anchorjs","codeblock-filename","expandable-chapters","multipart","codeblock-filename","katex","emphasize","localized-footer"],"styles":{"website":"styles/website.css","pdf":"styles/pdf.css","epub":"styles/epub.css","mobi":"styles/mobi.css","ebook":"styles/ebook.css","print":"styles/print.css"},"pluginsConfig":{"emphasize":{},"callouts":{},"etoc":{"maxdepth":3,"mindepth":1,"notoc":true},"github":{"url":"https://github.com/apache/incubator-hivemall/"},"splitter":{},"search":{},"downloadpdf":{"base":"https://github.com/apache/incubator-hivemall/docs/gitbook","label":"PDF","multilingual":false},"multipart":{},"localized-footer":{"filename":"FOOTER.md"},"lunr":{"maxIndexSize":1000000,"ignoreSpecialCharacters":false},"katex":{},"fontsettings":{"theme":"white","family":"sans","size":2,"font":"sans"},"highlight":{},"codeblock-filename":{},"sitemap":{"hostname":"http://hivemall.incubator.apache.org/"},"theme-api":{"languages":[],"split":false,"theme":"dark"},"sharing":{ "facebook":true,"twitter":true,"google":false,"weibo":false,"instapaper":false,"vk":false,"all":["facebook","google","twitter","weibo","instapaper"]},"edit-link":{"label":"Edit","base":"https://github.com/apache/incubator-hivemall/docs/gitbook"},"theme-default":{"styles":{"website":"styles/website.css","pdf":"styles/pdf.css","epub":"styles/epub.css","mobi":"styles/mobi.css","ebook":"styles/ebook.css","print":"styles/print.css"},"showLevel":true},"anchorjs":{"selector":"h1,h2,h3,*:not(.callout) > h4,h5"},"toggle-chapters":{},"expandable-chapters":{}},"theme":"default","pdf":{"pageNumbers":true,"fontSize":12,"fontFamily":"Arial","paperSize":"a4","chapterMark":"pagebreak","pageBreaksBefore":"/","margin":{"right":62,"left":62,"top":56,"bottom":56}},"structure":{"langs":"LANGS.md","readme":"README.md","glossary":"GLOSSARY.md","summary":"SUMMARY.md"},"variables":{},"title":"Hivemall User Manual","links":{"sidebar":{"<i class=\"fa fa-home\"></i> Home":"http://hivemall.incubator.apache.org/ 
"}},"gitbook":"3.x.x","description":"User Manual for Apache Hivemall"},"file":{"path":"getting_started/installation.md","mtime":"2016-11-16T08:39:12.000Z","type":"markdown"},"gitbook":{"version":"3.2.2","time":"2016-11-17T12:16:14.647Z"},"basePath":"..","book":{"language":""}}); }); </script> </div> http://git-wip-us.apache.org/repos/asf/incubator-hivemall-site/blob/68241a08/userguide/getting_started/permanent-functions.html ---------------------------------------------------------------------- diff --git a/userguide/getting_started/permanent-functions.html b/userguide/getting_started/permanent-functions.html index e99496d..c55834f 100644 --- a/userguide/getting_started/permanent-functions.html +++ b/userguide/getting_started/permanent-functions.html @@ -999,6 +999,21 @@ </li> + <li class="chapter " data-level="5.6" data-path="../binaryclass/titanic_rf.html"> + + <a href="../binaryclass/titanic_rf.html"> + + + <b>5.6.</b> + + Kaggle Titanic Tutorial + + </a> + + + + </li> + @@ -1651,7 +1666,6 @@ --> <p>Hive v0.13 or later supports <a href="https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-Create/DropFunction" target="_blank">permanent functions</a> that live across sessions.</p> <p>Permanent functions are useful when you are using Hive through Hiveserver or to avoid hivemall installation for each session.</p> -<p><em>Note: This feature is supported since hivemall-0.3 beta 3 or later.</em></p> <!-- toc --><div id="toc" class="toc"> <ul> @@ -1683,7 +1697,25 @@ source /tmp/define-all-as-permanent.hive; > hivemall.adagrad </code></pre> <div class="panel panel-warning"><div class="panel-heading"><h3 class="panel-title" id="caution"><i class="fa fa-exclamation-triangle"></i> Caution</h3></div><div class="panel-body"><p>You need to specify "hivemall." 
prefix to call hivemall UDFs in your queries if UDFs are loaded into non-default scheme, in this case <em>hivemall</em>.</p></div></div> -<p><div id="page-footer"><hr><p><sub><font color="gray"> +<p><div id="page-footer"><hr><!-- + Licensed to the Apache Software Foundation (ASF) under one + or more contributor license agreements. See the NOTICE file + distributed with this work for additional information + regarding copyright ownership. The ASF licenses this file + to you under the Apache License, Version 2.0 (the + "License"); you may not use this file except in compliance + with the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, + software distributed under the License is distributed on an + "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + KIND, either express or implied. See the License for the + specific language governing permissions and limitations + under the License. +--> +<p><sub><font color="gray"> Apache Hivemall is an effort undergoing incubation at The Apache Software Foundation (ASF), sponsored by the Apache Incubator. 
</font></sub></p> </div></p> @@ -1720,7 +1752,7 @@ Apache Hivemall is an effort undergoing incubation at The Apache Software Founda <script> var gitbook = gitbook || []; gitbook.push(function() { - gitbook.page.hasChanged({"page":{"title":"Install as permanent functions","level":"1.2.2","depth":2,"next":{"title":"Input Format","level":"1.2.3","depth":2,"path":"getting_started/input-format.md","ref":"getting_started/input-format.md","articles":[]},"previous":{"title":"Installation","level":"1.2.1","depth":2,"path":"getting_started/installation.md","ref":"getting_started/installation.md","articles":[]},"dir":"ltr"},"config":{"plugins":["theme-api","edit-link","github","splitter","sitemap","etoc","callouts","toggle-chapters","anchorjs","codeblock-filename","expandable-chapters","multipart","codeblock-filename","katex","emphasize","localized-footer"],"styles":{"website":"styles/website.css","pdf":"styles/pdf.css","epub":"styles/epub.css","mobi":"styles/mobi.css","ebook":"styles/ebook.css","print":"styles/print.css"},"pluginsConfig":{"emphasize":{},"callouts":{},"etoc":{"maxdepth":3,"mindepth":1,"notoc":true},"github":{"url":"https://github.com/apache/incubator-hivemall 
/"},"splitter":{},"search":{},"downloadpdf":{"base":"https://github.com/apache/incubator-hivemall/docs/gitbook","label":"PDF","multilingual":false},"multipart":{},"localized-footer":{"filename":"FOOTER.md"},"lunr":{"maxIndexSize":1000000,"ignoreSpecialCharacters":false},"katex":{},"fontsettings":{"theme":"white","family":"sans","size":2,"font":"sans"},"highlight":{},"codeblock-filename":{},"sitemap":{"hostname":"http://hivemall.incubator.apache.org/"},"theme-api":{"languages":[],"split":false,"theme":"dark"},"sharing":{"facebook":true,"twitter":true,"google":false,"weibo":false,"instapaper":false,"vk":false,"all":["facebook","google","twitter","weibo","instapaper"]},"edit-link":{"label":"Edit","base":"https://github.com/apache/incubator-hivemall/docs/gitbook"},"theme-default":{"styles":{"website":"styles/website.css","pdf":"styles/pdf.css","epub":"styles/epub.css","mobi":"styles/mobi.css","ebook":"styles/ebook.css","print":"styles/print.css"},"showLevel":true},"anchorjs":{"selector" :"h1,h2,h3,*:not(.callout) > h4,h5"},"toggle-chapters":{},"expandable-chapters":{}},"theme":"default","pdf":{"pageNumbers":true,"fontSize":12,"fontFamily":"Arial","paperSize":"a4","chapterMark":"pagebreak","pageBreaksBefore":"/","margin":{"right":62,"left":62,"top":56,"bottom":56}},"structure":{"langs":"LANGS.md","readme":"README.md","glossary":"GLOSSARY.md","summary":"SUMMARY.md"},"variables":{},"title":"Hivemall User Manual","links":{"sidebar":{"<i class=\"fa fa-home\"></i> Home":"http://hivemall.incubator.apache.org/"}},"gitbook":"3.x.x","description":"User Manual for Apache Hivemall"},"file":{"path":"getting_started/permanent-functions.md","mtime":"2016-11-14T11:10:11.000Z","type":"markdown"},"gitbook":{"version":"3.2.2","time":"2016-11-14T11:11:31.970Z"},"basePath":"..","book":{"language":""}}); + gitbook.page.hasChanged({"page":{"title":"Install as permanent functions","level":"1.2.2","depth":2,"next":{"title":"Input 
Format","level":"1.2.3","depth":2,"path":"getting_started/input-format.md","ref":"getting_started/input-format.md","articles":[]},"previous":{"title":"Installation","level":"1.2.1","depth":2,"path":"getting_started/installation.md","ref":"getting_started/installation.md","articles":[]},"dir":"ltr"},"config":{"plugins":["theme-api","edit-link","github","splitter","sitemap","etoc","callouts","toggle-chapters","anchorjs","codeblock-filename","expandable-chapters","multipart","codeblock-filename","katex","emphasize","localized-footer"],"styles":{"website":"styles/website.css","pdf":"styles/pdf.css","epub":"styles/epub.css","mobi":"styles/mobi.css","ebook":"styles/ebook.css","print":"styles/print.css"},"pluginsConfig":{"emphasize":{},"callouts":{},"etoc":{"maxdepth":3,"mindepth":1,"notoc":true},"github":{"url":"https://github.com/apache/incubator-hivemall /"},"splitter":{},"search":{},"downloadpdf":{"base":"https://github.com/apache/incubator-hivemall/docs/gitbook","label":"PDF","multilingual":false},"multipart":{},"localized-footer":{"filename":"FOOTER.md"},"lunr":{"maxIndexSize":1000000,"ignoreSpecialCharacters":false},"katex":{},"fontsettings":{"theme":"white","family":"sans","size":2,"font":"sans"},"highlight":{},"codeblock-filename":{},"sitemap":{"hostname":"http://hivemall.incubator.apache.org/"},"theme-api":{"languages":[],"split":false,"theme":"dark"},"sharing":{"facebook":true,"twitter":true,"google":false,"weibo":false,"instapaper":false,"vk":false,"all":["facebook","google","twitter","weibo","instapaper"]},"edit-link":{"label":"Edit","base":"https://github.com/apache/incubator-hivemall/docs/gitbook"},"theme-default":{"styles":{"website":"styles/website.css","pdf":"styles/pdf.css","epub":"styles/epub.css","mobi":"styles/mobi.css","ebook":"styles/ebook.css","print":"styles/print.css"},"showLevel":true},"anchorjs":{"selector" :"h1,h2,h3,*:not(.callout) > 
h4,h5"},"toggle-chapters":{},"expandable-chapters":{}},"theme":"default","pdf":{"pageNumbers":true,"fontSize":12,"fontFamily":"Arial","paperSize":"a4","chapterMark":"pagebreak","pageBreaksBefore":"/","margin":{"right":62,"left":62,"top":56,"bottom":56}},"structure":{"langs":"LANGS.md","readme":"README.md","glossary":"GLOSSARY.md","summary":"SUMMARY.md"},"variables":{},"title":"Hivemall User Manual","links":{"sidebar":{"<i class=\"fa fa-home\"></i> Home":"http://hivemall.incubator.apache.org/"}},"gitbook":"3.x.x","description":"User Manual for Apache Hivemall"},"file":{"path":"getting_started/permanent-functions.md","mtime":"2016-11-17T09:55:29.000Z","type":"markdown"},"gitbook":{"version":"3.2.2","time":"2016-11-17T12:16:14.647Z"},"basePath":"..","book":{"language":""}}); }); </script> </div> http://git-wip-us.apache.org/repos/asf/incubator-hivemall-site/blob/68241a08/userguide/index.html ---------------------------------------------------------------------- diff --git a/userguide/index.html b/userguide/index.html index a2c73fd..d25c46e 100644 --- a/userguide/index.html +++ b/userguide/index.html @@ -997,6 +997,21 @@ </li> + <li class="chapter " data-level="5.6" data-path="binaryclass/titanic_rf.html"> + + <a href="binaryclass/titanic_rf.html"> + + + <b>5.6.</b> + + Kaggle Titanic Tutorial + + </a> + + + + </li> + @@ -1660,7 +1675,25 @@ Apache Hivemall is a collection of machine learning algorithms and versatile dat Thus, it can be considered as a cross platform library for machine learning; prediction models built by a batch query of Apache Hive can be used on Apache Spark/Pig, and conversely, prediction models build by Apache Spark can be used from Apache Hive/Pig.</p> <div style="text-align:center"><img src="resources/images/techstack.png" width="80%" height="80%"></div> -<p><div id="page-footer"><hr><p><sub><font color="gray"> +<p><div id="page-footer"><hr><!-- + Licensed to the Apache Software Foundation (ASF) under one + or more contributor license 
agreements. See the NOTICE file + distributed with this work for additional information + regarding copyright ownership. The ASF licenses this file + to you under the Apache License, Version 2.0 (the + "License"); you may not use this file except in compliance + with the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, + software distributed under the License is distributed on an + "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + KIND, either express or implied. See the License for the + specific language governing permissions and limitations + under the License. +--> +<p><sub><font color="gray"> Apache Hivemall is an effort undergoing incubation at The Apache Software Foundation (ASF), sponsored by the Apache Incubator. </font></sub></p> </div></p> @@ -1697,7 +1730,7 @@ Apache Hivemall is an effort undergoing incubation at The Apache Software Founda <script> var gitbook = gitbook || []; gitbook.push(function() { - gitbook.page.hasChanged({"page":{"title":"Introduction","level":"1.1","depth":1,"next":{"title":"Getting Started","level":"1.2","depth":1,"path":"getting_started/README.md","ref":"getting_started/README.md","articles":[{"title":"Installation","level":"1.2.1","depth":2,"path":"getting_started/installation.md","ref":"getting_started/installation.md","articles":[]},{"title":"Install as permanent functions","level":"1.2.2","depth":2,"path":"getting_started/permanent-functions.md","ref":"getting_started/permanent-functions.md","articles":[]},{"title":"Input 
Format","level":"1.2.3","depth":2,"path":"getting_started/input-format.md","ref":"getting_started/input-format.md","articles":[]}]},"dir":"ltr"},"config":{"plugins":["theme-api","edit-link","github","splitter","sitemap","etoc","callouts","toggle-chapters","anchorjs","codeblock-filename","expandable-chapters","multipart","codeblock-filename","katex","emphasize","localized-footer"],"styles":{"website":"styles/website.css","p df":"styles/pdf.css","epub":"styles/epub.css","mobi":"styles/mobi.css","ebook":"styles/ebook.css","print":"styles/print.css"},"pluginsConfig":{"emphasize":{},"callouts":{},"etoc":{"maxdepth":3,"mindepth":1,"notoc":true},"github":{"url":"https://github.com/apache/incubator-hivemall/"},"splitter":{},"search":{},"downloadpdf":{"base":"https://github.com/apache/incubator-hivemall/docs/gitbook","label":"PDF","multilingual":false},"multipart":{},"localized-footer":{"filename":"FOOTER.md"},"lunr":{"maxIndexSize":1000000,"ignoreSpecialCharacters":false},"katex":{},"fontsettings":{"theme":"white","family":"sans","size":2,"font":"sans"},"highlight":{},"codeblock-filename":{},"sitemap":{"hostname":"http://hivemall.incubator.apache.org/"},"theme-api":{"languages":[],"split":false,"theme":"dark"},"sharing":{"facebook":true,"twitter":true,"google":false,"weibo":false,"instapaper":false,"vk":false,"all":["facebook","google","twitter","weibo","instapaper"]},"edit-link":{"label":"Edit","base":"https ://github.com/apache/incubator-hivemall/docs/gitbook"},"theme-default":{"styles":{"website":"styles/website.css","pdf":"styles/pdf.css","epub":"styles/epub.css","mobi":"styles/mobi.css","ebook":"styles/ebook.css","print":"styles/print.css"},"showLevel":true},"anchorjs":{"selector":"h1,h2,h3,*:not(.callout) > 
h4,h5"},"toggle-chapters":{},"expandable-chapters":{}},"theme":"default","pdf":{"pageNumbers":true,"fontSize":12,"fontFamily":"Arial","paperSize":"a4","chapterMark":"pagebreak","pageBreaksBefore":"/","margin":{"right":62,"left":62,"top":56,"bottom":56}},"structure":{"langs":"LANGS.md","readme":"README.md","glossary":"GLOSSARY.md","summary":"SUMMARY.md"},"variables":{},"title":"Hivemall User Manual","links":{"sidebar":{"<i class=\"fa fa-home\"></i> Home":"http://hivemall.incubator.apache.org/"}},"gitbook":"3.x.x","description":"User Manual for Apache Hivemall"},"file":{"path":"README.md","mtime":"2016-11-14T10:15:30.000Z","type":"markdown"},"gitbook":{"version":"3.2.2","time":"20 16-11-14T10:40:22.987Z"},"basePath":".","book":{"language":""}}); + gitbook.page.hasChanged({"page":{"title":"Introduction","level":"1.1","depth":1,"next":{"title":"Getting Started","level":"1.2","depth":1,"path":"getting_started/README.md","ref":"getting_started/README.md","articles":[{"title":"Installation","level":"1.2.1","depth":2,"path":"getting_started/installation.md","ref":"getting_started/installation.md","articles":[]},{"title":"Install as permanent functions","level":"1.2.2","depth":2,"path":"getting_started/permanent-functions.md","ref":"getting_started/permanent-functions.md","articles":[]},{"title":"Input Format","level":"1.2.3","depth":2,"path":"getting_started/input-format.md","ref":"getting_started/input-format.md","articles":[]}]},"dir":"ltr"},"config":{"plugins":["theme-api","edit-link","github","splitter","sitemap","etoc","callouts","toggle-chapters","anchorjs","codeblock-filename","expandable-chapters","multipart","codeblock-filename","katex","emphasize","localized-footer"],"styles":{"website":"styles/website.css","p 
df":"styles/pdf.css","epub":"styles/epub.css","mobi":"styles/mobi.css","ebook":"styles/ebook.css","print":"styles/print.css"},"pluginsConfig":{"emphasize":{},"callouts":{},"etoc":{"maxdepth":3,"mindepth":1,"notoc":true},"github":{"url":"https://github.com/apache/incubator-hivemall/"},"splitter":{},"search":{},"downloadpdf":{"base":"https://github.com/apache/incubator-hivemall/docs/gitbook","label":"PDF","multilingual":false},"multipart":{},"localized-footer":{"filename":"FOOTER.md"},"lunr":{"maxIndexSize":1000000,"ignoreSpecialCharacters":false},"katex":{},"fontsettings":{"theme":"white","family":"sans","size":2,"font":"sans"},"highlight":{},"codeblock-filename":{},"sitemap":{"hostname":"http://hivemall.incubator.apache.org/"},"theme-api":{"languages":[],"split":false,"theme":"dark"},"sharing":{"facebook":true,"twitter":true,"google":false,"weibo":false,"instapaper":false,"vk":false,"all":["facebook","google","twitter","weibo","instapaper"]},"edit-link":{"label":"Edit","base":"https ://github.com/apache/incubator-hivemall/docs/gitbook"},"theme-default":{"styles":{"website":"styles/website.css","pdf":"styles/pdf.css","epub":"styles/epub.css","mobi":"styles/mobi.css","ebook":"styles/ebook.css","print":"styles/print.css"},"showLevel":true},"anchorjs":{"selector":"h1,h2,h3,*:not(.callout) > h4,h5"},"toggle-chapters":{},"expandable-chapters":{}},"theme":"default","pdf":{"pageNumbers":true,"fontSize":12,"fontFamily":"Arial","paperSize":"a4","chapterMark":"pagebreak","pageBreaksBefore":"/","margin":{"right":62,"left":62,"top":56,"bottom":56}},"structure":{"langs":"LANGS.md","readme":"README.md","glossary":"GLOSSARY.md","summary":"SUMMARY.md"},"variables":{},"title":"Hivemall User Manual","links":{"sidebar":{"<i class=\"fa fa-home\"></i> Home":"http://hivemall.incubator.apache.org/"}},"gitbook":"3.x.x","description":"User Manual for Apache Hivemall"},"file":{"path":"README.md","mtime":"2016-11-16T08:39:12.000Z","type":"markdown"},"gitbook":{"version":"3.2.2","time":"20 
16-11-17T12:16:14.647Z"},"basePath":".","book":{"language":""}}); }); </script> </div> http://git-wip-us.apache.org/repos/asf/incubator-hivemall-site/blob/68241a08/userguide/misc/generic_funcs.html ---------------------------------------------------------------------- diff --git a/userguide/misc/generic_funcs.html b/userguide/misc/generic_funcs.html index eec951a..f5edcc2 100644 --- a/userguide/misc/generic_funcs.html +++ b/userguide/misc/generic_funcs.html @@ -999,6 +999,21 @@ </li> + <li class="chapter " data-level="5.6" data-path="../binaryclass/titanic_rf.html"> + + <a href="../binaryclass/titanic_rf.html"> + + + <b>5.6.</b> + + Kaggle Titanic Tutorial + + </a> + + + + </li> + @@ -1650,53 +1665,74 @@ under the License. --> <p>This page describes a list of useful Hivemall generic functions.</p> +<!-- toc --><div id="toc" class="toc"> + +<ul> +<li><a href="#array-functions">Array functions</a><ul> +<li><a href="#array-udfs">Array UDFs</a></li> +<li><a href="#array-udafs">Array UDAFs</a></li> +</ul> +</li> +<li><a href="#bitset-functions">Bitset functions</a><ul> +<li><a href="#bitset-udf">Bitset UDF</a></li> +<li><a href="#bitset-udaf">Bitset UDAF</a></li> +</ul> +</li> +<li><a href="#compression-functions">Compression functions</a></li> +<li><a href="#map-functions">Map functions</a><ul> +<li><a href="#map-udfs">Map UDFs</a></li> +<li><a href="#map-udafs">MAP UDAFs</a></li> +</ul> +</li> +<li><a href="#mapreduce-functions">MapReduce functions</a></li> +<li><a href="#math-functions">Math functions</a></li> +<li><a href="#text-processing-functions">Text processing functions</a></li> +<li><a href="#other-functions">Other functions</a></li> +</ul> + +</div><!-- tocstop --> <h1 id="array-functions">Array functions</h1> <h2 id="array-udfs">Array UDFs</h2> <ul> -<li><code>array_concat(array<ANY> x1, array<ANY> x2, ..)</code> - Returns a concatenated array</li> -</ul> -<pre><code class="lang-sql">select array_concat(array(1),array(2,3)); -> [1,2,3] 
+<li><p><code>array_concat(array<ANY> x1, array<ANY> x2, ..)</code> - Returns a concatenated array</p> +<pre><code class="lang-sql"> select array_concat(array(1),array(2,3)); + > [1,2,3] </code></pre> -<ul> -<li><code>array_intersect(array<ANY> x1, array<ANY> x2, ..)</code> - Returns an intersect of given arrays</li> -</ul> -<pre><code class="lang-sql">select array_intersect(array(1,3,4),array(2,3,4),array(3,5)); -> [3] +</li> +<li><p><code>array_intersect(array<ANY> x1, array<ANY> x2, ..)</code> - Returns an intersect of given arrays</p> +<pre><code class="lang-sql"> select array_intersect(array(1,3,4),array(2,3,4),array(3,5)); + > [3] </code></pre> -<ul> -<li><code>array_remove(array<int|text> original, int|text|array<int> target)</code> - Returns an array that the target is removed from the original array</li> -</ul> -<pre><code class="lang-sql">select array_remove(array(1,null,3),array(null)); -> [3] +</li> +<li><p><code>array_remove(array<int|text> original, int|text|array<int> target)</code> - Returns an array that the target is removed from the original array</p> +<pre><code class="lang-sql"> select array_remove(array(1,null,3),array(null)); + > [3] -select array_remove(array("aaa","bbb"),"bbb"); -> ["aaa"] + select array_remove(array("aaa","bbb"),"bbb"); + > ["aaa"] </code></pre> -<ul> -<li><code>sort_and_uniq_array(array<int>)</code> - Takes an array of type int and returns a sorted array in a natural order with duplicate elements eliminated</li> -</ul> -<pre><code class="lang-sql">select sort_and_uniq_array(array(3,1,1,-2,10)); -> [-2,1,3,10] +</li> +<li><p><code>sort_and_uniq_array(array<int>)</code> - Takes an array of type INT and returns a sorted array in a natural order with duplicate elements eliminated</p> +<pre><code class="lang-sql"> select sort_and_uniq_array(array(3,1,1,-2,10)); + > [-2,1,3,10] </code></pre> -<ul> -<li><code>subarray_endwith(array<int|text> original, int|text key)</code> - Returns an array that ends with the specified key</li> 
-</ul> -<pre><code class="lang-sql">select subarray_endwith(array(1,2,3,4), 3); -> [1,2,3] +</li> +<li><p><code>subarray_endwith(array<int|text> original, int|text key)</code> - Returns an array that ends with the specified key</p> +<pre><code class="lang-sql"> select subarray_endwith(array(1,2,3,4), 3); + > [1,2,3] </code></pre> -<ul> -<li><code>subarray_startwith(array<int|text> original, int|text key)</code> - Returns an array that starts with the specified key</li> -</ul> -<pre><code class="lang-sql">select subarray_startwith(array(1,2,3,4), 2); -> [2,3,4] +</li> +<li><p><code>subarray_startwith(array<int|text> original, int|text key)</code> - Returns an array that starts with the specified key</p> +<pre><code class="lang-sql"> select subarray_startwith(array(1,2,3,4), 2); + > [2,3,4] </code></pre> -<ul> -<li><code>subarray(array<int> orignal, int fromIndex, int toIndex)</code> - Returns a slice of the original array between the inclusive fromIndex and the exclusive toIndex</li> -</ul> -<pre><code class="lang-sql">select subarray(array(1,2,3,4,5,6), 2,4); -> [3,4] +</li> +<li><p><code>subarray(array<int> orignal, int fromIndex, int toIndex)</code> - Returns a slice of the original array between the inclusive <code>fromIndex</code> and the exclusive <code>toIndex</code></p> +<pre><code class="lang-sql"> select subarray(array(1,2,3,4,5,6), 2,4); + > [3,4] </code></pre> +</li> +</ul> <h2 id="array-udafs">Array UDAFs</h2> <ul> <li><p><code>array_avg(array<NUMBER>)</code> - Returns an array<double> in which each element is the mean of a set of numbers</double></p> @@ -1707,41 +1743,40 @@ select array_remove(array("aaa","bbb"),"bbb"); <h1 id="bitset-functions">Bitset functions</h1> <h2 id="bitset-udf">Bitset UDF</h2> <ul> -<li><code>to_bits(int[] indexes)</code> - Returns an bitset representation if the given indexes in long[]</li> -</ul> -<pre><code class="lang-sql">select to_bits(array(1,2,3,128)); ->[14,-9223372036854775808] +<li><p><code>to_bits(int[] 
indexes)</code> - Returns an bitset representation if the given indexes in long[]</p> +<pre><code class="lang-sql"> select to_bits(array(1,2,3,128)); + >[14,-9223372036854775808] </code></pre> -<ul> -<li><code>unbits(long[] bitset)</code> - Returns an long array of the give bitset representation</li> -</ul> -<pre><code class="lang-sql">select unbits(to_bits(array(1,4,2,3))); -> [1,2,3,4] +</li> +<li><p><code>unbits(long[] bitset)</code> - Returns an long array of the give bitset representation</p> +<pre><code class="lang-sql"> select unbits(to_bits(array(1,4,2,3))); + > [1,2,3,4] </code></pre> -<ul> -<li><code>bits_or(array<long> b1, array<long> b2, ..)</code> - Returns a logical OR given bitsets</li> -</ul> -<pre><code class="lang-sql">select unbits(bits_or(to_bits(array(1,4)),to_bits(array(2,3)))); -> [1,2,3,4] +</li> +<li><p><code>bits_or(array<long> b1, array<long> b2, ..)</code> - Returns a logical OR given bitsets</p> +<pre><code class="lang-sql"> select unbits(bits_or(to_bits(array(1,4)),to_bits(array(2,3)))); + > [1,2,3,4] </code></pre> +</li> +</ul> <h2 id="bitset-udaf">Bitset UDAF</h2> <ul> <li><code>bits_collect(int|long x)</code> - Returns a bitset in array<long></long></li> </ul> <h1 id="compression-functions">Compression functions</h1> <ul> -<li><code>deflate(TEXT data [, const int compressionLevel])</code> - Returns a compressed BINARY obeject by using Deflater. -The compression level must be in range [-1,9]</li> -</ul> -<pre><code class="lang-sql">select base91(deflate('aaaaaaaaaaaaaaaabbbbccc')); -> AA+=kaIM|WTt!+wbGAA +<li><p><code>deflate(TEXT data [, const int compressionLevel])</code> - Returns a compressed BINARY object by using Deflater. 
+The compression level must be in range [-1,9]</p> +<pre><code class="lang-sql"> select base91(deflate('aaaaaaaaaaaaaaaabbbbccc')); + > AA+=kaIM|WTt!+wbGAA </code></pre> -<ul> -<li><code>inflate(BINARY compressedData)</code> - Returns a decompressed STRING by using Inflater</li> -</ul> -<pre><code class="lang-sql">select inflate(unbase91(base91(deflate('aaaaaaaaaaaaaaaabbbbccc')))); -> aaaaaaaaaaaaaaaabbbbccc +</li> +<li><p><code>inflate(BINARY compressedData)</code> - Returns a decompressed STRING by using Inflater</p> +<pre><code class="lang-sql"> select inflate(unbase91(base91(deflate('aaaaaaaaaaaaaaaabbbbccc')))); + > aaaaaaaaaaaaaaaabbbbccc </code></pre> +</li> +</ul> <h1 id="map-functions">Map functions</h1> <h2 id="map-udfs">Map UDFs</h2> <ul> @@ -1766,81 +1801,88 @@ The compression level must be in range [-1,9]</li> </ul> <h1 id="math-functions">Math functions</h1> <ul> -<li><code>sigmoid(x)</code> - Returns 1.0 / (1.0 + exp(-x))</li> +<li><code>sigmoid(x)</code> - Returns <code>1.0 / (1.0 + exp(-x))</code></li> </ul> <h1 id="text-processing-functions">Text processing functions</h1> <ul> -<li><code>base91(binary)</code> - Convert the argument from binary to a BASE91 string</li> -</ul> -<pre><code class="lang-sql">select base91(deflate('aaaaaaaaaaaaaaaabbbbccc')); -> AA+=kaIM|WTt!+wbGAA +<li><p><code>base91(binary)</code> - Convert the argument from binary to a BASE91 string</p> +<pre><code class="lang-sql"> select base91(deflate('aaaaaaaaaaaaaaaabbbbccc')); + > AA+=kaIM|WTt!+wbGAA </code></pre> -<ul> -<li><code>unbase91(string)</code> - Convert a BASE91 string to a binary</li> -</ul> -<pre><code class="lang-sql">select inflate(unbase91(base91(deflate('aaaaaaaaaaaaaaaabbbbccc')))); -> aaaaaaaaaaaaaaaabbbbccc +</li> +<li><p><code>unbase91(string)</code> - Convert a BASE91 string to a binary</p> +<pre><code class="lang-sql"> select inflate(unbase91(base91(deflate('aaaaaaaaaaaaaaaabbbbccc')))); + > aaaaaaaaaaaaaaaabbbbccc </code></pre> -<ul> 
-<li><code>normalize_unicode(string str [, string form])</code> - Transforms <code>str</code> with the specified normalization form. The <code>form</code> takes one of NFC (default), NFD, NFKC, or NFKD</li> -</ul> -<pre><code class="lang-sql">select normalize_unicode('ハンカクカナ','NFKC'); -> ハンカクカナ +</li> +<li><p><code>normalize_unicode(string str [, string form])</code> - Transforms <code>str</code> with the specified normalization form. The <code>form</code> takes one of NFC (default), NFD, NFKC, or NFKD</p> +<pre><code class="lang-sql"> select normalize_unicode('ハンカクカナ','NFKC'); + > ハンカクカナ -select normalize_unicode('㈱㌧㌦Ⅲ','NFKC'); -> (株)トンドルIII + select normalize_unicode('㈱㌧㌦Ⅲ','NFKC'); + > (株)トンドルIII </code></pre> -<ul> +</li> <li><p><code>split_words(string query [, string regex])</code> - Returns an array<text> containing splitted strings</text></p> </li> <li><p><code>is_stopword(string word)</code> - Returns whether English stopword or not</p> </li> <li><p><code>tokenize(string englishText [, boolean toLowerCase])</code> - Returns words in array<string></string></p> </li> -<li><p><code>tokenize_ja(String line [, const string mode = "normal", const list<string> stopWords, const list<string> stopTags])</code> - returns tokenized strings in array<string></string></p> -</li> -</ul> -<pre><code class="lang-sql">select tokenize_ja("kuromojiを使った分かち書きのテストです。第二引数にはnormal/search/extendedを指定できます。デフォルトではnormalモードです。"); +<li><p><code>tokenize_ja(String line [, const string mode = "normal", const list<string> stopWords, const list<string> stopTags])</code> - returns tokenized strings in array<string>. 
Refer <a href="tokenizer.html">this article</a> for detail.</string></p> +<pre><code class="lang-sql"> select tokenize_ja("kuromojiを使った分かち書きのテストです。第二引数にはnormal/search/extendedを指定できます。デフォルトではnormalモードです。"); -> ["kuromoji","使う","分かち書き","テスト","第","二","引数","normal","search","extended","指定","デフォルト","normal"," モード"] + > ["kuromoji","使う","分かち書き","テスト","第","二","引数","normal","search","extended","指定","デフォルト","normal"," モード"] </code></pre> -<p><a href="https://github.com/myui/hivemall/wiki/Tokenizer" target="_blank">https://github.com/myui/hivemall/wiki/Tokenizer</a></p> +</li> +</ul> <h1 id="other-functions">Other functions</h1> <ul> <li><p><code>convert_label(const int|const float)</code> - Convert from -1|1 to 0.0f|1.0f, or from 0.0f|1.0f to -1|1</p> </li> -<li><p><code>each_top_k(int K, Object group, double cmpKey, *)</code> - Returns top-K values (or tail-K values when k is less than 0)</p> +<li><p><code>each_top_k(int K, Object group, double cmpKey, *)</code> - Returns top-K values (or tail-K values when k is less than 0). 
Refer <a href="topk.html">this article</a> for detail.</p> </li> -</ul> -<p><a href="https://github.com/myui/hivemall/wiki/Efficient-Top-k-computation-on-Apache-Hive-using-Hivemall-UDTF" target="_blank">https://github.com/myui/hivemall/wiki/Efficient-Top-k-computation-on-Apache-Hive-using-Hivemall-UDTF</a></p> -<ul> -<li><code>generate_series(const int|bigint start, const int|bigint end)</code> - Generate a series of values, from start to end</li> -</ul> -<pre><code class="lang-sql">WITH dual as ( - <span class="hljs-keyword">select</span> <span class="hljs-number">1</span> -) -<span class="hljs-keyword">select</span> generate_series(<span class="hljs-number">1</span>,<span class="hljs-number">9</span>) -<span class="hljs-keyword">from</span> dual; - -1 -2 -3 -4 -5 -6 -7 -8 -9 +<li><p><code>generate_series(const int|bigint start, const int|bigint end)</code> - Generate a series of values, from start to end</p> +<pre><code class="lang-sql"> <span class="hljs-keyword">select</span> generate_series(<span class="hljs-number">1</span>,<span class="hljs-number">9</span>); + + 1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9 </code></pre> -<p>A similar function to PostgreSQL's <code>generate_serics</code>. -<a href="http://www.postgresql.org/docs/current/static/functions-srf.html" target="_blank">http://www.postgresql.org/docs/current/static/functions-srf.html</a></p> -<ul> -<li><code>x_rank(KEY)</code> - Generates a pseudo sequence number starting from 1 for each key -<div id="page-footer"><hr><p><sub><font color="gray"> +<p> A similar function to PostgreSQL's <code>generate_serics</code>. 
+ <a href="http://www.postgresql.org/docs/current/static/functions-srf.html" target="_blank">http://www.postgresql.org/docs/current/static/functions-srf.html</a></p> +</li> +<li><p><code>x_rank(KEY)</code> - Generates a pseudo sequence number starting from 1 for each key</p> +</li> +</ul> +<p><div id="page-footer"><hr><!-- + Licensed to the Apache Software Foundation (ASF) under one + or more contributor license agreements. See the NOTICE file + distributed with this work for additional information + regarding copyright ownership. The ASF licenses this file + to you under the Apache License, Version 2.0 (the + "License"); you may not use this file except in compliance + with the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, + software distributed under the License is distributed on an + "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + KIND, either express or implied. See the License for the + specific language governing permissions and limitations + under the License. +--> +<p><sub><font color="gray"> Apache Hivemall is an effort undergoing incubation at The Apache Software Foundation (ASF), sponsored by the Apache Incubator. 
</font></sub></p> -</div></li> -</ul> +</div></p> </section> @@ -1874,7 +1916,7 @@ Apache Hivemall is an effort undergoing incubation at The Apache Software Founda <script> var gitbook = gitbook || []; gitbook.push(function() { - gitbook.page.hasChanged({"page":{"title":"List of generic Hivemall functions","level":"2.1","depth":1,"next":{"title":"Efficient Top-K query processing","level":"2.2","depth":1,"path":"misc/topk.md","ref":"misc/topk.md","articles":[]},"previous":{"title":"Map-side Join causes ClassCastException on Tez","level":"1.5.5","depth":2,"path":"troubleshooting/mapjoin_classcastex.md","ref":"troubleshooting/mapjoin_classcastex.md","articles":[]},"dir":"ltr"},"config":{"plugins":["theme-api","edit-link","github","splitter","sitemap","etoc","callouts","toggle-chapters","anchorjs","codeblock-filename","expandable-chapters","multipart","codeblock-filename","katex","emphasize","localized-footer"],"styles":{"website":"styles/website.css","pdf":"styles/pdf.css","epub":"styles/epub.css","mobi":"styles/mobi.css","ebook":"styles/ebook.css","print":"styles/print.css"},"pluginsConfig":{"emphasize":{},"callouts":{},"etoc":{"maxdepth":3,"mindepth":1,"notoc":true},"github":{"url":"https://github 
.com/apache/incubator-hivemall/"},"splitter":{},"search":{},"downloadpdf":{"base":"https://github.com/apache/incubator-hivemall/docs/gitbook","label":"PDF","multilingual":false},"multipart":{},"localized-footer":{"filename":"FOOTER.md"},"lunr":{"maxIndexSize":1000000,"ignoreSpecialCharacters":false},"katex":{},"fontsettings":{"theme":"white","family":"sans","size":2,"font":"sans"},"highlight":{},"codeblock-filename":{},"sitemap":{"hostname":"http://hivemall.incubator.apache.org/"},"theme-api":{"languages":[],"split":false,"theme":"dark"},"sharing":{"facebook":true,"twitter":true,"google":false,"weibo":false,"instapaper":false,"vk":false,"all":["facebook","google","twitter","weibo","instapaper"]},"edit-link":{"label":"Edit","base":"https://github.com/apache/incubator-hivemall/docs/gitbook"},"theme-default":{"styles":{"website":"styles/website.css","pdf":"styles/pdf.css","epub":"styles/epub.css","mobi":"styles/mobi.css","ebook":"styles/ebook.css","print":"styles/print.css"},"showLevel ":true},"anchorjs":{"selector":"h1,h2,h3,*:not(.callout) > h4,h5"},"toggle-chapters":{},"expandable-chapters":{}},"theme":"default","pdf":{"pageNumbers":true,"fontSize":12,"fontFamily":"Arial","paperSize":"a4","chapterMark":"pagebreak","pageBreaksBefore":"/","margin":{"right":62,"left":62,"top":56,"bottom":56}},"structure":{"langs":"LANGS.md","readme":"README.md","glossary":"GLOSSARY.md","summary":"SUMMARY.md"},"variables":{},"title":"Hivemall User Manual","links":{"sidebar":{"<i class=\"fa fa-home\"></i> Home":"http://hivemall.incubator.apache.org/"}},"gitbook":"3.x.x","description":"User Manual for Apache Hivemall"},"file":{"path":"misc/generic_funcs.md","mtime":"2016-11-12T07:18:00.000Z","type":"markdown"},"gitbook":{"version":"3.2.2","time":"2016-11-14T10:40:22.987Z"},"basePath":"..","book":{"language":""}}); + gitbook.page.hasChanged({"page":{"title":"List of generic Hivemall functions","level":"2.1","depth":1,"next":{"title":"Efficient Top-K query 
processing","level":"2.2","depth":1,"path":"misc/topk.md","ref":"misc/topk.md","articles":[]},"previous":{"title":"Map-side Join causes ClassCastException on Tez","level":"1.5.5","depth":2,"path":"troubleshooting/mapjoin_classcastex.md","ref":"troubleshooting/mapjoin_classcastex.md","articles":[]},"dir":"ltr"},"config":{"plugins":["theme-api","edit-link","github","splitter","sitemap","etoc","callouts","toggle-chapters","anchorjs","codeblock-filename","expandable-chapters","multipart","codeblock-filename","katex","emphasize","localized-footer"],"styles":{"website":"styles/website.css","pdf":"styles/pdf.css","epub":"styles/epub.css","mobi":"styles/mobi.css","ebook":"styles/ebook.css","print":"styles/print.css"},"pluginsConfig":{"emphasize":{},"callouts":{},"etoc":{"maxdepth":3,"mindepth":1,"notoc":true},"github":{"url":"https://github .com/apache/incubator-hivemall/"},"splitter":{},"search":{},"downloadpdf":{"base":"https://github.com/apache/incubator-hivemall/docs/gitbook","label":"PDF","multilingual":false},"multipart":{},"localized-footer":{"filename":"FOOTER.md"},"lunr":{"maxIndexSize":1000000,"ignoreSpecialCharacters":false},"katex":{},"fontsettings":{"theme":"white","family":"sans","size":2,"font":"sans"},"highlight":{},"codeblock-filename":{},"sitemap":{"hostname":"http://hivemall.incubator.apache.org/"},"theme-api":{"languages":[],"split":false,"theme":"dark"},"sharing":{"facebook":true,"twitter":true,"google":false,"weibo":false,"instapaper":false,"vk":false,"all":["facebook","google","twitter","weibo","instapaper"]},"edit-link":{"label":"Edit","base":"https://github.com/apache/incubator-hivemall/docs/gitbook"},"theme-default":{"styles":{"website":"styles/website.css","pdf":"styles/pdf.css","epub":"styles/epub.css","mobi":"styles/mobi.css","ebook":"styles/ebook.css","print":"styles/print.css"},"showLevel ":true},"anchorjs":{"selector":"h1,h2,h3,*:not(.callout) > 
h4,h5"},"toggle-chapters":{},"expandable-chapters":{}},"theme":"default","pdf":{"pageNumbers":true,"fontSize":12,"fontFamily":"Arial","paperSize":"a4","chapterMark":"pagebreak","pageBreaksBefore":"/","margin":{"right":62,"left":62,"top":56,"bottom":56}},"structure":{"langs":"LANGS.md","readme":"README.md","glossary":"GLOSSARY.md","summary":"SUMMARY.md"},"variables":{},"title":"Hivemall User Manual","links":{"sidebar":{"<i class=\"fa fa-home\"></i> Home":"http://hivemall.incubator.apache.org/"}},"gitbook":"3.x.x","description":"User Manual for Apache Hivemall"},"file":{"path":"misc/generic_funcs.md","mtime":"2016-11-17T11:12:15.000Z","type":"markdown"},"gitbook":{"version":"3.2.2","time":"2016-11-17T12:16:14.647Z"},"basePath":"..","book":{"language":""}}); }); </script> </div> http://git-wip-us.apache.org/repos/asf/incubator-hivemall-site/blob/68241a08/userguide/misc/tokenizer.html ---------------------------------------------------------------------- diff --git a/userguide/misc/tokenizer.html b/userguide/misc/tokenizer.html index e0d3959..af62ee8 100644 --- a/userguide/misc/tokenizer.html +++ b/userguide/misc/tokenizer.html @@ -999,6 +999,21 @@ </li> + <li class="chapter " data-level="5.6" data-path="../binaryclass/titanic_rf.html"> + + <a href="../binaryclass/titanic_rf.html"> + + + <b>5.6.</b> + + Kaggle Titanic Tutorial + + </a> + + + + </li> + @@ -1671,7 +1686,25 @@ <p>["kuromoji","使う","分かち書き","テスト","第","二","引数","normal","search","extended","指定","デフォルト","normal","モード"]</p> </blockquote> <p>For detailed APIs, please refer Javadoc of <a href="https://lucene.apache.org/core/5_3_1/analyzers-kuromoji/org/apache/lucene/analysis/ja/JapaneseAnalyzer.html" target="_blank">JapaneseAnalyzer</a> as well. -<div id="page-footer"><hr><p><sub><font color="gray"> +<div id="page-footer"><hr><!-- + Licensed to the Apache Software Foundation (ASF) under one + or more contributor license agreements. 
See the NOTICE file + distributed with this work for additional information + regarding copyright ownership. The ASF licenses this file + to you under the Apache License, Version 2.0 (the + "License"); you may not use this file except in compliance + with the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, + software distributed under the License is distributed on an + "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + KIND, either express or implied. See the License for the + specific language governing permissions and limitations + under the License. +--> +<p><sub><font color="gray"> Apache Hivemall is an effort undergoing incubation at The Apache Software Foundation (ASF), sponsored by the Apache Incubator. </font></sub></p> </div></p> @@ -1708,7 +1741,7 @@ Apache Hivemall is an effort undergoing incubation at The Apache Software Founda <script> var gitbook = gitbook || []; gitbook.push(function() { - gitbook.page.hasChanged({"page":{"title":"English/Japanese Text Tokenizer","level":"2.3","depth":1,"next":{"title":"Feature Scaling","level":"3.1","depth":1,"path":"ft_engineering/scaling.md","ref":"ft_engineering/scaling.md","articles":[]},"previous":{"title":"Efficient Top-K query 
processing","level":"2.2","depth":1,"path":"misc/topk.md","ref":"misc/topk.md","articles":[]},"dir":"ltr"},"config":{"plugins":["theme-api","edit-link","github","splitter","sitemap","etoc","callouts","toggle-chapters","anchorjs","codeblock-filename","expandable-chapters","multipart","codeblock-filename","katex","emphasize","localized-footer"],"styles":{"website":"styles/website.css","pdf":"styles/pdf.css","epub":"styles/epub.css","mobi":"styles/mobi.css","ebook":"styles/ebook.css","print":"styles/print.css"},"pluginsConfig":{"emphasize":{},"callouts":{},"etoc":{"maxdepth":3,"mindepth":1,"notoc":true},"github":{"url":"https://github.com/apache/incubator-hivemall/"},"splitter":{},"search":{},"d ownloadpdf":{"base":"https://github.com/apache/incubator-hivemall/docs/gitbook","label":"PDF","multilingual":false},"multipart":{},"localized-footer":{"filename":"FOOTER.md"},"lunr":{"maxIndexSize":1000000,"ignoreSpecialCharacters":false},"katex":{},"fontsettings":{"theme":"white","family":"sans","size":2,"font":"sans"},"highlight":{},"codeblock-filename":{},"sitemap":{"hostname":"http://hivemall.incubator.apache.org/"},"theme-api":{"languages":[],"split":false,"theme":"dark"},"sharing":{"facebook":true,"twitter":true,"google":false,"weibo":false,"instapaper":false,"vk":false,"all":["facebook","google","twitter","weibo","instapaper"]},"edit-link":{"label":"Edit","base":"https://github.com/apache/incubator-hivemall/docs/gitbook"},"theme-default":{"styles":{"website":"styles/website.css","pdf":"styles/pdf.css","epub":"styles/epub.css","mobi":"styles/mobi.css","ebook":"styles/ebook.css","print":"styles/print.css"},"showLevel":true},"anchorjs":{"selector":"h1,h2,h3,*:not(.callout) > h4, 
h5"},"toggle-chapters":{},"expandable-chapters":{}},"theme":"default","pdf":{"pageNumbers":true,"fontSize":12,"fontFamily":"Arial","paperSize":"a4","chapterMark":"pagebreak","pageBreaksBefore":"/","margin":{"right":62,"left":62,"top":56,"bottom":56}},"structure":{"langs":"LANGS.md","readme":"README.md","glossary":"GLOSSARY.md","summary":"SUMMARY.md"},"variables":{},"title":"Hivemall User Manual","links":{"sidebar":{"<i class=\"fa fa-home\"></i> Home":"http://hivemall.incubator.apache.org/"}},"gitbook":"3.x.x","description":"User Manual for Apache Hivemall"},"file":{"path":"misc/tokenizer.md","mtime":"2016-11-12T07:18:00.000Z","type":"markdown"},"gitbook":{"version":"3.2.2","time":"2016-11-14T10:40:22.987Z"},"basePath":"..","book":{"language":""}}); + gitbook.page.hasChanged({"page":{"title":"English/Japanese Text Tokenizer","level":"2.3","depth":1,"next":{"title":"Feature Scaling","level":"3.1","depth":1,"path":"ft_engineering/scaling.md","ref":"ft_engineering/scaling.md","articles":[]},"previous":{"title":"Efficient Top-K query processing","level":"2.2","depth":1,"path":"misc/topk.md","ref":"misc/topk.md","articles":[]},"dir":"ltr"},"config":{"plugins":["theme-api","edit-link","github","splitter","sitemap","etoc","callouts","toggle-chapters","anchorjs","codeblock-filename","expandable-chapters","multipart","codeblock-filename","katex","emphasize","localized-footer"],"styles":{"website":"styles/website.css","pdf":"styles/pdf.css","epub":"styles/epub.css","mobi":"styles/mobi.css","ebook":"styles/ebook.css","print":"styles/print.css"},"pluginsConfig":{"emphasize":{},"callouts":{},"etoc":{"maxdepth":3,"mindepth":1,"notoc":true},"github":{"url":"https://github.com/apache/incubator-hivemall/"},"splitter":{},"search":{},"d 
ownloadpdf":{"base":"https://github.com/apache/incubator-hivemall/docs/gitbook","label":"PDF","multilingual":false},"multipart":{},"localized-footer":{"filename":"FOOTER.md"},"lunr":{"maxIndexSize":1000000,"ignoreSpecialCharacters":false},"katex":{},"fontsettings":{"theme":"white","family":"sans","size":2,"font":"sans"},"highlight":{},"codeblock-filename":{},"sitemap":{"hostname":"http://hivemall.incubator.apache.org/"},"theme-api":{"languages":[],"split":false,"theme":"dark"},"sharing":{"facebook":true,"twitter":true,"google":false,"weibo":false,"instapaper":false,"vk":false,"all":["facebook","google","twitter","weibo","instapaper"]},"edit-link":{"label":"Edit","base":"https://github.com/apache/incubator-hivemall/docs/gitbook"},"theme-default":{"styles":{"website":"styles/website.css","pdf":"styles/pdf.css","epub":"styles/epub.css","mobi":"styles/mobi.css","ebook":"styles/ebook.css","print":"styles/print.css"},"showLevel":true},"anchorjs":{"selector":"h1,h2,h3,*:not(.callout) > h4, h5"},"toggle-chapters":{},"expandable-chapters":{}},"theme":"default","pdf":{"pageNumbers":true,"fontSize":12,"fontFamily":"Arial","paperSize":"a4","chapterMark":"pagebreak","pageBreaksBefore":"/","margin":{"right":62,"left":62,"top":56,"bottom":56}},"structure":{"langs":"LANGS.md","readme":"README.md","glossary":"GLOSSARY.md","summary":"SUMMARY.md"},"variables":{},"title":"Hivemall User Manual","links":{"sidebar":{"<i class=\"fa fa-home\"></i> Home":"http://hivemall.incubator.apache.org/"}},"gitbook":"3.x.x","description":"User Manual for Apache Hivemall"},"file":{"path":"misc/tokenizer.md","mtime":"2016-11-16T08:39:12.000Z","type":"markdown"},"gitbook":{"version":"3.2.2","time":"2016-11-17T12:16:14.647Z"},"basePath":"..","book":{"language":""}}); }); </script> </div> http://git-wip-us.apache.org/repos/asf/incubator-hivemall-site/blob/68241a08/userguide/misc/topk.html ---------------------------------------------------------------------- diff --git a/userguide/misc/topk.html 
b/userguide/misc/topk.html index 92a9d2f..8d4690c 100644 --- a/userguide/misc/topk.html +++ b/userguide/misc/topk.html @@ -999,6 +999,21 @@ </li> + <li class="chapter " data-level="5.6" data-path="../binaryclass/titanic_rf.html"> + + <a href="../binaryclass/titanic_rf.html"> + + + <b>5.6.</b> + + Kaggle Titanic Tutorial + + </a> + + + + </li> + @@ -1652,7 +1667,25 @@ <p><code>each_top_k(int k, ANY group, double value, arg1, arg2, ..., argN)</code> returns the top-k records for each <code>group</code>. It returns a relation consisting of <code>(int rank, double value, arg1, arg2, ..., argN)</code>.</p> <p>This function is particularly useful for applying a similarity/distance function where the computational complexity is <strong>O(nm)</strong>.</p> <p><code>each_top_k</code> is very fast when compared to other methods running top-k queries (e.g., <a href="https://ragrawal.wordpress.com/2011/11/18/extract-top-n-records-in-each-group-in-hadoophive/" target="_blank"><code>rank/distribute by</code></a>) in Hive.</p> -<h2 id="caution">Caution</h2> +<!-- toc --><div id="toc" class="toc"> + +<ul> +<li><a href="#caution">Caution</a></li> +<li><a href="#usage">Usage</a><ul> +<li><a href="#efficient-top-k-query-processing-using-eachtopk">Efficient Top-k Query Processing using <code>each_top_k</code></a></li> +<li><a href="#top-k-clicks">top-k clicks</a></li> +<li><a href="#top-k-similarity-computation">Top-k similarity computation</a><ul> +<li><a href="#explicit-grouping-using-distribute-by-and-sort-by">Explicit grouping using <code>distribute by</code> and <code>sort by</code></a></li> +<li><a href="#parallelization-of-similarity-computation-using-with-clause">Parallelization of similarity computation using WITH clause</a></li> +</ul> +</li> +<li><a href="#tail-k">tail-K</a></li> +</ul> +</li> +</ul> + +</div><!-- tocstop --> +<h1 id="caution">Caution</h1> <ul> <li><code>each_top_k</code> is supported in Hivemall v0.3.2-3 or later.</li> <li>This UDTF assumes that input records
are sorted by <code>group</code>. Use <code>DISTRIBUTE BY group SORT BY group</code> to ensure that. Or, you can use <code>LEFT OUTER JOIN</code> for certain cases.</li> @@ -1662,7 +1695,8 @@ <li>If k is less than 0, reverse order is used and <code>tail-K</code> records are returned for each <code>group</code>.</li> <li>Note that this function returns <a href="http://www.michaelpollmeier.com/selecting-top-k-items-from-a-list-efficiently-in-java-groovy/" target="_blank">a pseudo ranking</a> for top-k. It always returns <code>at-most K</code> records for each group. The ranking scheme is similar to <code>dense_rank</code> but slightly different in certain cases.</li> </ul> -<h1 id="efficient-top-k-query-processing-using-eachtopk">Efficient Top-k Query Processing using <code>each_top_k</code></h1> +<h1 id="usage">Usage</h1> +<h2 id="efficient-top-k-query-processing-using-eachtopk">Efficient Top-k Query Processing using <code>each_top_k</code></h2> <p>Efficient processing of Top-k queries is a crucial requirement in many interactive environments that involve massive amounts of data. Our Hive extension <code>each_top_k</code> helps run Top-k processing efficiently.</p> <ul> @@ -1778,7 +1812,6 @@ Do null handling like <code>if(value is null, -1, value)</code> to avoid null.</ <p>If <code>k</code> is less than 0, reverse order is used and tail-K records are returned for each <code>group</code>.</p> <p>The ranking semantics of <code>each_top_k</code> follow SQL's <code>dense_rank</code> and then limit results by <code>k</code>. </p> <div class="panel panel-warning"><div class="panel-heading"><h3 class="panel-title" id="caution"><i class="fa fa-exclamation-triangle"></i> Caution</h3></div><div class="panel-body"><p><code>each_top_k</code> is beneficial when the number of grouping keys is large.
If the number of grouping keys is not so large (e.g., less than 100), consider using <code>rank() over</code> instead.</p></div></div> -<h1 id="usage">Usage</h1> <h2 id="top-k-clicks">top-k clicks</h2> <p><a href="http://stackoverflow.com/questions/9390698/hive-getting-top-n-records-in-group-by-query/32559050#32559050" target="_blank">http://stackoverflow.com/questions/9390698/hive-getting-top-n-records-in-group-by-query/32559050#32559050</a></p> <pre><code class="lang-sql"><span class="hljs-keyword">set</span> hivevar:k=<span class="hljs-number">5</span>; @@ -2422,7 +2455,7 @@ Apache Hivemall is an effort undergoing incubation at The Apache Software Founda <script> var gitbook = gitbook || []; gitbook.push(function() { - gitbook.page.hasChanged({"page":{"title":"Efficient Top-K query processing","level":"2.2","depth":1,"next":{"title":"English/Japanese Text Tokenizer","level":"2.3","depth":1,"path":"misc/tokenizer.md","ref":"misc/tokenizer.md","articles":[]},"previous":{"title":"List of generic Hivemall functions","level":"2.1","depth":1,"path":"misc/generic_funcs.md","ref":"misc/generic_funcs.md","articles":[]},"dir":"ltr"},"config":{"plugins":["theme-api","edit-link","github","splitter","sitemap","etoc","callouts","toggle-chapters","anchorjs","codeblock-filename","expandable-chapters","multipart","codeblock-filename","katex","emphasize","localized-footer"],"styles":{"website":"styles/website.css","pdf":"styles/pdf.css","epub":"styles/epub.css","mobi":"styles/mobi.css","ebook":"styles/ebook.css","print":"styles/print.css"},"pluginsConfig":{"emphasize":{},"callouts":{},"etoc":{"maxdepth":3,"mindepth":1,"notoc":true},"github":{"url":"https://github.com/apache/incubator-hivemall/"},"splitt
er":{},"search":{},"downloadpdf":{"base":"https://github.com/apache/incubator-hivemall/docs/gitbook","label":"PDF","multilingual":false},"multipart":{},"localized-footer":{"filename":"FOOTER.md"},"lunr":{"maxIndexSize":1000000,"ignoreSpecialCharacters":false},"katex":{},"fontsettings":{"theme":"white","family":"sans","size":2,"font":"sans"},"highlight":{},"codeblock-filename":{},"sitemap":{"hostname":"http://hivemall.incubator.apache.org/"},"theme-api":{"languages":[],"split":false,"theme":"dark"},"sharing":{"facebook":true,"twitter":true,"google":false,"weibo":false,"instapaper":false,"vk":false,"all":["facebook","google","twitter","weibo","instapaper"]},"edit-link":{"label":"Edit","base":"https://github.com/apache/incubator-hivemall/docs/gitbook"},"theme-default":{"styles":{"website":"styles/website.css","pdf":"styles/pdf.css","epub":"styles/epub.css","mobi":"styles/mobi.css","ebook":"styles/ebook.css","print":"styles/print.css"},"showLevel":true},"anchorjs":{"selector":"h1,h2,h3, *:not(.callout) > h4,h5"},"toggle-chapters":{},"expandable-chapters":{}},"theme":"default","pdf":{"pageNumbers":true,"fontSize":12,"fontFamily":"Arial","paperSize":"a4","chapterMark":"pagebreak","pageBreaksBefore":"/","margin":{"right":62,"left":62,"top":56,"bottom":56}},"structure":{"langs":"LANGS.md","readme":"README.md","glossary":"GLOSSARY.md","summary":"SUMMARY.md"},"variables":{},"title":"Hivemall User Manual","links":{"sidebar":{"<i class=\"fa fa-home\"></i> Home":"http://hivemall.incubator.apache.org/"}},"gitbook":"3.x.x","description":"User Manual for Apache Hivemall"},"file":{"path":"misc/topk.md","mtime":"2016-11-16T08:32:05.000Z","type":"markdown"},"gitbook":{"version":"3.2.2","time":"2016-11-16T08:36:45.392Z"},"basePath":"..","book":{"language":""}}); + gitbook.page.hasChanged({"page":{"title":"Efficient Top-K query processing","level":"2.2","depth":1,"next":{"title":"English/Japanese Text 
Tokenizer","level":"2.3","depth":1,"path":"misc/tokenizer.md","ref":"misc/tokenizer.md","articles":[]},"previous":{"title":"List of generic Hivemall functions","level":"2.1","depth":1,"path":"misc/generic_funcs.md","ref":"misc/generic_funcs.md","articles":[]},"dir":"ltr"},"config":{"plugins":["theme-api","edit-link","github","splitter","sitemap","etoc","callouts","toggle-chapters","anchorjs","codeblock-filename","expandable-chapters","multipart","codeblock-filename","katex","emphasize","localized-footer"],"styles":{"website":"styles/website.css","pdf":"styles/pdf.css","epub":"styles/epub.css","mobi":"styles/mobi.css","ebook":"styles/ebook.css","print":"styles/print.css"},"pluginsConfig":{"emphasize":{},"callouts":{},"etoc":{"maxdepth":3,"mindepth":1,"notoc":true},"github":{"url":"https://github.com/apache/incubator-hivemall/"},"splitt er":{},"search":{},"downloadpdf":{"base":"https://github.com/apache/incubator-hivemall/docs/gitbook","label":"PDF","multilingual":false},"multipart":{},"localized-footer":{"filename":"FOOTER.md"},"lunr":{"maxIndexSize":1000000,"ignoreSpecialCharacters":false},"katex":{},"fontsettings":{"theme":"white","family":"sans","size":2,"font":"sans"},"highlight":{},"codeblock-filename":{},"sitemap":{"hostname":"http://hivemall.incubator.apache.org/"},"theme-api":{"languages":[],"split":false,"theme":"dark"},"sharing":{"facebook":true,"twitter":true,"google":false,"weibo":false,"instapaper":false,"vk":false,"all":["facebook","google","twitter","weibo","instapaper"]},"edit-link":{"label":"Edit","base":"https://github.com/apache/incubator-hivemall/docs/gitbook"},"theme-default":{"styles":{"website":"styles/website.css","pdf":"styles/pdf.css","epub":"styles/epub.css","mobi":"styles/mobi.css","ebook":"styles/ebook.css","print":"styles/print.css"},"showLevel":true},"anchorjs":{"selector":"h1,h2,h3, *:not(.callout) > 
h4,h5"},"toggle-chapters":{},"expandable-chapters":{}},"theme":"default","pdf":{"pageNumbers":true,"fontSize":12,"fontFamily":"Arial","paperSize":"a4","chapterMark":"pagebreak","pageBreaksBefore":"/","margin":{"right":62,"left":62,"top":56,"bottom":56}},"structure":{"langs":"LANGS.md","readme":"README.md","glossary":"GLOSSARY.md","summary":"SUMMARY.md"},"variables":{},"title":"Hivemall User Manual","links":{"sidebar":{"<i class=\"fa fa-home\"></i> Home":"http://hivemall.incubator.apache.org/"}},"gitbook":"3.x.x","description":"User Manual for Apache Hivemall"},"file":{"path":"misc/topk.md","mtime":"2016-11-17T09:58:26.000Z","type":"markdown"},"gitbook":{"version":"3.2.2","time":"2016-11-17T12:16:14.647Z"},"basePath":"..","book":{"language":""}}); }); </script> </div>
