Github user takuti commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/158#discussion_r213941931

--- Diff: docs/gitbook/getting_started/tutorial.md ---
@@ -0,0 +1,493 @@
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one
+  or more contributor license agreements. See the NOTICE file
+  distributed with this work for additional information
+  regarding copyright ownership. The ASF licenses this file
+  to you under the Apache License, Version 2.0 (the
+  "License"); you may not use this file except in compliance
+  with the License. You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing,
+  software distributed under the License is distributed on an
+  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  KIND, either express or implied. See the License for the
+  specific language governing permissions and limitations
+  under the License.
+-->
+
+# Step-by-Step Tutorial on Supervised Learning with Apache Hivemall
+
+<!-- toc -->
+
+## What is Hivemall?
+
+[Apache Hive](https://hive.apache.org/) is a data warehousing solution that enables us to process large-scale data in the form of SQL easily.
Assume that you have a table named `purchase_history` which can be artificially created as:
+
+```sql
+create table if not exists purchase_history
+(id bigint, day_of_week string, gender string, price int, category string, label int)
+;
+```
+
+
+```sql
+insert overwrite table purchase_history
+select 1 as id, "Saturday" as day_of_week, "male" as gender, 600 as price, "book" as category, 1 as label
+union all
+select 2 as id, "Friday" as day_of_week, "female" as gender, 4800 as price, "sports" as category, 0 as label
+union all
+select 3 as id, "Friday" as day_of_week, "other" as gender, 18000 as price, "entertainment" as category, 0 as label
+union all
+select 4 as id, "Thursday" as day_of_week, "male" as gender, 200 as price, "food" as category, 0 as label
+union all
+select 5 as id, "Wednesday" as day_of_week, "female" as gender, 1000 as price, "electronics" as category, 1 as label
+;
+```
+
+The syntax of Hive queries, namely **HiveQL**, is very similar to SQL:
+
+```sql
+select count(1) from purchase_history
+```
+
+> 5
+
+[Apache Hivemall](https://github.com/apache/incubator-hivemall) is a collection of user-defined functions (UDFs) for HiveQL which is strongly optimized for machine learning (ML) and data science.
To give an example, you can efficiently build a logistic regression model with the stochastic gradient descent (SGD) optimization by issuing the following ~10 lines of query:
+
+```sql
+SELECT
+  train_classifier(
+    features,
+    label,
+    '-loss_function logloss -optimizer SGD'
+  ) as (feature, weight)
+FROM
+  training
+;
+```
+
+
+On the TD console, Hivemall function [`hivemall_version()`](http://hivemall.incubator.apache.org/userguide/misc/funcs.html#others) shows current Hivemall version that is available on TD, for example:
+
+```sql
+select hivemall_version()
+```
+
+> "0.5.1-20180703-SNAPSHOT-31924dc" (as of July 23, 2018)
+
+Below we list ML and relevant problems that Hivemall and TD can solve:
+
+- Binary and multi-class classification
--- End diff --

Just put in as many links as you can :)

- [Recommendation](http://hivemall.incubator.apache.org/userguide/recommend/cf.html)
- [Anomaly detection](http://hivemall.incubator.apache.org/userguide/anomaly/lof.html)
- [Clustering](http://hivemall.incubator.apache.org/userguide/clustering/lda.html)
- [NLP](http://hivemall.incubator.apache.org/userguide/misc/tokenizer.html)
- etc...
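
For readers of the quoted tutorial, it may help to see what "logistic regression with SGD" producing `(feature, weight)` rows means conceptually. The following is only a minimal Python sketch under stated assumptions, not Hivemall's implementation: the function names `sigmoid` and `train_logistic_sgd`, the learning rate, the epoch count, and the toy rows (with illustrative feature names such as `category#book`) are all hypothetical.

```python
import math
from collections import defaultdict


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic_sgd(rows, eta=0.1, epochs=50):
    """Fit logistic regression by plain SGD on the log loss.

    rows: list of (features, label) pairs, where features is a dict
    mapping a feature name to a float value and label is 0 or 1.
    Returns a dict mapping feature name to learned weight.
    """
    weights = defaultdict(float)
    for _ in range(epochs):
        for features, label in rows:
            z = sum(weights[name] * value for name, value in features.items())
            # Gradient of the log loss w.r.t. weight j is (p - y) * x_j.
            error = sigmoid(z) - label
            for name, value in features.items():
                weights[name] -= eta * error * value
    return dict(weights)


# Hypothetical toy rows, loosely modeled on the purchase_history example.
rows = [
    ({"bias": 1.0, "category#book": 1.0}, 1),
    ({"bias": 1.0, "category#sports": 1.0}, 0),
    ({"bias": 1.0, "category#food": 1.0}, 0),
    ({"bias": 1.0, "category#electronics": 1.0}, 1),
]

model = train_logistic_sgd(rows)
# One (feature, weight) pair per line, analogous in shape to the query output.
for feature, weight in sorted(model.items()):
    print(feature, round(weight, 3))
```

Features seen only with label 1 end up with positive weights and those seen only with label 0 with negative weights, which is the kind of per-feature model table the `train_classifier` query is described as returning.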
---