GitHub user chezou commented on a diff in the pull request:

    https://github.com/apache/incubator-hivemall/pull/158#discussion_r213962767
  
    --- Diff: docs/gitbook/getting_started/tutorial.md ---
    @@ -0,0 +1,493 @@
    +<!--
    +  Licensed to the Apache Software Foundation (ASF) under one
    +  or more contributor license agreements.  See the NOTICE file
    +  distributed with this work for additional information
    +  regarding copyright ownership.  The ASF licenses this file
    +  to you under the Apache License, Version 2.0 (the
    +  "License"); you may not use this file except in compliance
    +  with the License.  You may obtain a copy of the License at
    +
    +    http://www.apache.org/licenses/LICENSE-2.0
    +
    +  Unless required by applicable law or agreed to in writing,
    +  software distributed under the License is distributed on an
    +  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    +  KIND, either express or implied.  See the License for the
    +  specific language governing permissions and limitations
    +  under the License.
    +-->
    +
    +# Step-by-Step Tutorial on Supervised Learning with Apache Hivemall
    +
    +<!-- toc -->
    +
    +## What is Hivemall?
    +
    +[Apache Hive](https://hive.apache.org/) is a data warehousing solution that lets us process large-scale data easily with SQL-like queries. Assume that you have a table named `purchase_history`, which can be created artificially as:
    +
    +```sql
    +create table if not exists purchase_history
    +(id bigint, day_of_week string, gender string, price int, category string, label int)
    +;
    +```
    +
    +
    +```sql
    +insert overwrite table purchase_history
    +select 1 as id, "Saturday" as day_of_week, "male" as gender, 600 as price, "book" as category, 1 as label
    +union all
    +select 2 as id, "Friday" as day_of_week, "female" as gender, 4800 as price, "sports" as category, 0 as label
    +union all
    +select 3 as id, "Friday" as day_of_week, "other" as gender, 18000 as price, "entertainment" as category, 0 as label
    +union all
    +select 4 as id, "Thursday" as day_of_week, "male" as gender, 200 as price, "food" as category, 0 as label
    +union all
    +select 5 as id, "Wednesday" as day_of_week, "female" as gender, 1000 as price, "electronics" as category, 1 as label
    +;
    +```
    +
    +The syntax of Hive queries, namely **HiveQL**, is very similar to SQL:
    +
    +```sql
    +select count(1) from purchase_history
    +```
    +
    +> 5
    +
    +[Apache Hivemall](https://github.com/apache/incubator-hivemall) is a collection of user-defined functions (UDFs) for HiveQL that are strongly optimized for machine learning (ML) and data science. For example, you can efficiently build a logistic regression model with stochastic gradient descent (SGD) optimization by issuing the following ~10-line query:
    +
    +```sql
    +SELECT
    +  train_classifier(
    +    features,
    +    label,
    +    '-loss_function logloss -optimizer SGD'
    +  ) as (feature, weight)
    +FROM
    +  training
    +;
    +```
    +
    +
    +On the TD console, the Hivemall function [`hivemall_version()`](http://hivemall.incubator.apache.org/userguide/misc/funcs.html#others) shows the Hivemall version currently available on TD, for example:
    +
    +```sql
    +select hivemall_version()
    +```
    +
    +> "0.5.1-20180703-SNAPSHOT-31924dc" (as of July 23, 2018)
    +
    +Below we list the ML and related problems that Hivemall and TD can solve:
    +
    +- Binary and multi-class classification
    +- Regression
    +- Recommendation
    +- Anomaly detection
    +- Natural language processing
    +- Clustering (i.e., topic modeling)
    +- Data sketching
    +- Evaluation
    +
    +Our [YouTube demo video](https://www.youtube.com/watch?v=cMUsuA9KZ_c) gives a helpful overview of Hivemall.
    +
    +This tutorial explains the basic usage of Hivemall through examples of supervised learning: a simple regressor and a binary classifier.
    +
    +## Binary classification
    +
    +Imagine a scenario in which we would like to build a binary classifier from the mock `purchase_history` data and predict unforeseen purchases in order to conduct a new campaign effectively:
    +
    +| day\_of\_week | gender | price | category | label |
    +|:---:|:---:|:---:|:---:|:---|
    +|Saturday | male | 600 | book | 1 |
    +|Friday | female | 4800 | sports | 0 |
    +|Friday | other | 18000  | entertainment | 0 |
    +|Thursday | male | 200 | food | 0 |
    +|Wednesday | female | 1000 | electronics | 1 |
    +
    +Use the Hivemall [`train_classifier()`](http://hivemall.incubator.apache.org/userguide/misc/funcs.html#binary-classification) UDF to tackle the problem as follows.
    +
    +### Step 1. Feature representation
    +
    +First of all, we have to convert the records into pairs of a feature vector and the corresponding target value. Here, Hivemall requires you to represent input features in a specific format.
    +
    +To be more precise, Hivemall represents a single feature as a concatenation of an **index** (i.e., **name**) and its **value**:
    +
    +- Quantitative feature: `<index>:<value>`
    +  - e.g., `price:600.0`
    +- Categorical feature: `<index>#<value>`
    +  - e.g., `gender#male`
    +
    +Each of those features is a string value in Hive, and a "feature vector" is an array of such strings, like:
    +
    +```
    +["price:600.0", "day of week#Saturday", "gender#male", "category#book"]
    +```
    +
    +Therefore, what we first need to do is convert the records into arrays of feature strings. The Hivemall functions [`quantitative_features()`](https://hivemall.incubator.apache.org/userguide/getting_started/input-format.html#quantitative-features), [`categorical_features()`](https://hivemall.incubator.apache.org/userguide/getting_started/input-format.html#categorical-features) and [`array_concat()`](https://hivemall.incubator.apache.org/userguide/misc/generic_funcs.html#array) provide a simple way to create the pairs of feature vector and target value:
    +
    +```sql
    +create table if not exists training
    +(id bigint, features array<string>, label int)
    +;
    +```
    +
    +```sql
    +insert overwrite table training
    +select
    +  id,
    +  array_concat( -- concatenate two arrays of quantitative and categorical features into a single array
    +    quantitative_features(
    +      array("price"), -- quantitative feature names
    +      price -- corresponding column names
    +    ),
    +    categorical_features(
    +      array("day of week", "gender", "category"), -- categorical feature names
    +      day_of_week, gender, category -- corresponding column names
    +    )
    +  ) as features,
    +  label
    +from
    +  purchase_history
    +;
    +```
    +
    +|id | features |  label |
    +|:---:|:---|:---|
    +|1 |["price:600.0","day of week#Saturday","gender#male","category#book"] | 1 |
    +|2 |["price:4800.0","day of week#Friday","gender#female","category#sports"] | 0 |
    +|3 |["price:18000.0","day of week#Friday","gender#other","category#entertainment"] | 0 |
    +|4 |["price:200.0","day of week#Thursday","gender#male","category#food"] | 0 |
    +|5 |["price:1000.0","day of week#Wednesday","gender#female","category#electronics"] | 1 |
    +
    +The output table `training` will be used directly as input to Hivemall's ML functions in the next step.
    --- End diff --
    
    Same as above. Will use CTAS (`CREATE TABLE ... AS SELECT`).
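
    For reference, a CTAS version of the `training` step might look roughly like the sketch below. It reuses only the functions and column names already in the diff, and it assumes that `purchase_history` includes the `gender` column that the `insert` statement populates. With CTAS the column types are inferred from the `select`, so the separate `create table` statement would no longer be needed.

    ```sql
    -- Sketch only: CTAS variant of the `training` table from the diff above.
    -- Assumes `purchase_history` has columns
    -- (id, day_of_week, gender, price, category, label).
    create table training as
    select
      id,
      array_concat( -- merge quantitative and categorical features into a single array
        quantitative_features(
          array("price"), -- quantitative feature names
          price -- corresponding columns
        ),
        categorical_features(
          array("day of week", "gender", "category"), -- categorical feature names
          day_of_week, gender, category -- corresponding columns
        )
      ) as features,
      label
    from
      purchase_history
    ;
    ```

    One caveat: a plain CTAS fails if `training` already exists, so re-running the tutorial would need a `drop table if exists training` first.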

