Repository: incubator-hivemall Updated Branches: refs/heads/master a31d0aab3 -> 211c28036
Close #77: [HIVEMALL-98] Feature binning documents Project: http://git-wip-us.apache.org/repos/asf/incubator-hivemall/repo Commit: http://git-wip-us.apache.org/repos/asf/incubator-hivemall/commit/211c2803 Tree: http://git-wip-us.apache.org/repos/asf/incubator-hivemall/tree/211c2803 Diff: http://git-wip-us.apache.org/repos/asf/incubator-hivemall/diff/211c2803 Branch: refs/heads/master Commit: 211c28036e4a7e7549b3e21fae723f207d85aa09 Parents: a31d0aa Author: Ryuichi Ito <[email protected]> Authored: Mon May 8 17:39:44 2017 +0900 Committer: myui <[email protected]> Committed: Mon May 8 17:39:44 2017 +0900 ---------------------------------------------------------------------- docs/gitbook/SUMMARY.md | 9 +- docs/gitbook/ft_engineering/binning.md | 162 +++++++++++++++++++ .../gitbook/ft_engineering/feature_selection.md | 155 ------------------ docs/gitbook/ft_engineering/scaling.md | 4 +- docs/gitbook/ft_engineering/selection.md | 155 ++++++++++++++++++ docs/gitbook/ft_engineering/vectorization.md | 61 +++++++ docs/gitbook/ft_engineering/vectorizer.md | 61 ------- 7 files changed, 385 insertions(+), 222 deletions(-) ---------------------------------------------------------------------- http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/211c2803/docs/gitbook/SUMMARY.md ---------------------------------------------------------------------- diff --git a/docs/gitbook/SUMMARY.md b/docs/gitbook/SUMMARY.md index 3d035d7..809a548 100644 --- a/docs/gitbook/SUMMARY.md +++ b/docs/gitbook/SUMMARY.md @@ -55,14 +55,13 @@ * [Feature Scaling](ft_engineering/scaling.md) * [Feature Hashing](ft_engineering/hashing.md) -* [TF-IDF calculation](ft_engineering/tfidf.md) - +* [Feature Selection](ft_engineering/selection.md) +* [Feature Binning](ft_engineering/binning.md) +* [TF-IDF Calculation](ft_engineering/tfidf.md) * [FEATURE TRANSFORMATION](ft_engineering/ft_trans.md) - * [Vectorize Features](ft_engineering/vectorizer.md) + * [Feature 
Vectorization](ft_engineering/vectorization.md) * [Quantify non-number features](ft_engineering/quantify.md) -* [Feature selection](ft_engineering/feature_selection.md) - ## Part IV - Evaluation * [Statistical evaluation of a prediction model](eval/stat_eval.md) http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/211c2803/docs/gitbook/ft_engineering/binning.md ---------------------------------------------------------------------- diff --git a/docs/gitbook/ft_engineering/binning.md b/docs/gitbook/ft_engineering/binning.md new file mode 100644 index 0000000..cd1ecbb --- /dev/null +++ b/docs/gitbook/ft_engineering/binning.md @@ -0,0 +1,162 @@ +<!-- + Licensed to the Apache Software Foundation (ASF) under one + or more contributor license agreements. See the NOTICE file + distributed with this work for additional information + regarding copyright ownership. The ASF licenses this file + to you under the Apache License, Version 2.0 (the + "License"); you may not use this file except in compliance + with the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, + software distributed under the License is distributed on an + "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + KIND, either express or implied. See the License for the + specific language governing permissions and limitations + under the License. +--> + +Feature binning is a technique for converting quantitative variables into categorical values. +It groups quantitative values into a pre-defined number of bins. 
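As a rough sketch of the idea — not the Hivemall implementation; the exact quantile method and boundary handling here are assumptions — equal-frequency binning can be illustrated in Python:

```python
from bisect import bisect_right
from statistics import quantiles

# Ages from the sample `users` table used on this page.
ages = [20, 22, 35, 55, 15, 46, 20]

def build_bins_sketch(values, num_of_bins):
    # Equal-frequency separation values (a hypothetical stand-in for the build_bins UDAF).
    return quantiles(values, n=num_of_bins, method="inclusive")

def feature_binning_sketch(value, seps):
    # Bin ID = how many separation values are <= the given value.
    return bisect_right(seps, value)

seps = build_bins_sketch(ages, 3)   # -> [20.0, 35.0] for this sample
print([feature_binning_sketch(a, seps) for a in ages])
# -> [1, 1, 2, 2, 0, 2, 1]
```

With these boundaries, each age falls into bin 0, 1, or 2, matching the mapping-table example on this page.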
+ +*Note: This feature is supported from Hivemall v0.5-rc.1 or later.* + +<!-- toc --> + +# Usage + +Prepare sample data (*users* table) first as follows: + +``` sql +CREATE TABLE users ( + name string, age int, gender string +); + +INSERT INTO users VALUES + ('Jacob', 20, 'Male'), + ('Mason', 22, 'Male'), + ('Sophia', 35, 'Female'), + ('Ethan', 55, 'Male'), + ('Emma', 15, 'Female'), + ('Noah', 46, 'Male'), + ('Isabella', 20, 'Female'); +``` + +## A. Feature Vector Transformation by applying Feature Binning + +``` sql +WITH t AS ( + SELECT + array_concat( + categorical_features( + array('name', 'gender'), + name, gender + ), + quantitative_features( + array('age'), + age + ) + ) AS features + FROM + users +), +bins AS ( + SELECT + map('age', build_bins(age, 3)) AS quantiles_map + FROM + users +) +SELECT + feature_binning(features, quantiles_map) AS features +FROM + t CROSS JOIN bins; +``` + +*Result* + +| features: `array<feature::string>` | +| :-: | +| ["name#Jacob","gender#Male","age:1"] | +| ["name#Mason","gender#Male","age:1"] | +| ["name#Sophia","gender#Female","age:2"] | +| ["name#Ethan","gender#Male","age:2"] | +| ["name#Emma","gender#Female","age:0"] | +| ["name#Noah","gender#Male","age:2"] | +| ["name#Isabella","gender#Female","age:1"] | + + +## B. 
Get a mapping table by Feature Binning + +```sql +WITH bins AS ( + SELECT build_bins(age, 3) AS quantiles + FROM users +) +SELECT + age, feature_binning(age, quantiles) AS bin +FROM + users CROSS JOIN bins; +``` + +*Result* + +| age: `int` | bin: `int` | +|:-:|:-:| +| 20 | 1 | +| 22 | 1 | +| 35 | 2 | +| 55 | 2 | +| 15 | 0 | +| 46 | 2 | +| 20 | 1 | + +# Function Signature + +## [UDAF] `build_bins(weight, num_of_bins[, auto_shrink])` + +### Input + +| weight: int\|bigint\|float\|double | num\_of\_bins: `int` | [auto\_shrink: `boolean` = false] | +| :-: | :-: | :-: | +| weight | greater than or equal to 2 | behavior when separation values are duplicated: true=\>skip, false=\>exception | + +### Output + +| quantiles: `array<double>` | +| :-: | +| array of separation values | + +> #### Note +> Quantiles may be duplicated if `num_of_bins` is too large or the data set is too small. +> If `auto_shrink` is true, duplicated quantiles are skipped; otherwise, an exception is thrown. + +## [UDF] `feature_binning(features, quantiles_map)/(weight, quantiles)` + +### Variation: A + +#### Input + +| features: `array<feature::string>` | quantiles\_map: `map<string, array<double>>` | +| :-: | :-: | +| serialized features | entry: key = column name, value = quantiles | + +#### Output + +| features: `array<feature::string>` | +| :-: | +| serialized and binned features | + +### Variation: B + +#### Input + +| weight: int\|bigint\|float\|double | quantiles: `array<double>` | +| :-: | :-: | +| weight | array of separation values | + +#### Output + +| bin: `int` | +| :-: | +| categorical value (bin ID) | http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/211c2803/docs/gitbook/ft_engineering/feature_selection.md ---------------------------------------------------------------------- diff --git a/docs/gitbook/ft_engineering/feature_selection.md b/docs/gitbook/ft_engineering/feature_selection.md deleted file mode 100644 index b19ba56..0000000 --- a/docs/gitbook/ft_engineering/feature_selection.md +++ /dev/null @@ -1,155 +0,0 @@ -<!-- 
- Licensed to the Apache Software Foundation (ASF) under one - or more contributor license agreements. See the NOTICE file - distributed with this work for additional information - regarding copyright ownership. The ASF licenses this file - to you under the Apache License, Version 2.0 (the - "License"); you may not use this file except in compliance - with the License. You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - - Unless required by applicable law or agreed to in writing, - software distributed under the License is distributed on an - "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY - KIND, either express or implied. See the License for the - specific language governing permissions and limitations - under the License. ---> - -[Feature Selection](https://en.wikipedia.org/wiki/Feature_selection) is the process of selecting a subset of relevant features for use in model construction. - -It is a useful technique to 1) improve prediction results by omitting redundant features, 2) to shorten training time, and 3) to know important features for prediction. - -*Note: This feature is supported from Hivemall v0.5-rc.1 or later.* - -<!-- toc --> - -# Supported Feature Selection algorithms - -* Chi-square (Chi2) - * In statistics, the $$\chi^2$$ test is applied to test the independence of two even events. Chi-square statistics between every feature variable and the target variable can be applied to Feature Selection. Refer [this article](http://nlp.stanford.edu/IR-book/html/htmledition/feature-selectionchi2-feature-selection-1.html) for Mathematical details. -* Signal Noise Ratio (SNR) - * The Signal Noise Ratio (SNR) is a univariate feature ranking metric, which can be used as a feature selection criterion for binary classification problems. 
SNR is defined as $$|\mu_{1} - \mu_{2}| / (\sigma_{1} + \sigma_{2})$$, where $$\mu_{k}$$ is the mean value of the variable in classes $$k$$, and $$\sigma_{k}$$ is the standard deviations of the variable in classes $$k$$. Clearly, features with larger SNR are useful for classification. - -# Usage - -## Feature Selection based on Chi-square test - -``` sql -CREATE TABLE input ( - X array<double>, -- features - Y array<int> -- binarized label -); - -set hivevar:k=2; - -WITH stats AS ( - SELECT - transpose_and_dot(Y, X) AS observed, -- array<array<double>>, shape = (n_classes, n_features) - array_sum(X) AS feature_count, -- n_features col vector, shape = (1, array<double>) - array_avg(Y) AS class_prob -- n_class col vector, shape = (1, array<double>) - FROM - input -), -test AS ( - SELECT - transpose_and_dot(class_prob, feature_count) AS expected -- array<array<double>>, shape = (n_class, n_features) - FROM - stats -), -chi2 AS ( - SELECT - chi2(r.observed, l.expected) AS v -- struct<array<double>, array<double>>, each shape = (1, n_features) - FROM - test l - CROSS JOIN stats r -) -SELECT - select_k_best(l.X, r.v.chi2, ${k}) as features -- top-k feature selection based on chi2 score -FROM - input l - CROSS JOIN chi2 r; -``` - -## Feature Selection based on Signal Noise Ratio (SNR) - -``` sql -CREATE TABLE input ( - X array<double>, -- features - Y array<int> -- binarized label -); - -set hivevar:k=2; - -WITH snr AS ( - SELECT snr(X, Y) AS snr -- aggregated SNR as array<double>, shape = (1, #features) - FROM input -) -SELECT - select_k_best(X, snr, ${k}) as features -FROM - input - CROSS JOIN snr; -``` - -# Function signatures - -### [UDAF] `transpose_and_dot(X::array<number>, Y::array<number>)::array<array<double>>` - -##### Input - -| `array<number>` X | `array<number>` Y | -| :-: | :-: | -| a row of matrix | a row of matrix | - -##### Output - -| `array<array<double>>` dot product | -| :-: | -| `dot(X.T, Y)` of shape = (X.#cols, Y.#cols) | - -### [UDF] 
`select_k_best(X::array<number>, importance_list::array<number>, k::int)::array<double>` - -##### Input - -| `array<number>` X | `array<number>` importance_list | `int` k | -| :-: | :-: | :-: | -| feature vector | importance of each feature | the number of features to be selected | - -##### Output - -| `array<array<double>>` k-best features | -| :-: | -| top-k elements from feature vector `X` based on importance list | - -### [UDF] `chi2(observed::array<array<number>>, expected::array<array<number>>)::struct<array<double>, array<double>>` - -##### Input - -| `array<number>` observed | `array<number>` expected | -| :-: | :-: | -| observed features | expected features `dot(class_prob.T, feature_count)` | - -Both of `observed` and `expected` have a shape `(#classes, #features)` - -##### Output - -| `struct<array<double>, array<double>>` importance_list | -| :-: | -| chi2-value and p-value for each feature | - -### [UDAF] `snr(X::array<number>, Y::array<int>)::array<double>` - -##### Input - -| `array<number>` X | `array<int>` Y | -| :-: | :-: | -| feature vector | one hot label | - -##### Output - -| `array<double>` importance_list | -| :-: | -| Signal Noise Ratio for each feature | - http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/211c2803/docs/gitbook/ft_engineering/scaling.md ---------------------------------------------------------------------- diff --git a/docs/gitbook/ft_engineering/scaling.md b/docs/gitbook/ft_engineering/scaling.md index 26d82bd..7f388d6 100644 --- a/docs/gitbook/ft_engineering/scaling.md +++ b/docs/gitbook/ft_engineering/scaling.md @@ -16,7 +16,9 @@ specific language governing permissions and limitations under the License. 
--> - + +<!-- toc --> + # Min-Max Normalization http://en.wikipedia.org/wiki/Feature_scaling#Rescaling ```sql http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/211c2803/docs/gitbook/ft_engineering/selection.md ---------------------------------------------------------------------- diff --git a/docs/gitbook/ft_engineering/selection.md b/docs/gitbook/ft_engineering/selection.md new file mode 100644 index 0000000..b19ba56 --- /dev/null +++ b/docs/gitbook/ft_engineering/selection.md @@ -0,0 +1,155 @@ +<!-- + Licensed to the Apache Software Foundation (ASF) under one + or more contributor license agreements. See the NOTICE file + distributed with this work for additional information + regarding copyright ownership. The ASF licenses this file + to you under the Apache License, Version 2.0 (the + "License"); you may not use this file except in compliance + with the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, + software distributed under the License is distributed on an + "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + KIND, either express or implied. See the License for the + specific language governing permissions and limitations + under the License. +--> + +[Feature Selection](https://en.wikipedia.org/wiki/Feature_selection) is the process of selecting a subset of relevant features for use in model construction. + +It is a useful technique for 1) improving prediction results by omitting redundant features, 2) shortening training time, and 3) understanding which features are important for prediction. + +*Note: This feature is supported from Hivemall v0.5-rc.1 or later.* + +<!-- toc --> + +# Supported Feature Selection algorithms + +* Chi-square (Chi2) + * In statistics, the $$\chi^2$$ test is applied to test the independence of two events. Chi-square statistics between every feature variable and the target variable can be applied to Feature Selection. 
Refer to [this article](http://nlp.stanford.edu/IR-book/html/htmledition/feature-selectionchi2-feature-selection-1.html) for mathematical details. +* Signal Noise Ratio (SNR) + * The Signal Noise Ratio (SNR) is a univariate feature ranking metric, which can be used as a feature selection criterion for binary classification problems. SNR is defined as $$|\mu_{1} - \mu_{2}| / (\sigma_{1} + \sigma_{2})$$, where $$\mu_{k}$$ is the mean value of the variable in class $$k$$, and $$\sigma_{k}$$ is the standard deviation of the variable in class $$k$$. Clearly, features with larger SNR are useful for classification. + +# Usage + +## Feature Selection based on Chi-square test + +``` sql +CREATE TABLE input ( + X array<double>, -- features + Y array<int> -- binarized label +); + +set hivevar:k=2; + +WITH stats AS ( + SELECT + transpose_and_dot(Y, X) AS observed, -- array<array<double>>, shape = (n_classes, n_features) + array_sum(X) AS feature_count, -- n_features col vector, shape = (1, array<double>) + array_avg(Y) AS class_prob -- n_class col vector, shape = (1, array<double>) + FROM + input +), +test AS ( + SELECT + transpose_and_dot(class_prob, feature_count) AS expected -- array<array<double>>, shape = (n_class, n_features) + FROM + stats +), +chi2 AS ( + SELECT + chi2(r.observed, l.expected) AS v -- struct<array<double>, array<double>>, each shape = (1, n_features) + FROM + test l + CROSS JOIN stats r +) +SELECT + select_k_best(l.X, r.v.chi2, ${k}) as features -- top-k feature selection based on chi2 score +FROM + input l + CROSS JOIN chi2 r; +``` + +## Feature Selection based on Signal Noise Ratio (SNR) + +``` sql +CREATE TABLE input ( + X array<double>, -- features + Y array<int> -- binarized label +); + +set hivevar:k=2; + +WITH snr AS ( + SELECT snr(X, Y) AS snr -- aggregated SNR as array<double>, shape = (1, #features) + FROM input +) +SELECT + select_k_best(X, snr, ${k}) as features +FROM + input + CROSS JOIN snr; +``` + +# Function signatures + +### [UDAF] 
`transpose_and_dot(X::array<number>, Y::array<number>)::array<array<double>>` + +##### Input + +| `array<number>` X | `array<number>` Y | +| :-: | :-: | +| a row of the matrix | a row of the matrix | + +##### Output + +| `array<array<double>>` dot product | +| :-: | +| `dot(X.T, Y)` of shape = (X.#cols, Y.#cols) | + +### [UDF] `select_k_best(X::array<number>, importance_list::array<number>, k::int)::array<double>` + +##### Input + +| `array<number>` X | `array<number>` importance_list | `int` k | +| :-: | :-: | :-: | +| feature vector | importance of each feature | the number of features to be selected | + +##### Output + +| `array<double>` k-best features | +| :-: | +| top-k elements from feature vector `X` based on importance list | + +### [UDF] `chi2(observed::array<array<number>>, expected::array<array<number>>)::struct<array<double>, array<double>>` + +##### Input + +| `array<array<number>>` observed | `array<array<number>>` expected | +| :-: | :-: | +| observed features | expected features `dot(class_prob.T, feature_count)` | + +Both `observed` and `expected` have shape `(#classes, #features)` + +##### Output + +| `struct<array<double>, array<double>>` importance_list | +| :-: | +| chi2-value and p-value for each feature | + +### [UDAF] `snr(X::array<number>, Y::array<int>)::array<double>` + +##### Input + +| `array<number>` X | `array<int>` Y | +| :-: | :-: | +| feature vector | one-hot label | + +##### Output + +| `array<double>` importance_list | +| :-: | +| Signal Noise Ratio for each feature | + http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/211c2803/docs/gitbook/ft_engineering/vectorization.md ---------------------------------------------------------------------- diff --git a/docs/gitbook/ft_engineering/vectorization.md b/docs/gitbook/ft_engineering/vectorization.md new file mode 100644 index 0000000..21fcea7 --- /dev/null +++ b/docs/gitbook/ft_engineering/vectorization.md @@ -0,0 +1,61 @@ +<!-- + Licensed to the Apache Software Foundation (ASF) 
under one + or more contributor license agreements. See the NOTICE file + distributed with this work for additional information + regarding copyright ownership. The ASF licenses this file + to you under the Apache License, Version 2.0 (the + "License"); you may not use this file except in compliance + with the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, + software distributed under the License is distributed on an + "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + KIND, either express or implied. See the License for the + specific language governing permissions and limitations + under the License. +--> + +## Feature Vectorization + +`array<string> vectorize_features(array<string> featureNames, ...)` is useful to generate a feature vector for each row of a table. + +```sql +select vectorize_features(array("a","b"),"0.2","0.3") from dual; +>["a:0.2","b:0.3"] + +-- avoid zero weight +select vectorize_features(array("a","b"),"0.2",0) from dual; +> ["a:0.2"] + +-- a true boolean value is treated as 1.0 (with its column name) +select vectorize_features(array("a","b","bool"),0.2,0.3,true) from dual; +> ["a:0.2","b:0.3","bool:1.0"] + +-- an example of generating feature vectors from a table +select * from dual; +> 1 +select vectorize_features(array("a"),*) from dual; +> ["a:1.0"] + +-- with a categorical feature +select vectorize_features(array("a","b","weather"),"0.2","0.3","sunny") from dual; +> ["a:0.2","b:0.3","weather#sunny"] +``` + +```sql +select + id, + vectorize_features( + array("age","job","marital","education","default","balance","housing","loan","contact","day","month","duration","campaign","pdays","previous","poutcome"), + age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome + ) as features, + y +from + train +limit 2; +``` + 
["age:39.0","job#blue-collar","marital#married","education#secondary","default#no","balance:1756.0","housing#yes","loan#no","contact#cellular","day:3.0","month#apr","duration:939.0","campaign:1.0","pdays:-1.0","poutcome#unknown"] 1 +> 2 ["age:51.0","job#entrepreneur","marital#married","education#primary","default#no","balance:1443.0","housing#no","loan#no","contact#cellular","day:18.0","month#feb","duration:172.0","campaign:10.0","pdays:-1.0","poutcome#unknown"] 1 http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/211c2803/docs/gitbook/ft_engineering/vectorizer.md ---------------------------------------------------------------------- diff --git a/docs/gitbook/ft_engineering/vectorizer.md b/docs/gitbook/ft_engineering/vectorizer.md deleted file mode 100644 index 59038d1..0000000 --- a/docs/gitbook/ft_engineering/vectorizer.md +++ /dev/null @@ -1,61 +0,0 @@ -<!-- - Licensed to the Apache Software Foundation (ASF) under one - or more contributor license agreements. See the NOTICE file - distributed with this work for additional information - regarding copyright ownership. The ASF licenses this file - to you under the Apache License, Version 2.0 (the - "License"); you may not use this file except in compliance - with the License. You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - - Unless required by applicable law or agreed to in writing, - software distributed under the License is distributed on an - "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY - KIND, either express or implied. See the License for the - specific language governing permissions and limitations - under the License. ---> - -## Feature Vectorizer - -`array<string> vectorize_feature(array<string> featureNames, ...)` is useful to generate a feature vector for each row, from a table. 
- -```sql -select vectorize_features(array("a","b"),"0.2","0.3") from dual; ->["a:0.2","b:0.3"] - --- avoid zero weight -select vectorize_features(array("a","b"),"0.2",0) from dual; -> ["a:0.2"] - --- true boolean value is treated as 1.0 (a categorical value w/ its column name) -select vectorize_features(array("a","b","bool"),0.2,0.3,true) from dual; -> ["a:0.2","b:0.3","bool:1.0"] - --- an example to generate feature vectors from table -select * from dual; -> 1 -select vectorize_features(array("a"),*) from dual; -> ["a:1.0"] - --- has categorical feature -select vectorize_features(array("a","b","wheather"),"0.2","0.3","sunny") from dual; -> ["a:0.2","b:0.3","whether#sunny"] -``` - -```sql -select - id, - vectorize_features( - array("age","job","marital","education","default","balance","housing","loan","contact","day","month","duration","campaign","pdays","previous","poutcome"), - age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome - ) as features, - y -from - train -limit 2; - -> 1 ["age:39.0","job#blue-collar","marital#married","education#secondary","default#no","balance:1756.0","housing#yes","loan#no","contact#cellular","day:3.0","month#apr","duration:939.0","campaign:1.0","pdays:-1.0","poutcome#unknown"] 1 -> 2 ["age:51.0","job#entrepreneur","marital#married","education#primary","default#no","balance:1443.0","housing#no","loan#no","contact#cellular","day:18.0","month#feb","duration:172.0","campaign:10.0","pdays:-1.0","poutcome#unknown"] 1 -``` \ No newline at end of file
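As a closing illustration, the `vectorize_features` behavior documented above can be sketched in Python. This is a hypothetical re-implementation for intuition only; in particular, parsing numeric-looking strings as weights and dropping false booleans are assumptions, not confirmed Hivemall behavior:

```python
def vectorize_features(feature_names, *values):
    # Sketch of the UDF semantics: numbers become "name:weight" features,
    # non-numeric strings become categorical "name#value" features,
    # zero weights are dropped, and a true boolean becomes "name:1.0".
    features = []
    for name, value in zip(feature_names, values):
        if isinstance(value, bool):
            if value:  # true -> 1.0; dropping false here is an assumption
                features.append(f"{name}:1.0")
            continue
        try:
            weight = float(value)  # accepts both numbers and numeric strings
        except ValueError:
            features.append(f"{name}#{value}")  # categorical feature
        else:
            if weight != 0.0:  # zero weights are omitted
                features.append(f"{name}:{weight}")
    return features

print(vectorize_features(["a", "b", "weather"], "0.2", "0.3", "sunny"))
# -> ['a:0.2', 'b:0.3', 'weather#sunny']
```

This reproduces the small `dual` examples shown in vectorization.md, e.g. a zero weight yields `["a:0.2"]` and a true boolean yields `["a:0.2","b:0.3","bool:1.0"]`.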
