Repository: incubator-hivemall Updated Branches: refs/heads/master a31d0aab3 -> 211c28036
Close #77: [HIVEMALL-98] Feature binning documents Project: http://git-wip-us.apache.org/repos/asf/incubator-hivemall/repo Commit: http://git-wip-us.apache.org/repos/asf/incubator-hivemall/commit/211c2803 Tree: http://git-wip-us.apache.org/repos/asf/incubator-hivemall/tree/211c2803 Diff: http://git-wip-us.apache.org/repos/asf/incubator-hivemall/diff/211c2803 Branch: refs/heads/master Commit: 211c28036e4a7e7549b3e21fae723f207d85aa09 Parents: a31d0aa Author: Ryuichi Ito <[email protected]> Authored: Mon May 8 17:39:44 2017 +0900 Committer: myui <[email protected]> Committed: Mon May 8 17:39:44 2017 +0900 ---------------------------------------------------------------------- docs/gitbook/SUMMARY.md | 9 +- docs/gitbook/ft_engineering/binning.md | 162 +++++++++++++++++++ .../gitbook/ft_engineering/feature_selection.md | 155 ------------------ docs/gitbook/ft_engineering/scaling.md | 4 +- docs/gitbook/ft_engineering/selection.md | 155 ++++++++++++++++++ docs/gitbook/ft_engineering/vectorization.md | 61 +++++++ docs/gitbook/ft_engineering/vectorizer.md | 61 ------- 7 files changed, 385 insertions(+), 222 deletions(-) ---------------------------------------------------------------------- http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/211c2803/docs/gitbook/SUMMARY.md ---------------------------------------------------------------------- diff --git a/docs/gitbook/SUMMARY.md b/docs/gitbook/SUMMARY.md index 3d035d7..809a548 100644 --- a/docs/gitbook/SUMMARY.md +++ b/docs/gitbook/SUMMARY.md @@ -55,14 +55,13 @@ * [Feature Scaling](ft_engineering/scaling.md) * [Feature Hashing](ft_engineering/hashing.md) -* [TF-IDF calculation](ft_engineering/tfidf.md) - +* [Feature Selection](ft_engineering/selection.md) +* [Feature Binning](ft_engineering/binning.md) +* [TF-IDF Calculation](ft_engineering/tfidf.md) * [FEATURE TRANSFORMATION](ft_engineering/ft_trans.md) - * [Vectorize Features](ft_engineering/vectorizer.md) + * [Feature 
Vectorization](ft_engineering/vectorization.md) * [Quantify non-number features](ft_engineering/quantify.md) -* [Feature selection](ft_engineering/feature_selection.md) - ## Part IV - Evaluation * [Statistical evaluation of a prediction model](eval/stat_eval.md) http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/211c2803/docs/gitbook/ft_engineering/binning.md ---------------------------------------------------------------------- diff --git a/docs/gitbook/ft_engineering/binning.md b/docs/gitbook/ft_engineering/binning.md new file mode 100644 index 0000000..cd1ecbb --- /dev/null +++ b/docs/gitbook/ft_engineering/binning.md @@ -0,0 +1,162 @@ +<!-- + Licensed to the Apache Software Foundation (ASF) under one + or more contributor license agreements. See the NOTICE file + distributed with this work for additional information + regarding copyright ownership. The ASF licenses this file + to you under the Apache License, Version 2.0 (the + "License"); you may not use this file except in compliance + with the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, + software distributed under the License is distributed on an + "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + KIND, either express or implied. See the License for the + specific language governing permissions and limitations + under the License. +--> + +Feature binning is a technique for converting quantitative variables into categorical values. +It groups quantitative values into a pre-defined number of bins. 
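As a rough sketch of the idea — not the Hivemall implementation; the exact quantile method and boundary handling here are assumptions — equal-frequency binning can be illustrated in Python:

```python
from bisect import bisect_right
from statistics import quantiles

# Ages from the sample `users` table used on this page.
ages = [20, 22, 35, 55, 15, 46, 20]

def build_bins_sketch(values, num_of_bins):
    # Equal-frequency separation values (a hypothetical stand-in for the build_bins UDAF).
    return quantiles(values, n=num_of_bins, method="inclusive")

def feature_binning_sketch(value, seps):
    # Bin ID = how many separation values are <= the given value.
    return bisect_right(seps, value)

seps = build_bins_sketch(ages, 3)   # -> [20.0, 35.0] for this sample
print([feature_binning_sketch(a, seps) for a in ages])
# -> [1, 1, 2, 2, 0, 2, 1]
```

With these boundaries, each age falls into bin 0, 1, or 2, matching the mapping-table example on this page.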
+ +*Note: This feature is supported from Hivemall v0.5-rc.1 or later.* + +<!-- toc --> + +# Usage + +Prepare sample data (*users* table) first as follows: + +``` sql +CREATE TABLE users ( + name string, age int, gender string +); + +INSERT INTO users VALUES + ('Jacob', 20, 'Male'), + ('Mason', 22, 'Male'), + ('Sophia', 35, 'Female'), + ('Ethan', 55, 'Male'), + ('Emma', 15, 'Female'), + ('Noah', 46, 'Male'), + ('Isabella', 20, 'Female'); +``` + +## A. Feature Vector Transformation by applying Feature Binning + +``` sql +WITH t AS ( + SELECT + array_concat( + categorical_features( + array('name', 'gender'), + name, gender + ), + quantitative_features( + array('age'), + age + ) + ) AS features + FROM + users +), +bins AS ( + SELECT + map('age', build_bins(age, 3)) AS quantiles_map + FROM + users +) +SELECT + feature_binning(features, quantiles_map) AS features +FROM + t CROSS JOIN bins; +``` + +*Result* + +| features: `array<feature::string>` | +| :-: | +| ["name#Jacob","gender#Male","age:1"] | +| ["name#Mason","gender#Male","age:1"] | +| ["name#Sophia","gender#Female","age:2"] | +| ["name#Ethan","gender#Male","age:2"] | +| ["name#Emma","gender#Female","age:0"] | +| ["name#Noah","gender#Male","age:2"] | +| ["name#Isabella","gender#Female","age:1"] | + + +## B. 
Get a mapping table by Feature Binning + +```sql +WITH bins AS ( + SELECT build_bins(age, 3) AS quantiles + FROM users +) +SELECT + age, feature_binning(age, quantiles) AS bin +FROM + users CROSS JOIN bins; +``` + +*Result* + +| age: `int` | bin: `int` | +|:-:|:-:| +| 20 | 1 | +| 22 | 1 | +| 35 | 2 | +| 55 | 2 | +| 15 | 0 | +| 46 | 2 | +| 20 | 1 | + +# Function Signature + +## [UDAF] `build_bins(weight, num_of_bins[, auto_shrink])` + +### Input + +| weight: int\|bigint\|float\|double | num\_of\_bins: `int` | [auto\_shrink: `boolean` = false] | +| :-: | :-: | :-: | +| weight | greater than or equal to 2 | behavior when separation values are duplicated: true=\>skip, false=\>exception | + +### Output + +| quantiles: `array<double>` | +| :-: | +| array of separation values | + +> #### Note +> Quantiles may be duplicated if `num_of_bins` is too large or the data set is too small. +> If `auto_shrink` is true, duplicated quantiles are skipped; otherwise, an exception is thrown. + +## [UDF] `feature_binning(features, quantiles_map)/(weight, quantiles)` + +### Variation: A + +#### Input + +| features: `array<feature::string>` | quantiles\_map: `map<string, array<double>>` | +| :-: | :-: | +| serialized features | entry: key = column name, value = quantiles | + +#### Output + +| features: `array<feature::string>` | +| :-: | +| serialized and binned features | + +### Variation: B + +#### Input + +| weight: int\|bigint\|float\|double | quantiles: `array<double>` | +| :-: | :-: | +| weight | array of separation values | + +#### Output + +| bin: `int` | +| :-: | +| categorical value (bin ID) | http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/211c2803/docs/gitbook/ft_engineering/feature_selection.md ---------------------------------------------------------------------- diff --git a/docs/gitbook/ft_engineering/feature_selection.md b/docs/gitbook/ft_engineering/feature_selection.md deleted file mode 100644 index b19ba56..0000000 --- a/docs/gitbook/ft_engineering/feature_selection.md +++ /dev/null @@ -1,155 +0,0 @@ -<!-- 
- Licensed to the Apache Software Foundation (ASF) under one - or more contributor license agreements. See the NOTICE file - distributed with this work for additional information - regarding copyright ownership. The ASF licenses this file - to you under the Apache License, Version 2.0 (the - "License"); you may not use this file except in compliance - with the License. You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - - Unless required by applicable law or agreed to in writing, - software distributed under the License is distributed on an - "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY - KIND, either express or implied. See the License for the - specific language governing permissions and limitations - under the License. ---> - -[Feature Selection](https://en.wikipedia.org/wiki/Feature_selection) is the process of selecting a subset of relevant features for use in model construction. - -It is a useful technique to 1) improve prediction results by omitting redundant features, 2) to shorten training time, and 3) to know important features for prediction. - -*Note: This feature is supported from Hivemall v0.5-rc.1 or later.* - -<!-- toc --> - -# Supported Feature Selection algorithms - -* Chi-square (Chi2) - * In statistics, the $$\chi^2$$ test is applied to test the independence of two even events. Chi-square statistics between every feature variable and the target variable can be applied to Feature Selection. Refer [this article](http://nlp.stanford.edu/IR-book/html/htmledition/feature-selectionchi2-feature-selection-1.html) for Mathematical details. -* Signal Noise Ratio (SNR) - * The Signal Noise Ratio (SNR) is a univariate feature ranking metric, which can be used as a feature selection criterion for binary classification problems. 
SNR is defined as $$|\mu_{1} - \mu_{2}| / (\sigma_{1} + \sigma_{2})$$, where $$\mu_{k}$$ is the mean value of the variable in classes $$k$$, and $$\sigma_{k}$$ is the standard deviations of the variable in classes $$k$$. Clearly, features with larger SNR are useful for classification. - -# Usage - -## Feature Selection based on Chi-square test - -``` sql -CREATE TABLE input ( - X array<double>, -- features - Y array<int> -- binarized label -); - -set hivevar:k=2; - -WITH stats AS ( - SELECT - transpose_and_dot(Y, X) AS observed, -- array<array<double>>, shape = (n_classes, n_features) - array_sum(X) AS feature_count, -- n_features col vector, shape = (1, array<double>) - array_avg(Y) AS class_prob -- n_class col vector, shape = (1, array<double>) - FROM - input -), -test AS ( - SELECT - transpose_and_dot(class_prob, feature_count) AS expected -- array<array<double>>, shape = (n_class, n_features) - FROM - stats -), -chi2 AS ( - SELECT - chi2(r.observed, l.expected) AS v -- struct<array<double>, array<double>>, each shape = (1, n_features) - FROM - test l - CROSS JOIN stats r -) -SELECT - select_k_best(l.X, r.v.chi2, ${k}) as features -- top-k feature selection based on chi2 score -FROM - input l - CROSS JOIN chi2 r; -``` - -## Feature Selection based on Signal Noise Ratio (SNR) - -``` sql -CREATE TABLE input ( - X array<double>, -- features - Y array<int> -- binarized label -); - -set hivevar:k=2; - -WITH snr AS ( - SELECT snr(X, Y) AS snr -- aggregated SNR as array<double>, shape = (1, #features) - FROM input -) -SELECT - select_k_best(X, snr, ${k}) as features -FROM - input - CROSS JOIN snr; -``` - -# Function signatures - -### [UDAF] `transpose_and_dot(X::array<number>, Y::array<number>)::array<array<double>>` - -##### Input - -| `array<number>` X | `array<number>` Y | -| :-: | :-: | -| a row of matrix | a row of matrix | - -##### Output - -| `array<array<double>>` dot product | -| :-: | -| `dot(X.T, Y)` of shape = (X.#cols, Y.#cols) | - -### [UDF] 
`select_k_best(X::array<number>, importance_list::array<number>, k::int)::array<double>` - -##### Input - -| `array<number>` X | `array<number>` importance_list | `int` k | -| :-: | :-: | :-: | -| feature vector | importance of each feature | the number of features to be selected | - -##### Output - -| `array<array<double>>` k-best features | -| :-: | -| top-k elements from feature vector `X` based on importance list | - -### [UDF] `chi2(observed::array<array<number>>, expected::array<array<number>>)::struct<array<double>, array<double>>` - -##### Input - -| `array<number>` observed | `array<number>` expected | -| :-: | :-: | -| observed features | expected features `dot(class_prob.T, feature_count)` | - -Both of `observed` and `expected` have a shape `(#classes, #features)` - -##### Output - -| `struct<array<double>, array<double>>` importance_list | -| :-: | -| chi2-value and p-value for each feature | - -### [UDAF] `snr(X::array<number>, Y::array<int>)::array<double>` - -##### Input - -| `array<number>` X | `array<int>` Y | -| :-: | :-: | -| feature vector | one hot label | - -##### Output - -| `array<double>` importance_list | -| :-: | -| Signal Noise Ratio for each feature | - http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/211c2803/docs/gitbook/ft_engineering/scaling.md ---------------------------------------------------------------------- diff --git a/docs/gitbook/ft_engineering/scaling.md b/docs/gitbook/ft_engineering/scaling.md index 26d82bd..7f388d6 100644 --- a/docs/gitbook/ft_engineering/scaling.md +++ b/docs/gitbook/ft_engineering/scaling.md @@ -16,7 +16,9 @@ specific language governing permissions and limitations under the License. 
--> - + +<!-- toc --> + # Min-Max Normalization http://en.wikipedia.org/wiki/Feature_scaling#Rescaling ```sql http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/211c2803/docs/gitbook/ft_engineering/selection.md ---------------------------------------------------------------------- diff --git a/docs/gitbook/ft_engineering/selection.md b/docs/gitbook/ft_engineering/selection.md new file mode 100644 index 0000000..b19ba56 --- /dev/null +++ b/docs/gitbook/ft_engineering/selection.md @@ -0,0 +1,155 @@ +<!-- + Licensed to the Apache Software Foundation (ASF) under one + or more contributor license agreements. See the NOTICE file + distributed with this work for additional information + regarding copyright ownership. The ASF licenses this file + to you under the Apache License, Version 2.0 (the + "License"); you may not use this file except in compliance + with the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, + software distributed under the License is distributed on an + "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + KIND, either express or implied. See the License for the + specific language governing permissions and limitations + under the License. +--> + +[Feature Selection](https://en.wikipedia.org/wiki/Feature_selection) is the process of selecting a subset of relevant features for use in model construction. + +It is a useful technique for 1) improving prediction results by omitting redundant features, 2) shortening training time, and 3) understanding which features are important for prediction. + +*Note: This feature is supported from Hivemall v0.5-rc.1 or later.* + +<!-- toc --> + +# Supported Feature Selection algorithms + +* Chi-square (Chi2) + * In statistics, the $$\chi^2$$ test is applied to test the independence of two events. Chi-square statistics between every feature variable and the target variable can be applied to Feature Selection. 
Refer to [this article](http://nlp.stanford.edu/IR-book/html/htmledition/feature-selectionchi2-feature-selection-1.html) for mathematical details. +* Signal Noise Ratio (SNR) + * The Signal Noise Ratio (SNR) is a univariate feature ranking metric, which can be used as a feature selection criterion for binary classification problems. SNR is defined as $$|\mu_{1} - \mu_{2}| / (\sigma_{1} + \sigma_{2})$$, where $$\mu_{k}$$ is the mean value of the variable in class $$k$$, and $$\sigma_{k}$$ is the standard deviation of the variable in class $$k$$. Clearly, features with larger SNR are useful for classification. + +# Usage + +## Feature Selection based on Chi-square test + +``` sql +CREATE TABLE input ( + X array<double>, -- features + Y array<int> -- binarized label +); + +set hivevar:k=2; + +WITH stats AS ( + SELECT + transpose_and_dot(Y, X) AS observed, -- array<array<double>>, shape = (n_classes, n_features) + array_sum(X) AS feature_count, -- n_features col vector, shape = (1, array<double>) + array_avg(Y) AS class_prob -- n_class col vector, shape = (1, array<double>) + FROM + input +), +test AS ( + SELECT + transpose_and_dot(class_prob, feature_count) AS expected -- array<array<double>>, shape = (n_class, n_features) + FROM + stats +), +chi2 AS ( + SELECT + chi2(r.observed, l.expected) AS v -- struct<array<double>, array<double>>, each shape = (1, n_features) + FROM + test l + CROSS JOIN stats r +) +SELECT + select_k_best(l.X, r.v.chi2, ${k}) as features -- top-k feature selection based on chi2 score +FROM + input l + CROSS JOIN chi2 r; +``` + +## Feature Selection based on Signal Noise Ratio (SNR) + +``` sql +CREATE TABLE input ( + X array<double>, -- features + Y array<int> -- binarized label +); + +set hivevar:k=2; + +WITH snr AS ( + SELECT snr(X, Y) AS snr -- aggregated SNR as array<double>, shape = (1, #features) + FROM input +) +SELECT + select_k_best(X, snr, ${k}) as features +FROM + input + CROSS JOIN snr; +``` + +# Function signatures + +### [UDAF] 
`transpose_and_dot(X::array<number>, Y::array<number>)::array<array<double>>` + +##### Input + +| `array<number>` X | `array<number>` Y | +| :-: | :-: | +| a row of the matrix | a row of the matrix | + +##### Output + +| `array<array<double>>` dot product | +| :-: | +| `dot(X.T, Y)` of shape = (X.#cols, Y.#cols) | + +### [UDF] `select_k_best(X::array<number>, importance_list::array<number>, k::int)::array<double>` + +##### Input + +| `array<number>` X | `array<number>` importance_list | `int` k | +| :-: | :-: | :-: | +| feature vector | importance of each feature | the number of features to be selected | + +##### Output + +| `array<double>` k-best features | +| :-: | +| top-k elements from feature vector `X` based on importance list | + +### [UDF] `chi2(observed::array<array<number>>, expected::array<array<number>>)::struct<array<double>, array<double>>` + +##### Input + +| `array<array<number>>` observed | `array<array<number>>` expected | +| :-: | :-: | +| observed features | expected features `dot(class_prob.T, feature_count)` | + +Both `observed` and `expected` have shape `(#classes, #features)` + +##### Output + +| `struct<array<double>, array<double>>` importance_list | +| :-: | +| chi2-value and p-value for each feature | + +### [UDAF] `snr(X::array<number>, Y::array<int>)::array<double>` + +##### Input + +| `array<number>` X | `array<int>` Y | +| :-: | :-: | +| feature vector | one-hot label | + +##### Output + +| `array<double>` importance_list | +| :-: | +| Signal Noise Ratio for each feature | + http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/211c2803/docs/gitbook/ft_engineering/vectorization.md ---------------------------------------------------------------------- diff --git a/docs/gitbook/ft_engineering/vectorization.md b/docs/gitbook/ft_engineering/vectorization.md new file mode 100644 index 0000000..21fcea7 --- /dev/null +++ b/docs/gitbook/ft_engineering/vectorization.md @@ -0,0 +1,61 @@ +<!-- + Licensed to the Apache Software Foundation (ASF) 
under one + or more contributor license agreements. See the NOTICE file + distributed with this work for additional information + regarding copyright ownership. The ASF licenses this file + to you under the Apache License, Version 2.0 (the + "License"); you may not use this file except in compliance + with the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, + software distributed under the License is distributed on an + "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + KIND, either express or implied. See the License for the + specific language governing permissions and limitations + under the License. +--> + +## Feature Vectorization + +`array<string> vectorize_features(array<string> featureNames, ...)` is useful to generate a feature vector for each row of a table. + +```sql +select vectorize_features(array("a","b"),"0.2","0.3") from dual; +>["a:0.2","b:0.3"] + +-- avoid zero weight +select vectorize_features(array("a","b"),"0.2",0) from dual; +> ["a:0.2"] + +-- a true boolean value is treated as 1.0 (with its column name) +select vectorize_features(array("a","b","bool"),0.2,0.3,true) from dual; +> ["a:0.2","b:0.3","bool:1.0"] + +-- an example of generating feature vectors from a table +select * from dual; +> 1 +select vectorize_features(array("a"),*) from dual; +> ["a:1.0"] + +-- with a categorical feature +select vectorize_features(array("a","b","weather"),"0.2","0.3","sunny") from dual; +> ["a:0.2","b:0.3","weather#sunny"] +``` + +```sql +select + id, + vectorize_features( + array("age","job","marital","education","default","balance","housing","loan","contact","day","month","duration","campaign","pdays","previous","poutcome"), + age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome + ) as features, + y +from + train +limit 2; +``` + 
["age:39.0","job#blue-collar","marital#married","education#secondary","default#no","balance:1756.0","housing#yes","loan#no","contact#cellular","day:3.0","month#apr","duration:939.0","campaign:1.0","pdays:-1.0","poutcome#unknown"] 1 +> 2 ["age:51.0","job#entrepreneur","marital#married","education#primary","default#no","balance:1443.0","housing#no","loan#no","contact#cellular","day:18.0","month#feb","duration:172.0","campaign:10.0","pdays:-1.0","poutcome#unknown"] 1 http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/211c2803/docs/gitbook/ft_engineering/vectorizer.md ---------------------------------------------------------------------- diff --git a/docs/gitbook/ft_engineering/vectorizer.md b/docs/gitbook/ft_engineering/vectorizer.md deleted file mode 100644 index 59038d1..0000000 --- a/docs/gitbook/ft_engineering/vectorizer.md +++ /dev/null @@ -1,61 +0,0 @@ -<!-- - Licensed to the Apache Software Foundation (ASF) under one - or more contributor license agreements. See the NOTICE file - distributed with this work for additional information - regarding copyright ownership. The ASF licenses this file - to you under the Apache License, Version 2.0 (the - "License"); you may not use this file except in compliance - with the License. You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - - Unless required by applicable law or agreed to in writing, - software distributed under the License is distributed on an - "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY - KIND, either express or implied. See the License for the - specific language governing permissions and limitations - under the License. ---> - -## Feature Vectorizer - -`array<string> vectorize_feature(array<string> featureNames, ...)` is useful to generate a feature vector for each row, from a table. 
- -```sql -select vectorize_features(array("a","b"),"0.2","0.3") from dual; ->["a:0.2","b:0.3"] - --- avoid zero weight -select vectorize_features(array("a","b"),"0.2",0) from dual; -> ["a:0.2"] - --- true boolean value is treated as 1.0 (a categorical value w/ its column name) -select vectorize_features(array("a","b","bool"),0.2,0.3,true) from dual; -> ["a:0.2","b:0.3","bool:1.0"] - --- an example to generate feature vectors from table -select * from dual; -> 1 -select vectorize_features(array("a"),*) from dual; -> ["a:1.0"] - --- has categorical feature -select vectorize_features(array("a","b","wheather"),"0.2","0.3","sunny") from dual; -> ["a:0.2","b:0.3","whether#sunny"] -``` - -```sql -select - id, - vectorize_features( - array("age","job","marital","education","default","balance","housing","loan","contact","day","month","duration","campaign","pdays","previous","poutcome"), - age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome - ) as features, - y -from - train -limit 2; - -> 1 ["age:39.0","job#blue-collar","marital#married","education#secondary","default#no","balance:1756.0","housing#yes","loan#no","contact#cellular","day:3.0","month#apr","duration:939.0","campaign:1.0","pdays:-1.0","poutcome#unknown"] 1 -> 2 ["age:51.0","job#entrepreneur","marital#married","education#primary","default#no","balance:1443.0","housing#no","loan#no","contact#cellular","day:18.0","month#feb","duration:172.0","campaign:10.0","pdays:-1.0","poutcome#unknown"] 1 -``` \ No newline at end of file
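As a closing illustration, the `vectorize_features` behavior documented above can be sketched in Python. This is a hypothetical re-implementation for intuition only; in particular, parsing numeric-looking strings as weights and dropping false booleans are assumptions, not confirmed Hivemall behavior:

```python
def vectorize_features(feature_names, *values):
    # Sketch of the UDF semantics: numbers become "name:weight" features,
    # non-numeric strings become categorical "name#value" features,
    # zero weights are dropped, and a true boolean becomes "name:1.0".
    features = []
    for name, value in zip(feature_names, values):
        if isinstance(value, bool):
            if value:  # true -> 1.0; dropping false here is an assumption
                features.append(f"{name}:1.0")
            continue
        try:
            weight = float(value)  # accepts both numbers and numeric strings
        except ValueError:
            features.append(f"{name}#{value}")  # categorical feature
        else:
            if weight != 0.0:  # zero weights are omitted
                features.append(f"{name}:{weight}")
    return features

print(vectorize_features(["a", "b", "weather"], "0.2", "0.3", "sunny"))
# -> ['a:0.2', 'b:0.3', 'weather#sunny']
```

This reproduces the small `dual` examples shown in vectorization.md, e.g. a zero weight yields `["a:0.2"]` and a true boolean yields `["a:0.2","b:0.3","bool:1.0"]`.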
