This is an automated email from the ASF dual-hosted git repository. zhangzp pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/flink-ml.git
The following commit(s) were added to refs/heads/master by this push: new e007030 [FLINK-29171] Add documents for MaxAbs, FeatureHasher, Interaction, VectorSlicer, ElementwiseProduct and Binarizer e007030 is described below commit e007030da6e4a35ef0d11c3ad0c76cf723f17d63 Author: weibo <wbz...@pku.edu.cn> AuthorDate: Mon Sep 5 17:33:55 2022 +0800 [FLINK-29171] Add documents for MaxAbs, FeatureHasher, Interaction, VectorSlicer, ElementwiseProduct and Binarizer This closes #151. --- docs/content/docs/operators/classification/knn.md | 14 +- .../docs/operators/classification/linearsvc.md | 28 ++-- .../operators/classification/logisticregression.md | 40 ++--- .../docs/operators/classification/naivebayes.md | 24 +-- .../clustering/agglomerativeclustering.md | 3 +- docs/content/docs/operators/clustering/kmeans.md | 12 +- .../evaluation/binaryclassificationevaluator.md | 34 ++-- docs/content/docs/operators/feature/binarizer.md | 183 +++++++++++++++++++++ docs/content/docs/operators/feature/bucketizer.md | 24 +-- .../docs/operators/feature/elementwiseproduct.md | 157 ++++++++++++++++++ .../docs/operators/feature/featurehasher.md | 177 ++++++++++++++++++++ docs/content/docs/operators/feature/hashingtf.md | 165 +++++++++++++++++++ docs/content/docs/operators/feature/interaction.md | 169 +++++++++++++++++++ .../feature/{minmaxscaler.md => maxabsscaler.md} | 76 ++++----- .../content/docs/operators/feature/minmaxscaler.md | 14 +- .../docs/operators/feature/onehotencoder.md | 24 +-- .../docs/operators/feature/standardscaler.md | 14 +- .../docs/operators/feature/stringindexer.md | 22 +-- .../docs/operators/feature/vectorassembler.md | 14 +- .../content/docs/operators/feature/vectorslicer.md | 158 ++++++++++++++++++ .../docs/operators/regression/linearregression.md | 16 +- 21 files changed, 1188 insertions(+), 180 deletions(-) diff --git a/docs/content/docs/operators/classification/knn.md b/docs/content/docs/operators/classification/knn.md index 5af83f3..9d51c4e 100644 --- a/docs/content/docs/operators/classification/knn.md +++ b/docs/content/docs/operators/classification/knn.md @@ -32,16 +32,16 @@ to that label. ### Input Columns -| Param name | Type | Default | Description | -| :---------- | :------ | :----------- | :--------------- | -| featuresCol | Vector | `"features"` | Feature vector | -| labelCol | Integer | `"label"` | Label to predict | +| Param name | Type | Default | Description | +| :---------- | :------ | :----------- |:------------------| +| featuresCol | Vector | `"features"` | Feature vector. | +| labelCol | Integer | `"label"` | Label to predict. | ### Output Columns -| Param name | Type | Default | Description | -| :------------ | :------ | :------------- | :-------------- | -| predictionCol | Integer | `"prediction"` | Predicted label | +| Param name | Type | Default | Description | +| :------------ | :------ | :------------- |:-----------------| +| predictionCol | Integer | `"prediction"` | Predicted label. | ### Parameters diff --git a/docs/content/docs/operators/classification/linearsvc.md b/docs/content/docs/operators/classification/linearsvc.md index c2530d5..6808578 100644 --- a/docs/content/docs/operators/classification/linearsvc.md +++ b/docs/content/docs/operators/classification/linearsvc.md @@ -31,28 +31,28 @@ a hyperplane to maximize the distance between classified samples. ### Input Columns -| Param name | Type | Default | Description | -| :---------- | :------ | :----------- | :--------------- | -| featuresCol | Vector | `"features"` | Feature vector | -| labelCol | Integer | `"label"` | Label to predict | -| weightCol | Double | `"weight"` | Weight of sample | +| Param name | Type | Default | Description | +| :---------- | :------ | :----------- |:------------------| +| featuresCol | Vector | `"features"` | Feature vector. | +| labelCol | Integer | `"label"` | Label to predict. | +| weightCol | Double | `"weight"` | Weight of sample. | ### Output Columns -| Param name | Type | Default | Description | -| :--------------- | :------ | :---------------- | :-------------------------------------- | -| predictionCol | Integer | `"prediction"` | Label of the max probability | -| rawPredictionCol | Vector | `"rawPrediction"` | Vector of the probability of each label | +| Param name | Type | Default | Description | +| :--------------- | :------ | :---------------- |:-----------------------------------------| +| predictionCol | Integer | `"prediction"` | Label of the max probability. | +| rawPredictionCol | Vector | `"rawPrediction"` | Vector of the probability of each label. | ### Parameters Below are the parameters required by `LinearSVCModel`. -| Key | Default | Type | Required | Description | -| ---------------- | ----------------- | ------ | -------- | ------------------------------------------------------------ | -| featuresCol | `"features"` | String | no | Features column name. | -| predictionCol | `"prediction"` | String | no | Prediction column name. | -| rawPredictionCol | `"rawPrediction"` | String | no | Raw prediction column name. | +| Key | Default | Type | Required | Description | +|------------------|-------------------|--------|----------|-------------------------------------------------------------------------| +| featuresCol | `"features"` | String | no | Features column name. | +| predictionCol | `"prediction"` | String | no | Prediction column name. | +| rawPredictionCol | `"rawPrediction"` | String | no | Raw prediction column name. | | threshold | `0.0` | Double | no | Threshold in binary classification prediction applied to rawPrediction. | `LinearSVC` needs parameters above and also below. diff --git a/docs/content/docs/operators/classification/logisticregression.md b/docs/content/docs/operators/classification/logisticregression.md index 26818e3..7f73664 100644 --- a/docs/content/docs/operators/classification/logisticregression.md +++ b/docs/content/docs/operators/classification/logisticregression.md @@ -30,18 +30,18 @@ widely used to predict a binary response. ### Input Columns -| Param name | Type | Default | Description | -| :---------- | :------ | :----------- | :--------------- | -| featuresCol | Vector | `"features"` | Feature vector | -| labelCol | Integer | `"label"` | Label to predict | -| weightCol | Double | `"weight"` | Weight of sample | +| Param name | Type | Default | Description | +| :---------- | :------ | :----------- |:------------------| +| featuresCol | Vector | `"features"` | Feature vector. | +| labelCol | Integer | `"label"` | Label to predict. | +| weightCol | Double | `"weight"` | Weight of sample. | ### Output Columns -| Param name | Type | Default | Description | -| :--------------- | :------ | :---------------- | :-------------------------------------- | -| predictionCol | Integer | `"prediction"` | Label of the max probability | -| rawPredictionCol | Vector | `"rawPrediction"` | Vector of the probability of each label | +| Param name | Type | Default | Description | +| :--------------- | :------ | :---------------- |:-----------------------------------------| +| predictionCol | Integer | `"prediction"` | Label of the max probability. | +| rawPredictionCol | Vector | `"rawPrediction"` | Vector of the probability of each label. | ### Parameters @@ -55,17 +55,17 @@ Below are the parameters required by `LogisticRegressionModel`. `LogisticRegression` needs parameters above and also below. -| Key | Default | Type | Required | Description | -| --------------- | --------- | ------- | -------- | ------------------------------------------------------------ | -| labelCol | `"label"` | String | no | Label column name. | -| weightCol | `null` | String | no | Weight column name. | -| maxIter | `20` | Integer | no | Maximum number of iterations. | -| reg | `0.` | Double | no | Regularization parameter. | -| elasticNet | `0.` | Double | no | ElasticNet parameter. | -| learningRate | `0.1` | Double | no | Learning rate of optimization method. | -| globalBatchSize | `32` | Integer | no | Global batch size of training algorithms. | -| tol | `1e-6` | Double | no | Convergence tolerance for iterative algorithms. | -| multiClass | `"auto"` | String | no | Classification type. Supported values: "auto", "binomial", "multinomial" | +| Key | Default | Type | Required | Description | +|-----------------|-----------|---------|----------|---------------------------------------------------------------------------| +| labelCol | `"label"` | String | no | Label column name. | +| weightCol | `null` | String | no | Weight column name. | +| maxIter | `20` | Integer | no | Maximum number of iterations. | +| reg | `0.` | Double | no | Regularization parameter. | +| elasticNet | `0.` | Double | no | ElasticNet parameter. | +| learningRate | `0.1` | Double | no | Learning rate of optimization method. | +| globalBatchSize | `32` | Integer | no | Global batch size of training algorithms. | +| tol | `1e-6` | Double | no | Convergence tolerance for iterative algorithms. | +| multiClass | `"auto"` | String | no | Classification type. Supported values: "auto", "binomial", "multinomial". | ### Examples {{< tabs examples >}} diff --git a/docs/content/docs/operators/classification/naivebayes.md b/docs/content/docs/operators/classification/naivebayes.md index b913c3f..756c3c0 100644 --- a/docs/content/docs/operators/classification/naivebayes.md +++ b/docs/content/docs/operators/classification/naivebayes.md @@ -30,26 +30,26 @@ there is strong (naive) independence between every pair of features. ### Input Columns -| Param name | Type | Default | Description | -| :---------- | :------ | :----------- | :--------------- | -| featuresCol | Vector | `"features"` | Feature vector | -| labelCol | Integer | `"label"` | Label to predict | +| Param name | Type | Default | Description | +| :---------- | :------ | :----------- |:------------------| +| featuresCol | Vector | `"features"` | Feature vector. | +| labelCol | Integer | `"label"` | Label to predict. | ### Output Columns -| Param name | Type | Default | Description | -| :------------ | :------ | :------------- | :-------------- | -| predictionCol | Integer | `"prediction"` | Predicted label | +| Param name | Type | Default | Description | +| :------------ | :------ | :------------- |:-----------------| +| predictionCol | Integer | `"prediction"` | Predicted label. | ### Parameters Below are parameters required by `NaiveBayesModel`. -| Key | Default | Type | Required | Description | -| ------------- | --------------- | ------ | -------- | ----------------------------------------------- | -| modelType | `"multinomial"` | String | no | The model type. Supported values: "multinomial" | -| featuresCol | `"features"` | String | no | Features column name. | -| predictionCol | `"prediction"` | String | no | Prediction column name. | +| Key | Default | Type | Required | Description | +| ------------- | --------------- | ------ | -------- |--------------------------------------------------| +| modelType | `"multinomial"` | String | no | The model type. Supported values: "multinomial". | +| featuresCol | `"features"` | String | no | Features column name. | +| predictionCol | `"prediction"` | String | no | Prediction column name. | `NaiveBayes` needs parameters above and also below. diff --git a/docs/content/docs/operators/clustering/agglomerativeclustering.md b/docs/content/docs/operators/clustering/agglomerativeclustering.md index c2988b5..7bca4b9 100644 --- a/docs/content/docs/operators/clustering/agglomerativeclustering.md +++ b/docs/content/docs/operators/clustering/agglomerativeclustering.md @@ -120,8 +120,7 @@ public class AgglomerativeClusteringExample { {{< tab "Python">}} ```python -# Simple program that creates a Bucketizer instance and uses it for feature -# engineering. +# Simple program that creates an agglomerativeclustering instance and uses it for clustering. from pyflink.common import Types from pyflink.datastream import StreamExecutionEnvironment diff --git a/docs/content/docs/operators/clustering/kmeans.md b/docs/content/docs/operators/clustering/kmeans.md index f5c4bd2..68ee678 100644 --- a/docs/content/docs/operators/clustering/kmeans.md +++ b/docs/content/docs/operators/clustering/kmeans.md @@ -30,15 +30,15 @@ into a predefined number of clusters. ### Input Columns -| Param name | Type | Default | Description | -|:------------|:-------|:-------------|:---------------| -| featuresCol | Vector | `"features"` | Feature vector | +| Param name | Type | Default | Description | +|:------------|:-------|:-------------|:----------------| +| featuresCol | Vector | `"features"` | Feature vector. | ### Output Columns -| Param name | Type | Default | Description | -|:--------------|:--------|:---------------|:-------------------------| -| predictionCol | Integer | `"prediction"` | Predicted cluster center | +| Param name | Type | Default | Description | +|:--------------|:--------|:---------------|:--------------------------| +| predictionCol | Integer | `"prediction"` | Predicted cluster center. | ### Parameters diff --git a/docs/content/docs/operators/evaluation/binaryclassificationevaluator.md b/docs/content/docs/operators/evaluation/binaryclassificationevaluator.md index a30189a..0fd657e 100644 --- a/docs/content/docs/operators/evaluation/binaryclassificationevaluator.md +++ b/docs/content/docs/operators/evaluation/binaryclassificationevaluator.md @@ -35,29 +35,29 @@ predictions, scores, or label probabilities). The output may contain different metrics defined by the parameter `MetricsNames`. ### Input Columns -| Param name | Type | Default | Description | -| :--------------- | :------------ | :-------------- | :------------------------ | -| labelCol | Number | `"label"` | The label of this entry | -| rawPredictionCol | Vector/Number | `rawPrediction` | The raw prediction result | -| weightCol | Number | `null` | The weight of this entry | +| Param name | Type | Default | Description | +| :--------------- | :------------ | :-------------- |:---------------------------| +| labelCol | Number | `"label"` | The label of this entry. | +| rawPredictionCol | Vector/Number | `rawPrediction` | The raw prediction result. | +| weightCol | Number | `null` | The weight of this entry. | ### Output Columns -| Column name | Type | Description | -| ----------------- | ------ | ------------------------------------------------------------ | -| "areaUnderROC" | Double | the area under the receiver operating characteristic (ROC) curve | -| "areaUnderPR" | Double | the area under the precision-recall curve | -| "areaUnderLorenz" | Double | Kolmogorov-Smirnov, measures the ability of the model to separate positive and negative samples | -| "ks" | Double | the area under the lorenz curve | +| Column name | Type | Description | +| ----------------- | ------ |--------------------------------------------------------------------------------------------------| +| "areaUnderROC" | Double | The area under the receiver operating characteristic (ROC) curve. | +| "areaUnderPR" | Double | The area under the precision-recall curve. | +| "areaUnderLorenz" | Double | Kolmogorov-Smirnov, measures the ability of the model to separate positive and negative samples. | +| "ks" | Double | The area under the lorenz curve. | ### Parameters -| Key | Default | Type | Required | Description | -| ---------------- | ------------------------------------------------------------ | ------------ | -------- | ---------------------------- | -| labelCol | `"label"` | String | no | Label column name. | -| weightCol | `null` | String | no | Weight column name. | -| rawPredictionCol | `"rawPrediction"` | String | no | Raw prediction column name. | -| metricsNames | `[BinaryClassificationEvaluatorParams.AREA_UNDER_ROC, BinaryClassificationEvaluatorParams.AREA_UNDER_PR]` | String Array | no | Names of the output metrics. | +| Key | Default | Type | Required | Description | +|------------------|-----------------------------------|----------|----------|--------------------------------------------------------------------------------------------------------| +| labelCol | `"label"` | String | no | Label column name. | +| weightCol | `null` | String | no | Weight column name. | +| rawPredictionCol | `"rawPrediction"` | String | no | Raw prediction column name. | +| metricsNames | `["areaUnderROC", "areaUnderPR"]` | String[] | no | Names of the output metrics. Supported values: 'areaUnderROC', 'areaUnderPR', 'areaUnderLorenz', 'ks'. | ### Examples diff --git a/docs/content/docs/operators/feature/binarizer.md b/docs/content/docs/operators/feature/binarizer.md new file mode 100644 index 0000000..74388b4 --- /dev/null +++ b/docs/content/docs/operators/feature/binarizer.md @@ -0,0 +1,183 @@ +--- +title: "Binarizer" +weight: 1 +type: docs +aliases: +- /operators/feature/binarizer.html +--- + +<!-- +Licensed to the Apache Software Foundation (ASF) under one +or more contributor license agreements. See the NOTICE file +distributed with this work for additional information +regarding copyright ownership. The ASF licenses this file +to you under the Apache License, Version 2.0 (the +"License"); you may not use this file except in compliance +with the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, +software distributed under the License is distributed on an +"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +KIND, either express or implied. See the License for the +specific language governing permissions and limitations +under the License. +--> + +## Binarizer + +Binarizer binarizes the columns of continuous features by the given thresholds. +The continuous features may be DenseVector, SparseVector, or Numerical Value. + +### Input Columns + +| Param name | Type | Default | Description | +|:-----------|:--------------|:--------|:--------------------------------| +| inputCols | Number/Vector | `null` | Number/Vectors to be binarized. | + +### Output Columns + +| Param name | Type | Default | Description | +|:-----------|:--------------|:--------|:--------------------------| +| outputCols | Number/Vector | `null` | Binarized Number/Vectors. | + +### Parameters + +| Key | Default | Type | Required | Description | +|-------------|-----------|----------|----------|------------------------------------------------------| +| inputCols | `null` | String[] | yes | Input column names. | +| outputCols | `null` | String[] | yes | Output column name. | +| thresholds | `null` | Double[] | yes | The thresholds used to binarize continuous features. | + +### Examples + +{{< tabs examples >}} + +{{< tab "Java">}} + +```java +import org.apache.flink.ml.feature.binarizer.Binarizer; +import org.apache.flink.ml.linalg.Vectors; +import org.apache.flink.streaming.api.datastream.DataStream; +import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment; +import org.apache.flink.table.api.Table; +import org.apache.flink.table.api.bridge.java.StreamTableEnvironment; +import org.apache.flink.types.Row; +import org.apache.flink.util.CloseableIterator; + +import java.util.Arrays; + +/** Simple program that creates a Binarizer instance and uses it for feature engineering. */ +public class BinarizerExample { + public static void main(String[] args) { + StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment(); + StreamTableEnvironment tEnv = StreamTableEnvironment.create(env); + + // Generates input data. + DataStream<Row> inputStream = + env.fromElements( + Row.of( + 1, + Vectors.dense(1, 2), + Vectors.sparse( + 17, new int[] {0, 3, 9}, new double[] {1.0, 2.0, 7.0})), + Row.of( + 2, + Vectors.dense(2, 1), + Vectors.sparse( + 17, new int[] {0, 2, 14}, new double[] {5.0, 4.0, 1.0})), + Row.of( + 3, + Vectors.dense(5, 18), + Vectors.sparse( + 17, new int[] {0, 11, 12}, new double[] {2.0, 4.0, 4.0}))); + + Table inputTable = tEnv.fromDataStream(inputStream).as("f0", "f1", "f2"); + + // Creates a Binarizer object and initializes its parameters. + Binarizer binarizer = + new Binarizer() + .setInputCols("f0", "f1", "f2") + .setOutputCols("of0", "of1", "of2") + .setThresholds(0.0, 0.0, 0.0); + + // Transforms input data. + Table outputTable = binarizer.transform(inputTable)[0]; + + // Extracts and displays the results. + for (CloseableIterator<Row> it = outputTable.execute().collect(); it.hasNext(); ) { + Row row = it.next(); + + Object[] inputValues = new Object[binarizer.getInputCols().length]; + Object[] outputValues = new Object[binarizer.getInputCols().length]; + for (int i = 0; i < inputValues.length; i++) { + inputValues[i] = row.getField(binarizer.getInputCols()[i]); + outputValues[i] = row.getField(binarizer.getOutputCols()[i]); + } + + System.out.printf( + "Input Values: %s\tOutput Values: %s\n", + Arrays.toString(inputValues), Arrays.toString(outputValues)); + } + } +} + +``` + +{{< /tab>}} + +{{< tab "Python">}} + +```python +# Simple program that creates a Binarizer instance and uses it for feature +# engineering. + +from pyflink.common import Types +from pyflink.datastream import StreamExecutionEnvironment +from pyflink.ml.core.linalg import Vectors, DenseVectorTypeInfo +from pyflink.ml.lib.feature.binarizer import Binarizer +from pyflink.table import StreamTableEnvironment + +# create a new StreamExecutionEnvironment +env = StreamExecutionEnvironment.get_execution_environment() + +# create a StreamTableEnvironment +t_env = StreamTableEnvironment.create(env) + +# generate input data +input_data_table = t_env.from_data_stream( + env.from_collection([ + (1, + Vectors.dense(3, 4)), + (2, + Vectors.dense(6, 2)) + ], + type_info=Types.ROW_NAMED( + ['f0', 'f1'], + [Types.INT(), DenseVectorTypeInfo()]))) + +# create an binarizer object and initialize its parameters +binarizer = Binarizer() \ + .set_input_cols('f0', 'f1') \ + .set_output_cols('of0', 'of1') \ + .set_thresholds(1.5, 3.5) + +# use the binarizer for feature engineering +output = binarizer.transform(input_data_table)[0] + +# extract and display the results +field_names = output.get_schema().get_field_names() +input_values = [None for _ in binarizer.get_input_cols()] +output_values = [None for _ in binarizer.get_output_cols()] +for result in t_env.to_data_stream(output).execute_and_collect(): + for i in range(len(binarizer.get_input_cols())): + input_values[i] = result[field_names.index(binarizer.get_input_cols()[i])] + output_values[i] = result[field_names.index(binarizer.get_output_cols()[i])] + print('Input Values: ' + str(input_values) + '\tOutput Values: ' + str(output_values)) + +``` + +{{< /tab>}} + +{{< /tabs>}} diff --git a/docs/content/docs/operators/feature/bucketizer.md b/docs/content/docs/operators/feature/bucketizer.md index 9430691..c898ab3 100644 --- a/docs/content/docs/operators/feature/bucketizer.md +++ b/docs/content/docs/operators/feature/bucketizer.md @@ -32,24 +32,24 @@ multiple columns of discrete features, i.e., buckets indices. The indices are in [0, numSplitsInThisColumn - 1]. ### Input Columns -| Param name | Type | Default | Description | -| :--------- | :----- | :------ | :----------------------------------- | -| inputCols | Number | `null` | Continuous features to be bucketized | +| Param name | Type | Default | Description | +|:-----------|:-------|:--------|:--------------------------------------| +| inputCols | Number | `null` | Continuous features to be bucketized. | ### Output Columns -| Param name | Type | Default | Description | -| :--------- | :----- | :------ | :--------------------------- | -| outputCols | Double | `null` | Discrete bucketized features | +| Param name | Type | Default | Description | +|:-----------|:-------|:--------|:----------------------| +| outputCols | Double | `null` | Discretized features. | ### Parameters -| Key | Default | Type | Required | Description | -| ------------- | -------------------------------- | ----------- | -------- | ------------------------------------------------------------ | -| inputCols | `null` | String | yes | Input column names. | -| outputCols | `null` | String | yes | Output column names. | -| handleInvalid | `HasHandleInvalid.ERROR_INVALID` | String | No | Strategy to handle invalid entries. | -| splitsArray | `null` | Double\[][] | yes | Array of split points for mapping continuous features into buckets. | +| Key | Default | Type | Required | Description | +|---------------|-----------|-------------|----------|--------------------------------------------------------------------------------| +| inputCols | `null` | String[] | yes | Input column names. | +| outputCols | `null` | String[] | yes | Output column names. | +| handleInvalid | `"error"` | String | no | Strategy to handle invalid entries. Supported values: 'error', 'skip', 'keep'. | +| splitsArray | `null` | Double\[][] | yes | Array of split points for mapping continuous features into buckets. | ### Examples diff --git a/docs/content/docs/operators/feature/elementwiseproduct.md b/docs/content/docs/operators/feature/elementwiseproduct.md new file mode 100644 index 0000000..1d33634 --- /dev/null +++ b/docs/content/docs/operators/feature/elementwiseproduct.md @@ -0,0 +1,157 @@ +--- +title: "Elementwise Product" +weight: 1 +type: docs +aliases: +- /operators/feature/elementwiseproduct.html +--- + +<!-- +Licensed to the Apache Software Foundation (ASF) under one +or more contributor license agreements. See the NOTICE file +distributed with this work for additional information +regarding copyright ownership. The ASF licenses this file +to you under the Apache License, Version 2.0 (the +"License"); you may not use this file except in compliance +with the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, +software distributed under the License is distributed on an +"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +KIND, either express or implied. See the License for the +specific language governing permissions and limitations +under the License. +--> + +## Elementwise Product + +Elementwise Product multiplies each input vector with a given scaling vector using +Hadamard product. If the size of the input vector does not equal the size of the +scaling vector, the transformer will throw an IllegalArgumentException. + +### Input Columns + +| Param name | Type | Default | Description | +|:-----------|:-------|:----------|:-----------------------| +| inputCol | Vector | `"input"` | Features to be scaled. | + +### Output Columns + +| Param name | Type | Default | Description | +|:-----------|:-------|:-----------|:-----------------| +| outputCol | Vector | `"output"` | Scaled features. | + +### Parameters + +| Key | Default | Type | Required | Description | +|------------|------------|--------|----------|---------------------| +| inputCol | `"input"` | String | no | Input column name. | +| outputCol | `"output"` | String | no | Output column name. | +| scalingVec | `null` | String | yes | The scaling vector. | +### Examples + +{{< tabs examples >}} + +{{< tab "Java">}} + +```java +import org.apache.flink.ml.feature.elementwiseproduct.ElementwiseProduct; +import org.apache.flink.ml.linalg.Vector; +import org.apache.flink.ml.linalg.Vectors; +import org.apache.flink.streaming.api.datastream.DataStream; +import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment; +import org.apache.flink.table.api.Table; +import org.apache.flink.table.api.bridge.java.StreamTableEnvironment; +import org.apache.flink.types.Row; +import org.apache.flink.util.CloseableIterator; + +/** + * Simple program that creates an ElementwiseProduct instance and uses it for feature engineering. + */ +public class ElementwiseProductExample { + public static void main(String[] args) { + StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment(); + StreamTableEnvironment tEnv = StreamTableEnvironment.create(env); + + // Generates input data. + DataStream<Row> inputStream = + env.fromElements( + Row.of(0, Vectors.dense(1.1, 3.2)), Row.of(1, Vectors.dense(2.1, 3.1))); + + Table inputTable = tEnv.fromDataStream(inputStream).as("id", "vec"); + + // Creates an ElementwiseProduct object and initializes its parameters. + ElementwiseProduct elementwiseProduct = + new ElementwiseProduct() + .setInputCol("vec") + .setOutputCol("outputVec") + .setScalingVec(Vectors.dense(1.1, 1.1)); + + // Transforms input data. + Table outputTable = elementwiseProduct.transform(inputTable)[0]; + + // Extracts and displays the results. + for (CloseableIterator<Row> it = outputTable.execute().collect(); it.hasNext(); ) { + Row row = it.next(); + Vector inputValue = (Vector) row.getField(elementwiseProduct.getInputCol()); + Vector outputValue = (Vector) row.getField(elementwiseProduct.getOutputCol()); + System.out.printf("Input Value: %s \tOutput Value: %s\n", inputValue, outputValue); + } + } +} + +``` + +{{< /tab>}} + +{{< tab "Python">}} + +```python +# Simple program that creates an ElementwiseProduct instance and uses it for feature +# engineering. + +from pyflink.common import Types +from pyflink.datastream import StreamExecutionEnvironment +from pyflink.ml.core.linalg import Vectors, DenseVectorTypeInfo +from pyflink.ml.lib.feature.elementwiseproduct import ElementwiseProduct +from pyflink.table import StreamTableEnvironment + +# create a new StreamExecutionEnvironment +env = StreamExecutionEnvironment.get_execution_environment() + +# create a StreamTableEnvironment +t_env = StreamTableEnvironment.create(env) + +# generate input data +input_data_table = t_env.from_data_stream( + env.from_collection([ + (1, Vectors.dense(2.1, 3.1)), + (2, Vectors.dense(1.1, 3.3)) + ], + type_info=Types.ROW_NAMED( + ['id', 'vec'], + [Types.INT(), DenseVectorTypeInfo()]))) + +# create an elementwise product object and initialize its parameters +elementwise_product = ElementwiseProduct() \ + .set_input_col('vec') \ + .set_output_col('output_vec') \ + .set_scaling_vec(Vectors.dense(1.1, 1.1)) + +# use the elementwise product object for feature engineering +output = elementwise_product.transform(input_data_table)[0] + +# extract and display the results +field_names = output.get_schema().get_field_names() +for result in t_env.to_data_stream(output).execute_and_collect(): + input_value = result[field_names.index(elementwise_product.get_input_col())] + output_value = result[field_names.index(elementwise_product.get_output_col())] + print('Input Value: ' + str(input_value) + '\tOutput Value: ' + str(output_value)) + +``` + +{{< /tab>}} + +{{< /tabs>}} diff --git a/docs/content/docs/operators/feature/featurehasher.md b/docs/content/docs/operators/feature/featurehasher.md new file mode 100644 index 0000000..78ade1f --- /dev/null +++ b/docs/content/docs/operators/feature/featurehasher.md @@ -0,0 +1,177 @@ +--- +title: "Feature Hasher" +weight: 1 +type: docs +aliases: +- /operators/feature/featurehasher.html +--- + +<!-- +Licensed to the Apache Software Foundation (ASF) under one +or more contributor license agreements. See the NOTICE file +distributed with this work for additional information +regarding copyright ownership. The ASF licenses this file +to you under the Apache License, Version 2.0 (the +"License"); you may not use this file except in compliance +with the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, +software distributed under the License is distributed on an +"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +KIND, either express or implied. See the License for the +specific language governing permissions and limitations +under the License. +--> + +## Feature Hasher + +Feature Hasher transforms a set of categorical or numerical features into a sparse vector of +a specified dimension. The rules of hashing categorical columns and numerical columns are as +follows: + +<ul> +<li>For numerical columns, the index of this feature in the output vector is the hash value of + the column name and its correponding value is the same as the input. +<li>For categorical columns, the index of this feature in the output vector is the hash value + of the string "column_name=value" and the corresponding value is 1.0. +</ul> + +<p>If multiple features are projected into the same column, the output values are accumulated. +For the hashing trick, see https://en.wikipedia.org/wiki/Feature_hashing for details. + +### Input Columns + +| Param name | Type | Default | Description | +|:-----------|:----------------------|:--------|:----------------------| +| inputCols | Number/String/Boolean | `null` | Columns to be hashed. | + +### Output Columns + +| Param name | Type | Default | Description | +|:-----------|:-------|:-----------|:---------------| +| outputCol | Vector | `"output"` | Output vector. | + +### Parameters + +| Key | Default | Type | Required | Description | +|-----------------|------------|-----------|----------|---------------------------| +| inputCols | `null` | String[] | yes | Input column names. | +| outputCol | `"output"` | String | no | Output column name. | +| categoricalCols | `[]` | String[] | no | Categorical column names. | +| numFeatures | `262144` | Integer | no | The number of features. | +### Examples + +{{< tabs examples >}} + +{{< tab "Java">}} + +```java +import org.apache.flink.ml.feature.featurehasher.FeatureHasher; +import org.apache.flink.ml.linalg.Vector; +import org.apache.flink.streaming.api.datastream.DataStream; +import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment; +import org.apache.flink.table.api.Table; +import org.apache.flink.table.api.bridge.java.StreamTableEnvironment; +import org.apache.flink.types.Row; +import org.apache.flink.util.CloseableIterator; + +import java.util.Arrays; + +/** Simple program that creates a FeatureHasher instance and uses it for feature engineering. */ +public class FeatureHasherExample { + public static void main(String[] args) { + + StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment(); + StreamTableEnvironment tEnv = StreamTableEnvironment.create(env); + + // Generates input data. + DataStream<Row> dataStream = + env.fromCollection( + Arrays.asList(Row.of(0, "a", 1.0, true), Row.of(1, "c", 1.0, false))); + Table inputDataTable = tEnv.fromDataStream(dataStream).as("id", "f0", "f1", "f2"); + + // Creates a FeatureHasher object and initializes its parameters. + FeatureHasher featureHash = + new FeatureHasher() + .setInputCols("f0", "f1", "f2") + .setCategoricalCols("f0", "f2") + .setOutputCol("vec") + .setNumFeatures(1000); + + // Uses the FeatureHasher object for feature transformations. + Table outputTable = featureHash.transform(inputDataTable)[0]; + + // Extracts and displays the results. + for (CloseableIterator<Row> it = outputTable.execute().collect(); it.hasNext(); ) { + Row row = it.next(); + + Object[] inputValues = new Object[featureHash.getInputCols().length]; + for (int i = 0; i < inputValues.length; i++) { + inputValues[i] = row.getField(featureHash.getInputCols()[i]); + } + Vector outputValue = (Vector) row.getField(featureHash.getOutputCol()); + + System.out.printf( + "Input Values: %s \tOutput Value: %s\n", + Arrays.toString(inputValues), outputValue); + } + } +} + +``` + +{{< /tab>}} + +{{< tab "Python">}} + +```python +# Simple program that creates a FeatureHasher instance and uses it for feature +# engineering. + +from pyflink.common import Types +from pyflink.datastream import StreamExecutionEnvironment +from pyflink.ml.lib.feature.featurehasher import FeatureHasher +from pyflink.table import StreamTableEnvironment + +# create a new StreamExecutionEnvironment +env = StreamExecutionEnvironment.get_execution_environment() + +# create a StreamTableEnvironment +t_env = StreamTableEnvironment.create(env) + +# generate input data +input_data_table = t_env.from_data_stream( + env.from_collection([ + (0, 'a', 1.0, True), + (1, 'c', 1.0, False), + ], + type_info=Types.ROW_NAMED( + ['id', 'f0', 'f1', 'f2'], + [Types.INT(), Types.STRING(), Types.DOUBLE(), Types.BOOLEAN()]))) + +# create a feature hasher object and initialize its parameters +feature_hasher = FeatureHasher() \ + .set_input_cols('f0', 'f1', 'f2') \ + .set_categorical_cols('f0', 'f2') \ + .set_output_col('vec') \ + .set_num_features(1000) + +# use the feature hasher for feature engineering +output = feature_hasher.transform(input_data_table)[0] + +# extract and display the results +field_names = output.get_schema().get_field_names() +input_values = [None for _ in feature_hasher.get_input_cols()] +for result in t_env.to_data_stream(output).execute_and_collect(): + for i in range(len(feature_hasher.get_input_cols())): + input_values[i] = result[field_names.index(feature_hasher.get_input_cols()[i])] + output_value = result[field_names.index(feature_hasher.get_output_col())] + print('Input Values: ' + str(input_values) + '\tOutput Value: ' + str(output_value)) + +``` + +{{< /tab>}} + +{{< /tabs>}} diff --git a/docs/content/docs/operators/feature/hashingtf.md b/docs/content/docs/operators/feature/hashingtf.md new file mode 100644 index 0000000..088176a --- /dev/null +++ b/docs/content/docs/operators/feature/hashingtf.md @@ -0,0 +1,165 @@ +--- +title: "HashingTF" +weight: 1 +type: docs +aliases: +- /operators/feature/hashingtf.html +--- + +<!-- +Licensed to the Apache Software Foundation (ASF) under one +or more contributor license agreements. See the NOTICE file +distributed with this work for additional information +regarding copyright ownership. The ASF licenses this file +to you under the Apache License, Version 2.0 (the +"License"); you may not use this file except in compliance +with the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, +software distributed under the License is distributed on an +"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +KIND, either express or implied. See the License for the +specific language governing permissions and limitations +under the License. +--> + +## HashingTF + +HashingTF maps a sequence of terms(strings, numbers, booleans) +to a sparse vector with a specified dimension using the hashing +trick. If multiple features are projected into the same column, +the output values are accumulated by default. + +### Input Columns + +| Param name | Type | Default | Description | +|:-----------|:----------------------------------------------|:----------|:-------------------------| +| inputCol | List/Array of primitive data types or strings | `"input"` | Input sequence of terms. | + +### Output Columns + +| Param name | Type | Default | Description | +|:-----------|:-------------|:-----------|:----------------------| +| outputCol | SparseVector | `"output"` | Output sparse vector. | + +### Parameters + +| Key | Default | Type | Required | Description | +|:------------|:-----------|:--------|:---------|:--------------------------------------------------------------------| +| binary | `false` | Boolean | no | Whether each dimension of the output vector is binary or not. | +| inputCol | `"input"` | String | no | Input column name. | +| outputCol | `"output"` | String | no | Output column name. | +| numFeatures | `262144` | Integer | no | The number of features. It will be the length of the output vector. | + + +### Examples + +{{< tabs examples >}} + +{{< tab "Java">}} + +```java + +import org.apache.flink.ml.feature.hashingtf.HashingTF; +import org.apache.flink.ml.linalg.SparseVector; +import org.apache.flink.streaming.api.datastream.DataStream; +import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment; +import org.apache.flink.table.api.Table; +import org.apache.flink.table.api.bridge.java.StreamTableEnvironment; +import org.apache.flink.types.Row; +import org.apache.flink.util.CloseableIterator; + +import java.util.Arrays; +import java.util.List; + +/** Simple program that creates a HashingTF instance and uses it for feature engineering. */ +public class HashingTFExample { + public static void main(String[] args) { + StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment(); + StreamTableEnvironment tEnv = StreamTableEnvironment.create(env); + + // Generates input data. + DataStream<Row> inputStream = + env.fromElements( + Row.of( + Arrays.asList( + "HashingTFTest", "Hashing", "Term", "Frequency", "Test")), + Row.of( + Arrays.asList( + "HashingTFTest", "Hashing", "Hashing", "Test", "Test"))); + + Table inputTable = tEnv.fromDataStream(inputStream).as("input"); + + // Creates a HashingTF object and initializes its parameters. + HashingTF hashingTF = + new HashingTF().setInputCol("input").setOutputCol("output").setNumFeatures(128); + + // Uses the HashingTF object for feature transformations. + Table outputTable = hashingTF.transform(inputTable)[0]; + + // Extracts and displays the results. + for (CloseableIterator<Row> it = outputTable.execute().collect(); it.hasNext(); ) { + Row row = it.next(); + + List<Object> inputValue = (List<Object>) row.getField(hashingTF.getInputCol()); + SparseVector outputValue = (SparseVector) row.getField(hashingTF.getOutputCol()); + + System.out.printf( + "Input Value: %s \tOutput Value: %s\n", + Arrays.toString(inputValue.stream().toArray()), outputValue); + } + } +} + +``` + +{{< /tab>}} + +{{< tab "Python">}} + +```python +# Simple program that creates a HashingTF instance and uses it for feature +# engineering. + +from pyflink.common import Types +from pyflink.datastream import StreamExecutionEnvironment +from pyflink.ml.lib.feature.hashingtf import HashingTF +from pyflink.table import StreamTableEnvironment + +env = StreamExecutionEnvironment.get_execution_environment() + +t_env = StreamTableEnvironment.create(env) + +# Generates input data. +input_data_table = t_env.from_data_stream( + env.from_collection([ + (['HashingTFTest', 'Hashing', 'Term', 'Frequency', 'Test'],), + (['HashingTFTest', 'Hashing', 'Hashing', 'Test', 'Test'],), + ], + type_info=Types.ROW_NAMED( + ["input", ], + [Types.OBJECT_ARRAY(Types.STRING())]))) + +# Creates a HashingTF object and initializes its parameters. +hashing_tf = HashingTF() \ + .set_input_col('input') \ + .set_num_features(128) \ + .set_output_col('output') + +# Uses the HashingTF object for feature transformations. +output = hashing_tf.transform(input_data_table)[0] + +# Extracts and displays the results. +field_names = output.get_schema().get_field_names() +for result in t_env.to_data_stream(output).execute_and_collect(): + input_value = result[field_names.index(hashing_tf.get_input_col())] + output_value = result[field_names.index(hashing_tf.get_output_col())] + print('Input Value: ' + ' '.join(input_value) + '\tOutput Value: ' + str(output_value)) + +``` + +{{< /tab>}} + +{{< /tabs>}} diff --git a/docs/content/docs/operators/feature/interaction.md b/docs/content/docs/operators/feature/interaction.md new file mode 100644 index 0000000..bc109f4 --- /dev/null +++ b/docs/content/docs/operators/feature/interaction.md @@ -0,0 +1,169 @@ +--- +title: "Interaction" +weight: 1 +type: docs +aliases: +- /operators/feature/interaction.html +--- + +<!-- +Licensed to the Apache Software Foundation (ASF) under one +or more contributor license agreements. See the NOTICE file +distributed with this work for additional information +regarding copyright ownership. The ASF licenses this file +to you under the Apache License, Version 2.0 (the +"License"); you may not use this file except in compliance +with the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, +software distributed under the License is distributed on an +"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +KIND, either express or implied. See the License for the +specific language governing permissions and limitations +under the License. +--> + +## Interaction + +Interaction takes vector or numerical columns, and generates a single vector column that contains +the product of all combinations of one value from each input column. + +For example, when the input feature values are Double(2) and Vector(3, 4), the output would be +Vector(6, 8). When the input feature values are Vector(1, 2) and Vector(3, 4), the output would +be Vector(3, 4, 6, 8). If you change the position of these two input Vectors, the output would +be Vector(3, 6, 4, 8). + +### Input Columns + +| Param name | Type | Default | Description | +|:-----------|:-------|:--------|:--------------------------| +| inputCols | Vector | `null` | Columns to be interacted. | + +### Output Columns + +| Param name | Type | Default | Description | +|:-----------|:-------|:-----------|:-------------------| +| outputCol | Vector | `"output"` | Interacted vector. | + +### Parameters + +| Key | Default | Type | Required | Description | +|-----------------|------------|-----------|----------|----------------------------| +| inputCols | `null` | String[] | yes | Input column names. | +| outputCol | `"output"` | String | no | Output column name. | + +### Examples + +{{< tabs examples >}} + +{{< tab "Java">}} + +```java +import org.apache.flink.ml.feature.interaction.Interaction; +import org.apache.flink.ml.linalg.Vector; +import org.apache.flink.ml.linalg.Vectors; +import org.apache.flink.streaming.api.datastream.DataStream; +import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment; +import org.apache.flink.table.api.Table; +import org.apache.flink.table.api.bridge.java.StreamTableEnvironment; +import org.apache.flink.types.Row; +import org.apache.flink.util.CloseableIterator; + +import java.util.Arrays; + +/** Simple program that creates an Interaction instance and uses it for feature engineering. */ +public class InteractionExample { + public static void main(String[] args) { + StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment(); + StreamTableEnvironment tEnv = StreamTableEnvironment.create(env); + + // Generates input data. + DataStream<Row> inputStream = + env.fromElements( + Row.of(0, Vectors.dense(1.1, 3.2), Vectors.dense(2, 3)), + Row.of(1, Vectors.dense(2.1, 3.1), Vectors.dense(1, 3))); + + Table inputTable = tEnv.fromDataStream(inputStream).as("f0", "f1", "f2"); + + // Creates an Interaction object and initializes its parameters. + Interaction interaction = + new Interaction().setInputCols("f0", "f1", "f2").setOutputCol("outputVec"); + + // Transforms input data. + Table outputTable = interaction.transform(inputTable)[0]; + + // Extracts and displays the results. + for (CloseableIterator<Row> it = outputTable.execute().collect(); it.hasNext(); ) { + Row row = it.next(); + Object[] inputValues = new Object[interaction.getInputCols().length]; + for (int i = 0; i < inputValues.length; i++) { + inputValues[i] = row.getField(interaction.getInputCols()[i]); + } + Vector outputValue = (Vector) row.getField(interaction.getOutputCol()); + System.out.printf( + "Input Values: %s \tOutput Value: %s\n", + Arrays.toString(inputValues), outputValue); + } + } +} + +``` + +{{< /tab>}} + +{{< tab "Python">}} + +```python +# Simple program that creates an Interaction instance and uses it for feature +# engineering. + +from pyflink.common import Types +from pyflink.datastream import StreamExecutionEnvironment +from pyflink.ml.core.linalg import Vectors, DenseVectorTypeInfo +from pyflink.ml.lib.feature.interaction import Interaction +from pyflink.table import StreamTableEnvironment + +# create a new StreamExecutionEnvironment +env = StreamExecutionEnvironment.get_execution_environment() + +# create a StreamTableEnvironment +t_env = StreamTableEnvironment.create(env) + +# generate input data +input_data_table = t_env.from_data_stream( + env.from_collection([ + (1, + Vectors.dense(1, 2), + Vectors.dense(3, 4)), + (2, + Vectors.dense(2, 8), + Vectors.dense(3, 4)) + ], + type_info=Types.ROW_NAMED( + ['f0', 'f1', 'f2'], + [Types.INT(), DenseVectorTypeInfo(), DenseVectorTypeInfo()]))) + +# create an interaction object and initialize its parameters +interaction = Interaction() \ + .set_input_cols('f0', 'f1', 'f2') \ + .set_output_col('interaction_vec') + +# use the interaction for feature engineering +output = interaction.transform(input_data_table)[0] + +# extract and display the results +field_names = output.get_schema().get_field_names() +input_values = [None for _ in interaction.get_input_cols()] +for result in t_env.to_data_stream(output).execute_and_collect(): + for i in range(len(interaction.get_input_cols())): + input_values[i] = result[field_names.index(interaction.get_input_cols()[i])] + output_value = result[field_names.index(interaction.get_output_col())] + print('Input Values: ' + str(input_values) + '\tOutput Value: ' + str(output_value)) + +``` + +{{< /tab>}} + +{{< /tabs>}} diff --git a/docs/content/docs/operators/feature/minmaxscaler.md b/docs/content/docs/operators/feature/maxabsscaler.md similarity index 69% copy from docs/content/docs/operators/feature/minmaxscaler.md copy to docs/content/docs/operators/feature/maxabsscaler.md index 8b1ff6c..b57490e 100644 --- a/docs/content/docs/operators/feature/minmaxscaler.md +++ b/docs/content/docs/operators/feature/maxabsscaler.md @@ -1,9 +1,9 @@ --- -title: "Min Max Scaler" +title: "Max Abs Scaler" weight: 1 type: docs aliases: -- /operators/feature/minmaxscaler.html +- /operators/feature/maxabsscaler.html --- <!-- @@ -25,30 +25,30 @@ specific language governing permissions and limitations under the License. --> -## Min Max Scaler +## Max Abs Scaler + +Max Abs Scaler is an algorithm rescales feature values to the range [-1, 1] +by dividing through the largest maximum absolute value in each feature. +It does not shift/center the data and thus does not destroy any sparsity. -Min Max Scaler is an algorithm that rescales feature values to a common range -[min, max] which defined by user. ### Input Columns -| Param name | Type | Default | Description | -| :--------- | :----- | :-------- | :-------------------- | -| inputCol | Vector | `"input"` | features to be scaled | +| Param name | Type | Default | Description | +|:-----------|:-------|:----------|:-----------------------| +| inputCol | Vector | `"input"` | Features to be scaled. | ### Output Columns -| Param name | Type | Default | Description | -| :--------- | :----- | :--------- | :-------------- | -| outputCol | Vector | `"output"` | scaled features | +| Param name | Type | Default | Description | +|:-----------|:-------|:-----------|:-----------------| +| outputCol | Vector | `"output"` | Scaled features. | ### Parameters -| Key | Default | Type | Required | Description | -| --------- | ---------- | ------ | -------- | ---------------------------------------- | -| inputCol | `"input"` | String | no | Input column name. | -| outputCol | `"output"` | String | no | Output column name. | -| min | `0.0` | Double | no | Lower bound of the output feature range. | -| max | `1.0` | Double | no | Upper bound of the output feature range. | +| Key | Default | Type | Required | Description | +|-----------|------------|--------|----------|---------------------| +| inputCol | `"input"` | String | no | Input column name. | +| outputCol | `"output"` | String | no | Output column name. | ### Examples @@ -57,8 +57,8 @@ Min Max Scaler is an algorithm that rescales feature values to a common range {{< tab "Java">}} ```java -import org.apache.flink.ml.feature.minmaxscaler.MinMaxScaler; -import org.apache.flink.ml.feature.minmaxscaler.MinMaxScalerModel; +import org.apache.flink.ml.feature.maxabsscaler.MaxAbsScaler; +import org.apache.flink.ml.feature.maxabsscaler.MaxAbsScalerModel; import org.apache.flink.ml.linalg.DenseVector; import org.apache.flink.ml.linalg.Vectors; import org.apache.flink.streaming.api.datastream.DataStream; @@ -68,8 +68,8 @@ import org.apache.flink.table.api.bridge.java.StreamTableEnvironment; import org.apache.flink.types.Row; import org.apache.flink.util.CloseableIterator; -/** Simple program that trains a MinMaxScaler model and uses it for feature engineering. */ -public class MinMaxScalerExample { +/** Simple program that trains a MaxAbsScaler model and uses it for feature engineering. */ +public class MaxAbsScalerExample { public static void main(String[] args) { StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment(); StreamTableEnvironment tEnv = StreamTableEnvironment.create(env); @@ -91,20 +91,20 @@ public class MinMaxScalerExample { Row.of(Vectors.dense(100.0, 50.0))); Table predictTable = tEnv.fromDataStream(predictStream).as("input"); - // Creates a MinMaxScaler object and initializes its parameters. - MinMaxScaler minMaxScaler = new MinMaxScaler(); + // Creates a MaxAbsScaler object and initializes its parameters. + MaxAbsScaler maxAbsScaler = new MaxAbsScaler(); - // Trains the MinMaxScaler Model. - MinMaxScalerModel minMaxScalerModel = minMaxScaler.fit(trainTable); + // Trains the MaxAbsScaler Model. + MaxAbsScalerModel maxAbsScalerModel = maxAbsScaler.fit(trainTable); - // Uses the MinMaxScaler Model for predictions. - Table outputTable = minMaxScalerModel.transform(predictTable)[0]; + // Uses the MaxAbsScaler Model for predictions. + Table outputTable = maxAbsScalerModel.transform(predictTable)[0]; // Extracts and displays the results. for (CloseableIterator<Row> it = outputTable.execute().collect(); it.hasNext(); ) { Row row = it.next(); - DenseVector inputValue = (DenseVector) row.getField(minMaxScaler.getInputCol()); - DenseVector outputValue = (DenseVector) row.getField(minMaxScaler.getOutputCol()); + DenseVector inputValue = (DenseVector) row.getField(maxAbsScaler.getInputCol()); + DenseVector outputValue = (DenseVector) row.getField(maxAbsScaler.getOutputCol()); System.out.printf("Input Value: %-15s\tOutput Value: %s\n", inputValue, outputValue); } } @@ -117,13 +117,13 @@ public class MinMaxScalerExample { {{< tab "Python">}} ```python -# Simple program that trains a MinMaxScaler model and uses it for feature +# Simple program that trains a MaxAbsScaler model and uses it for feature # engineering. from pyflink.common import Types from pyflink.datastream import StreamExecutionEnvironment from pyflink.ml.core.linalg import Vectors, DenseVectorTypeInfo -from pyflink.ml.lib.feature.minmaxscaler import MinMaxScaler +from pyflink.ml.lib.feature.maxabsscaler import MaxAbsScaler from pyflink.table import StreamTableEnvironment # create a new StreamExecutionEnvironment @@ -157,20 +157,20 @@ predict_data = t_env.from_data_stream( [DenseVectorTypeInfo()]) )) -# create a min-max-scaler object and initialize its parameters -min_max_scaler = MinMaxScaler() +# create a maxabs scaler object and initialize its parameters +max_abs_scaler = MaxAbsScaler() -# train the min-max-scaler model -model = min_max_scaler.fit(train_data) +# train the maxabs scaler model +model = max_abs_scaler.fit(train_data) -# use the min-max-scaler model for predictions +# use the maxabs scaler model for predictions output = model.transform(predict_data)[0] # extract and display the results field_names = output.get_schema().get_field_names() for result in t_env.to_data_stream(output).execute_and_collect(): - input_value = result[field_names.index(min_max_scaler.get_input_col())] - output_value = result[field_names.index(min_max_scaler.get_output_col())] + input_value = result[field_names.index(max_abs_scaler.get_input_col())] + output_value = result[field_names.index(max_abs_scaler.get_output_col())] print('Input Value: ' + str(input_value) + ' \tOutput Value: ' + str(output_value)) ``` diff --git a/docs/content/docs/operators/feature/minmaxscaler.md b/docs/content/docs/operators/feature/minmaxscaler.md index 8b1ff6c..3e7b17f 100644 --- a/docs/content/docs/operators/feature/minmaxscaler.md +++ b/docs/content/docs/operators/feature/minmaxscaler.md @@ -31,20 +31,20 @@ Min Max Scaler is an algorithm that rescales feature values to a common range [min, max] which defined by user. ### Input Columns -| Param name | Type | Default | Description | -| :--------- | :----- | :-------- | :-------------------- | -| inputCol | Vector | `"input"` | features to be scaled | +| Param name | Type | Default | Description | +|:-----------|:-------|:----------|:-----------------------| +| inputCol | Vector | `"input"` | Features to be scaled. | ### Output Columns -| Param name | Type | Default | Description | -| :--------- | :----- | :--------- | :-------------- | -| outputCol | Vector | `"output"` | scaled features | +| Param name | Type | Default | Description | +|:-----------|:-------|:-----------|:-----------------| +| outputCol | Vector | `"output"` | Scaled features. | ### Parameters | Key | Default | Type | Required | Description | -| --------- | ---------- | ------ | -------- | ---------------------------------------- | +|-----------|------------|--------|----------|------------------------------------------| | inputCol | `"input"` | String | no | Input column name. | | outputCol | `"output"` | String | no | Output column name. | | min | `0.0` | Double | no | Lower bound of the output feature range. | diff --git a/docs/content/docs/operators/feature/onehotencoder.md b/docs/content/docs/operators/feature/onehotencoder.md index 53884f4..d0344d4 100644 --- a/docs/content/docs/operators/feature/onehotencoder.md +++ b/docs/content/docs/operators/feature/onehotencoder.md @@ -37,24 +37,24 @@ vector column for each input column. ### Input Columns -| Param name | Type | Default | Description | -| :--------- | :------ | :------ | :---------- | -| inputCols | Integer | `null` | Label index | +| Param name | Type | Default | Description | +| :--------- | :------ | :------ |:-------------| +| inputCols | Integer | `null` | Label index. | ### Output Columns -| Param name | Type | Default | Description | -| :--------- | :----- | :------ | :-------------------- | -| outputCols | Vector | `null` | Encoded binary vector | +| Param name | Type | Default | Description | +| :--------- | :----- | :------ |:-----------------------| +| outputCols | Vector | `null` | Encoded binary vector. | ### Parameters -| Key | Default | Type | Required | Description | -| ------------- | -------------------------------- | ------- | -------- | ------------------------------------------------------------ | -| inputCols | `null` | String | yes | Input column names. | -| outputCols | `null` | String | yes | Output column names. | -| handleInvalid | `HasHandleInvalid.ERROR_INVALID` | String | No | Strategy to handle invalid entries. Supported values: `HasHandleInvalid.ERROR_INVALID`, `HasHandleInvalid.SKIP_INVALID` | -| dropLast | `true` | Boolean | no | Whether to drop the last category. | +| Key | Default | Type | Required | Description | +|---------------|-----------|----------|----------|--------------------------------------------------------------------------------| +| inputCols | `null` | String[] | yes | Input column names. | +| outputCols | `null` | String[] | yes | Output column names. | +| handleInvalid | `"error"` | String | no | Strategy to handle invalid entries. Supported values: 'error', 'skip', 'keep'. | +| dropLast | `true` | Boolean | no | Whether to drop the last category. | ### Examples diff --git a/docs/content/docs/operators/feature/standardscaler.md b/docs/content/docs/operators/feature/standardscaler.md index 9bf17c1..ab17f56 100644 --- a/docs/content/docs/operators/feature/standardscaler.md +++ b/docs/content/docs/operators/feature/standardscaler.md @@ -31,20 +31,20 @@ Standard Scaler is an algorithm that standardizes the input features by removing the mean and scaling each dimension to unit variance. ### Input Columns -| Param name | Type | Default | Description | -| :--------- | :----- | :-------- | :-------------------- | -| inputCol | Vector | `"input"` | features to be scaled | +| Param name | Type | Default | Description | +|:-----------|:-------|:----------|:-----------------------| +| inputCol | Vector | `"input"` | Features to be scaled. | ### Output Columns -| Param name | Type | Default | Description | -| :--------- | :----- | :--------- | :-------------- | -| outputCol | Vector | `"output"` | scaled features | +| Param name | Type | Default | Description | +|:-----------|:-------|:-----------|:-----------------| +| outputCol | Vector | `"output"` | Scaled features. | ### Parameters | Key | Default | Type | Required | Description | -| --------- | ---------- | ------- | -------- | -------------------------------------------------- | +|-----------|------------|---------|----------|----------------------------------------------------| | inputCol | `"input"` | String | no | Input column name. | | outputCol | `"output"` | String | no | Output column name. | | withMean | `false` | Boolean | no | Whether centers the data with mean before scaling. | diff --git a/docs/content/docs/operators/feature/stringindexer.md b/docs/content/docs/operators/feature/stringindexer.md index 94096ba..36b8d80 100644 --- a/docs/content/docs/operators/feature/stringindexer.md +++ b/docs/content/docs/operators/feature/stringindexer.md @@ -38,30 +38,30 @@ StringIndexerModel. ### Input Columns | Param name | Type | Default | Description | -| :--------- | :------------ | :------ | :------------------------------------- | -| inputCols | Number/String | `null` | string/numerical values to be indexed. | +| :--------- | :------------ | :------ |:---------------------------------------| +| inputCols | Number/String | `null` | String/Numerical values to be indexed. | ### Output Columns | Param name | Type | Default | Description | -| :--------- | :----- | :------ | :---------------------------------- | +|:-----------|:-------|:--------|:------------------------------------| | outputCols | Double | `null` | Indices of string/numerical values. | ### Parameters Below are the parameters required by `StringIndexerModel`. -| Key | Default | Type | Required | Description | -| ------------- | -------------------------------- | ------ | -------- | ----------------------------------- | -| inputCols | `null` | String | yes | Input column names. | -| outputCols | `null` | String | yes | Output column names. | -| handleInvalid | `HasHandleInvalid.ERROR_INVALID` | String | No | Strategy to handle invalid entries. | +| Key | Default | Type | Required | Description | +|---------------|-----------|----------|----------|--------------------------------------------------------------------------------| +| inputCols | `null` | String[] | yes | Input column names. | +| outputCols | `null` | String[] | yes | Output column names. | +| handleInvalid | `"error"` | String | no | Strategy to handle invalid entries. Supported values: 'error', 'skip', 'keep'. | `StringIndexer` needs parameters above and also below. -| Key | Default | Type | Required | Description | -| --------------- | ------------------------------------- | ------ | -------- | ------------------------------------ | -| stringOrderType | `StringIndexerParams.ARBITRARY_ORDER` | String | no | How to order strings of each column. | +| Key | Default | Type | Required | Description | +|-----------------|---------------|--------|----------|-------------------------------------------------------------------------------------------------------------------------------------| +| stringOrderType | `"arbitrary"` | String | no | How to order strings of each column. Supported values: 'arbitrary', 'frequencyDesc', 'frequencyAsc', 'alphabetDesc', 'alphabetAsc'. | ### Examples diff --git a/docs/content/docs/operators/feature/vectorassembler.md b/docs/content/docs/operators/feature/vectorassembler.md index f5af483..10f3fc3 100644 --- a/docs/content/docs/operators/feature/vectorassembler.md +++ b/docs/content/docs/operators/feature/vectorassembler.md @@ -33,22 +33,22 @@ Types of input columns must be either vector or numerical value. ### Input Columns | Param name | Type | Default | Description | -| :--------- | :------------ | :------ | :------------------------------ | +|:-----------|:--------------|:--------|:--------------------------------| | inputCols | Number/Vector | `null` | Number/Vectors to be assembled. | ### Output Columns | Param name | Type | Default | Description | -| :--------- | :----- | :--------- | :---------------- | +|:-----------|:-------|:-----------|:------------------| | outputCol | Vector | `"output"` | Assembled vector. | ### Parameters -| Key | Default | Type | Required | Description | -| ------------- | -------------------------------- | ------ | -------- | ----------------------------------- | -| inputCols | `null` | String | yes | Input column names. | -| outputCol | `"output"` | String | No | Output column name. | -| handleInvalid | `HasHandleInvalid.ERROR_INVALID` | String | No | Strategy to handle invalid entries. | +| Key | Default | Type | Required | Description | +|---------------|------------|----------|----------|--------------------------------------------------------------------------------| +| inputCols | `null` | String[] | yes | Input column names. | +| outputCol | `"output"` | String | no | Output column name. | +| handleInvalid | `"error"` | String | no | Strategy to handle invalid entries. Supported values: 'error', 'skip', 'keep'. | ### Examples diff --git a/docs/content/docs/operators/feature/vectorslicer.md b/docs/content/docs/operators/feature/vectorslicer.md new file mode 100644 index 0000000..66a921b --- /dev/null +++ b/docs/content/docs/operators/feature/vectorslicer.md @@ -0,0 +1,158 @@ +--- +title: "Vector Slicer" +weight: 1 +type: docs +aliases: +- /operators/feature/vectorslicer.html +--- + +<!-- +Licensed to the Apache Software Foundation (ASF) under one +or more contributor license agreements. See the NOTICE file +distributed with this work for additional information +regarding copyright ownership. The ASF licenses this file +to you under the Apache License, Version 2.0 (the +"License"); you may not use this file except in compliance +with the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, +software distributed under the License is distributed on an +"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +KIND, either express or implied. See the License for the +specific language governing permissions and limitations +under the License. +--> + +## Vector Slicer + +Vector Slicer transforms a vector to a new feature, which is a sub-array of the original +feature. It is useful for extracting features from a given vector. + +Note that duplicate features are not allowed, so there can be no overlap between selected +indices. If the max value of the indices is greater than the size of the input vector, +it throws an IllegalArgumentException. + +### Input Columns + +| Param name | Type | Default | Description | +|:-----------|:-------|:----------|:---------------------| +| inputCol | Vector | `"input"` | Vector to be sliced. | + +### Output Columns + +| Param name | Type | Default | Description | +|:-----------|:-------|:-----------|:---------------| +| outputCol | Vector | `"output"` | Sliced vector. | + +### Parameters + +| Key | Default | Type | Required | Description | +|-----------|------------|-----------|----------|---------------------------------------------------------------| +| inputCol | `"input"` | String | no | Input column name. | +| outputCol | `"output"` | String | no | Output column name. | +| indices | `null` | Integer[] | yes | An array of indices to select features from a vector column. | +### Examples + +{{< tabs examples >}} + +{{< tab "Java">}} + +```java +import org.apache.flink.ml.feature.vectorslicer.VectorSlicer; +import org.apache.flink.ml.linalg.Vector; +import org.apache.flink.ml.linalg.Vectors; +import org.apache.flink.streaming.api.datastream.DataStream; +import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment; +import org.apache.flink.table.api.Table; +import org.apache.flink.table.api.bridge.java.StreamTableEnvironment; +import org.apache.flink.types.Row; +import org.apache.flink.util.CloseableIterator; + +/** Simple program that creates a VectorSlicer instance and uses it for feature engineering. */ +public class VectorSlicerExample { + public static void main(String[] args) { + StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment(); + StreamTableEnvironment tEnv = StreamTableEnvironment.create(env); + + // Generates input data. + DataStream<Row> inputStream = + env.fromElements( + Row.of(Vectors.dense(2.1, 3.1, 1.2, 3.1, 4.6)), + Row.of(Vectors.dense(1.2, 3.1, 4.6, 2.1, 3.1))); + Table inputTable = tEnv.fromDataStream(inputStream).as("vec"); + + // Creates a VectorSlicer object and initializes its parameters. + VectorSlicer vectorSlicer = + new VectorSlicer().setInputCol("vec").setIndices(1, 2, 3).setOutputCol("slicedVec"); + + // Uses the VectorSlicer object for feature transformations. + Table outputTable = vectorSlicer.transform(inputTable)[0]; + + // Extracts and displays the results. + for (CloseableIterator<Row> it = outputTable.execute().collect(); it.hasNext(); ) { + Row row = it.next(); + + Vector inputValue = (Vector) row.getField(vectorSlicer.getInputCol()); + + Vector outputValue = (Vector) row.getField(vectorSlicer.getOutputCol()); + + System.out.printf("Input Value: %s \tOutput Value: %s\n", inputValue, outputValue); + } + } +} + +``` + +{{< /tab>}} + +{{< tab "Python">}} + +```python +# Simple program that creates a VectorSlicer instance and uses it for feature +# engineering. + +from pyflink.common import Types +from pyflink.datastream import StreamExecutionEnvironment +from pyflink.ml.core.linalg import Vectors, DenseVectorTypeInfo +from pyflink.ml.lib.feature.vectorslicer import VectorSlicer +from pyflink.table import StreamTableEnvironment + +# create a new StreamExecutionEnvironment +env = StreamExecutionEnvironment.get_execution_environment() + +# create a StreamTableEnvironment +t_env = StreamTableEnvironment.create(env) + +# generate input data +input_data_table = t_env.from_data_stream( + env.from_collection([ + (1, Vectors.dense(2.1, 3.1, 1.2, 2.1)), + (2, Vectors.dense(2.3, 2.1, 1.3, 1.2)), + ], + type_info=Types.ROW_NAMED( + ['id', 'vec'], + [Types.INT(), DenseVectorTypeInfo()]))) + +# create a vector slicer object and initialize its parameters +vector_slicer = VectorSlicer() \ + .set_input_col('vec') \ + .set_indices(1, 2, 3) \ + .set_output_col('sub_vec') + +# use the vector slicer model for feature engineering +output = vector_slicer.transform(input_data_table)[0] + +# extract and display the results +field_names = output.get_schema().get_field_names() +for result in t_env.to_data_stream(output).execute_and_collect(): + input_value = result[field_names.index(vector_slicer.get_input_col())] + output_value = result[field_names.index(vector_slicer.get_output_col())] + print('Input Value: ' + str(input_value) + '\tOutput Value: ' + str(output_value)) + +``` + +{{< /tab>}} + +{{< /tabs>}} diff --git a/docs/content/docs/operators/regression/linearregression.md b/docs/content/docs/operators/regression/linearregression.md index ff8c5e9..596b927 100644 --- a/docs/content/docs/operators/regression/linearregression.md +++ b/docs/content/docs/operators/regression/linearregression.md @@ -31,17 +31,17 @@ between a scalar response and one or more explanatory variables. ### Input Columns -| Param name | Type | Default | Description | -| :---------- | :------ | :----------- | :--------------- | -| featuresCol | Vector | `"features"` | Feature vector | -| labelCol | Integer | `"label"` | Label to predict | -| weightCol | Double | `"weight"` | Weight of sample | +| Param name | Type | Default | Description | +| :---------- | :------ | :----------- |:------------------| +| featuresCol | Vector | `"features"` | Feature vector. | +| labelCol | Integer | `"label"` | Label to predict. | +| weightCol | Double | `"weight"` | Weight of sample. | ### Output Columns -| Param name | Type | Default | Description | -| :------------ | :------ | :------------- | :--------------------------- | -| predictionCol | Integer | `"prediction"` | Label of the max probability | +| Param name | Type | Default | Description | +| :------------ | :------ | :------------- |:------------------------------| +| predictionCol | Integer | `"prediction"` | Label of the max probability. | ### Parameters