jiangxin369 commented on code in PR #174:
URL: https://github.com/apache/flink-ml/pull/174#discussion_r1028782050
##########
docs/content/docs/operators/feature/countvectorizer.md:
##########
@@ -0,0 +1,182 @@
+---
+title: "Count Vectorizer"
+weight: 1
+type: docs
+aliases:
+- /operators/feature/countvectorizer.html
+---
+
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+  http://www.apache.org/licenses/LICENSE-2.0
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+## Count Vectorizer
+
+CountVectorizer aims to help convert a collection of text documents to
+vectors of token counts. When an a-priori dictionary is not available,
+CountVectorizer can be used as an estimator to extract the vocabulary,
+and generates a CountVectorizerModel. The model produces sparse
+representations for the documents over the vocabulary, which can then
+be passed to other algorithms like LDA.
+
+### Input Columns
+
+| Param name | Type | Default | Description |
+|:-----------|:---------|:----------|:--------------------|
+| inputCol | String[] | `"input"` | Input string array. |
+
+### Output Columns
+
+| Param name | Type | Default | Description |
+|:-----------|:-------------|:-----------|:------------------------|
+| outputCol | SparseVector | `"output"` | Vector of token counts. |
+
+### Parameters
+
+Below are the parameters required by `CountVectorizerModel`.
+
+| Key | Default | Type | Required | Description |
+|------------|------------|---------|----------|-------------|
+| inputCol | `"input"` | String | no | Input column name. |
+| outputCol | `"output"` | String | no | Output column name. |
+| minTF | `1.0` | Double | no | Filter to ignore rare words in a document. For each document, terms with frequency/count less than the given threshold are ignored. If this is an integer >= 1, then this specifies a count (of times the term must appear in the document); if this is a double in [0,1), then this specifies a fraction (out of the document's token count). |

Review Comment:
   With the current wording, users are advised to set this param to an integer when it specifies a count, which makes the meaning of this parameter clear.
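For reference, the two readings of `minTF` described in the table row above can be sketched in plain Python. This is only an illustration of the documented rule, not the Flink ML implementation; the helper names `effective_min_tf` and `filter_terms` are hypothetical and do not appear in the PR.

```python
from collections import Counter


def effective_min_tf(min_tf: float, doc_length: int) -> float:
    """Return the per-document count threshold implied by min_tf.

    Assumption taken from the parameter description: a value >= 1 is an
    absolute count, a value in [0, 1) is a fraction of the token count.
    """
    if min_tf < 0:
        raise ValueError("min_tf must be non-negative")
    if min_tf >= 1.0:
        return min_tf
    return min_tf * doc_length


def filter_terms(tokens, min_tf=1.0):
    """Keep only terms whose count in this document reaches the threshold."""
    counts = Counter(tokens)
    threshold = effective_min_tf(min_tf, len(tokens))
    # Terms with a count below the threshold are ignored.
    return {term: n for term, n in counts.items() if n >= threshold}


tokens = ["a", "a", "a", "b", "b", "c"]
kept_by_count = filter_terms(tokens, min_tf=2.0)     # count reading
kept_by_fraction = filter_terms(tokens, min_tf=0.5)  # fraction reading
```

With six tokens, `min_tf=2.0` keeps terms that appear at least twice, while `min_tf=0.5` keeps terms that make up at least half of the document.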
