[ https://issues.apache.org/jira/browse/FLINK-1736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Till Rohrmann updated FLINK-1736:
---------------------------------
Assignee: ROSHANI NAGMOTE (was: Alexander Alexandrov)
> Add CountVectorizer to machine learning library
> -----------------------------------------------
>
> Key: FLINK-1736
> URL: https://issues.apache.org/jira/browse/FLINK-1736
> Project: Flink
> Issue Type: New Feature
> Components: Machine Learning Library
> Reporter: Till Rohrmann
> Assignee: ROSHANI NAGMOTE
> Labels: ML, Starter
>
> A {{CountVectorizer}} feature extractor [1] assigns a unique identifier to
> each word occurring in a corpus. With this mapping it can efficiently
> vectorize representations such as bag of words or n-grams. The identifier
> assigned to a word acts as its index in the vector, and the value at that
> index is the number of occurrences of the word.
> The advantage of the {{CountVectorizer}} over the FeatureHasher is that the
> mapping of words to indices can be retrieved, which makes the resulting
> feature vectors easier to interpret.
> The {{CountVectorizer}} could be generalized to support arbitrary feature
> values.
> The {{CountVectorizer}} should be implemented as a {{Transformer}}.
> Resources:
> [1] [http://scikit-learn.org/stable/modules/feature_extraction.html#common-vectorizer-usage]
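For illustration only, the word-to-index mapping described in the issue can be sketched in a few lines of plain Python. This is not the proposed Flink implementation and does not use the Flink or scikit-learn APIs; the function names `fit_vocabulary` and `transform`, the whitespace tokenization, and the lowercasing are all assumptions made for the example.

```python
from collections import Counter


def fit_vocabulary(corpus):
    """Assign each distinct word a unique index, in order of first appearance."""
    vocabulary = {}
    for document in corpus:
        # Assumption: lowercase and split on whitespace as a stand-in tokenizer.
        for word in document.lower().split():
            if word not in vocabulary:
                vocabulary[word] = len(vocabulary)
    return vocabulary


def transform(document, vocabulary):
    """Return a dense count vector; words missing from the vocabulary are ignored."""
    vector = [0] * len(vocabulary)
    for word, count in Counter(document.lower().split()).items():
        index = vocabulary.get(word)
        if index is not None:
            vector[index] = count
    return vector


corpus = ["the cat sat", "the cat saw the dog"]
vocabulary = fit_vocabulary(corpus)
vectors = [transform(document, vocabulary) for document in corpus]

# Inverting the mapping shows which word each vector position stands for,
# which is the interpretability advantage over a hashing approach.
index_to_word = {index: word for word, index in vocabulary.items()}
```

Because the vocabulary is stored explicitly, `index_to_word` lets a reader trace any vector position back to its word. A hashing-based extractor avoids storing this mapping, at the cost of that traceability.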