[
https://issues.apache.org/jira/browse/FLINK-31189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Fan Hong updated FLINK-31189:
-----------------------------
Description:
Real-world datasets often contain categorical features with millions of
distinct values, some of which may only appear a few times. To maximize the
performance of certain algorithms, it is important to treat these less frequent
values properly. A popular approach is to map them to a special index, as is
done in sklearn's OneHotEncoder [1].
[1]
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
was:
In real-world datasets, categorical features may have millions of distinct
values, while some of them may only occur few times. Special handling of less
frequent values can bring performance increase in some algorithms.
One
> Allow special handling of less frequent values in StringIndexer
> ---------------------------------------------------------------
>
> Key: FLINK-31189
> URL: https://issues.apache.org/jira/browse/FLINK-31189
> Project: Flink
> Issue Type: Improvement
> Components: Library / Machine Learning
> Reporter: Fan Hong
> Priority: Major
>
> Real-world datasets often contain categorical features with millions of
> distinct values, some of which may only appear a few times. To maximize the
> performance of certain algorithms, it is important to treat these less
> frequent values properly. A popular approach is to map them to a special
> index, as is done in sklearn's OneHotEncoder [1].
>
> [1]
> https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
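
For illustration, the "special index" approach described above could be
sketched roughly as follows. This is a minimal pure-Python sketch, not the
Flink ML StringIndexer API: the names fit_string_indexer, transform and
min_frequency are hypothetical, and sending unseen values to the same shared
index is an assumption made here for simplicity.

```python
from collections import Counter


def fit_string_indexer(values, min_frequency=2):
    """Build a value -> index mapping (hypothetical helper, not a Flink ML API).

    Values seen fewer than `min_frequency` times get no index of their own;
    they all share one extra index placed after the frequent values.
    """
    counts = Counter(values)
    # Frequent values, ordered by descending count, then by value so that
    # ties are broken deterministically.
    frequent = sorted(
        (v for v, c in counts.items() if c >= min_frequency),
        key=lambda v: (-counts[v], v),
    )
    mapping = {v: i for i, v in enumerate(frequent)}
    infrequent_index = len(frequent)  # shared index for all rare values
    return mapping, infrequent_index


def transform(values, mapping, infrequent_index):
    """Map each value to its index; rare (and, by assumption, unseen) values
    fall back to the shared infrequent index."""
    return [mapping.get(v, infrequent_index) for v in values]


data = ["a", "b", "a", "c", "a", "b", "d"]
mapping, infrequent_index = fit_string_indexer(data, min_frequency=2)
indices = transform(data, mapping, infrequent_index)
print(mapping, infrequent_index, indices)
```

Here "a" and "b" keep their own indices, while "c" and "d", which each occur
once, share a single index. This keeps the index space small even when a
feature has a very long tail of rare values.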