[jira] [Updated] (FLINK-31189) Allow special handle of less frequent values in StringIndexer

Fan Hong (Jira) Wed, 22 Feb 2023 19:39:05 -0800


     [ https://issues.apache.org/jira/browse/FLINK-31189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Fan Hong updated FLINK-31189:
-----------------------------
    Description: 
Real-world datasets often contain categorical features with millions of 
distinct values, some of which may appear only a few times. To maximize the 
performance of certain algorithms, it is important to handle these less 
frequent values properly. A popular approach is to map them to a special 
index, as sklearn's OneHotEncoder does [1].

[1] https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html

  was:
In real-world datasets, categorical features may have millions of distinct 
values, while some of them may only occur few times. Special handling of less 
frequent values can bring performance increase in some algorithms.

 

One  


> Allow special handle of less frequent values in StringIndexer
> -------------------------------------------------------------
>
>                 Key: FLINK-31189
>                 URL: https://issues.apache.org/jira/browse/FLINK-31189
>             Project: Flink
>          Issue Type: Improvement
>          Components: Library / Machine Learning
>            Reporter: Fan Hong
>            Priority: Major
>
> Real-world datasets often contain categorical features with millions of 
> distinct values, some of which may appear only a few times. To maximize the 
> performance of certain algorithms, it is important to handle these less 
> frequent values properly. A popular approach is to map them to a special 
> index, as sklearn's OneHotEncoder does [1].
>
> [1] https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
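
To make the proposal concrete, here is a minimal sketch in plain Python of mapping infrequent values to one shared index. It is not the Flink ML StringIndexer API and not the implementation planned for this issue. The names fit_string_index, transform, and min_frequency are hypothetical, chosen only to echo the threshold idea in the sklearn encoder cited above. Sending unseen values to the same shared index is also an assumption of this sketch.

```python
from collections import Counter


def fit_string_index(values, min_frequency=2):
    """Build a value -> index mapping (hypothetical helper, not a Flink API).

    Values seen fewer than min_frequency times get no index of their own;
    they all share one extra "infrequent" index placed after the others.
    """
    counts = Counter(values)
    # Frequent values ordered by descending count, ties broken alphabetically.
    frequent = sorted(
        (v for v, c in counts.items() if c >= min_frequency),
        key=lambda v: (-counts[v], v),
    )
    mapping = {v: i for i, v in enumerate(frequent)}
    infrequent_index = len(frequent)  # shared slot for all rare values
    return mapping, infrequent_index


def transform(values, mapping, infrequent_index):
    # Assumption of this sketch: unseen values also use the shared index.
    return [mapping.get(v, infrequent_index) for v in values]


data = ["a", "b", "a", "c", "a", "b", "d"]
mapping, rare = fit_string_index(data, min_frequency=2)
indexed = transform(data, mapping, rare)
print(mapping, rare, indexed)
```

Here "c" and "d" each occur once, so both fall below the threshold and share one index, while "a" and "b" keep indices of their own. The index space is then bounded by the number of frequent values plus one, rather than by the number of distinct values.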



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
