[ https://issues.apache.org/jira/browse/FLINK-31623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Zhipeng Zhang updated FLINK-31623:
----------------------------------
    Summary: Fix DataStreamUtils#sample with approximate uniform sampling  (was: Change to uniform sampling in DataStreamUtils#sample method)

> Fix DataStreamUtils#sample with approximate uniform sampling
> ------------------------------------------------------------
>
>                 Key: FLINK-31623
>                 URL: https://issues.apache.org/jira/browse/FLINK-31623
>             Project: Flink
>          Issue Type: Bug
>          Components: Library / Machine Learning
>            Reporter: Fan Hong
>            Priority: Major
>              Labels: pull-request-available
>
> The current implementation uses a two-level sampling method.
> However, when data instances are unevenly distributed among partitions
> (subtasks), the probability of an instance being sampled differs
> between partitions (subtasks), i.e., the sampling is not uniform.
>
> In addition, one side effect of the current implementation is that a single
> subtask has a memory footprint of `2 * numSamples * sizeof(element)`, which
> could cause unexpected OOM in some situations.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
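
To illustrate the non-uniformity described in the issue, here is a minimal sketch that computes per-element inclusion probabilities. It is not the Flink ML code. It assumes that "two-level sampling" means each subtask first keeps up to `numSamples` elements chosen uniformly from its own partition, and a final stage then picks `numSamples` uniformly from the union of those candidates. The class and method names (`TwoLevelSamplingBias`, `twoLevelProbability`, `uniformProbability`) are hypothetical.

```java
public class TwoLevelSamplingBias {

    /**
     * Probability that one element of partition idx ends up in the final sample,
     * under the ASSUMED two-level scheme (local sample, then global sample).
     */
    static double twoLevelProbability(long[] partitionSizes, int idx, int numSamples) {
        // Number of candidates that survive the first (per-subtask) level.
        long candidates = 0;
        for (long n : partitionSizes) {
            candidates += Math.min(n, numSamples);
        }
        // Level 1: chance of surviving the local sample in its own partition.
        double level1 = Math.min(1.0, (double) numSamples / partitionSizes[idx]);
        // Level 2: chance of surviving the global sample over all candidates.
        double level2 = Math.min(1.0, (double) numSamples / candidates);
        return level1 * level2;
    }

    /** Probability under true uniform sampling without replacement. */
    static double uniformProbability(long[] partitionSizes, int numSamples) {
        long total = 0;
        for (long n : partitionSizes) {
            total += n;
        }
        return Math.min(1.0, (double) numSamples / total);
    }

    public static void main(String[] args) {
        long[] sizes = {1000, 10}; // heavily imbalanced partitions
        int numSamples = 10;
        for (int i = 0; i < sizes.length; i++) {
            System.out.println("partition " + i + ": two-level p = "
                    + twoLevelProbability(sizes, i, numSamples));
        }
        System.out.println("uniform p = " + uniformProbability(sizes, numSamples));
    }
}
```

Under these assumptions, with partitions of 1000 and 10 elements and `numSamples = 10`, an element in the large partition is kept with probability 0.005 and an element in the small partition with probability 0.5, whereas uniform sampling would give every element 10/1010 (about 0.0099).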