[ https://issues.apache.org/jira/browse/SPARK-25219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Marco Gaido updated SPARK-25219:
--------------------------------
    Component/s:     (was: Spark Submit)
                 ML

> KMeans Clustering - Text Data - Results are incorrect
> -----------------------------------------------------
>
>                 Key: SPARK-25219
>                 URL: https://issues.apache.org/jira/browse/SPARK-25219
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 2.3.0
>            Reporter: Vasanthkumar Velayudham
>            Priority: Major
>
> Hello Everyone,
> I am having trouble using KMeans clustering on my text data. When I apply
> clustering after a series of transformations (RegexTokenizer, stop-word
> processing, HashingTF, IDF), the resulting clusters are poor: one cluster
> ends up with a lot of the data points assigned to it.
> I am able to cluster the same data with similar processing and the same
> attributes using the scikit-learn KMeans algorithm.
> Searching the internet, I see that many others have reported the same issue
> with Spark's KMeans clustering library.
> I request your help in fixing this issue.
> Please let me know if you require any additional details.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
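
One possible cause worth checking, offered as an assumption and not a confirmed diagnosis of this issue: scikit-learn's TfidfVectorizer L2-normalizes each row by default (norm='l2'), while Spark's HashingTF followed by IDF leaves vector lengths untouched. If the scikit-learn run used that default, the two pipelines are not equivalent, and on unnormalized vectors the Euclidean distance used by KMeans can be dominated by document length instead of topic. The pure-Python sketch below illustrates the effect; the TF-IDF weights and helper names are made up for illustration and are not taken from the reporter's data.

```python
import math


def l2_normalize(vec):
    """Scale a sparse vector (dict of index -> weight) to unit Euclidean length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    if norm == 0.0:
        return dict(vec)  # leave an all-zero vector unchanged
    return {i: w / norm for i, w in vec.items()}


def euclidean(a, b):
    """Euclidean distance between two sparse vectors stored as dicts."""
    keys = set(a) | set(b)
    return math.sqrt(sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys))


# Hypothetical TF-IDF weights (assumed values, for illustration only).
short_doc = {0: 1.0, 1: 1.0}    # short document on topic A
long_doc = {0: 10.0, 1: 10.0}   # long document on the same topic A
other_doc = {2: 1.0, 3: 1.0}    # short document on a different topic B

# Without normalization, the two topic-A documents are farther apart
# than the two short documents on different topics.
raw_same = euclidean(short_doc, long_doc)
raw_diff = euclidean(short_doc, other_doc)

# After L2 normalization, only the direction of each vector matters.
norm_same = euclidean(l2_normalize(short_doc), l2_normalize(long_doc))
norm_diff = euclidean(l2_normalize(short_doc), l2_normalize(other_doc))

print("raw:        same topic", raw_same, "different topic", raw_diff)
print("normalized: same topic", norm_same, "different topic", norm_diff)
```

In a Spark ML pipeline, the corresponding step would be a pyspark.ml.feature.Normalizer stage with p=2.0 placed between IDF and KMeans. Whether that resolves the behaviour reported here depends on the actual data and pipeline, which are not included in the report.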