[ https://issues.apache.org/jira/browse/SPARK-24666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16978077#comment-16978077 ]
ZhongYu commented on SPARK-24666:
---------------------------------

Hi [~viirya] and [~holden], I have added data and code to reproduce this issue.

> Word2Vec generate infinity vectors when numIterations are large
> ---------------------------------------------------------------
>
>                 Key: SPARK-24666
>                 URL: https://issues.apache.org/jira/browse/SPARK-24666
>             Project: Spark
>          Issue Type: Bug
>          Components: ML, MLlib
>    Affects Versions: 2.3.1, 2.4.4
>         Environment: 2.0.X, 2.1.X, 2.2.X, 2.3.X, 2.4.X
>            Reporter: ZhongYu
>            Priority: Critical
>
> We found that Word2Vec generates vectors with large absolute values when
> numIterations is large, and if numIterations is large enough (>20), the
> vector values may be *Infinity* (or *-Infinity*), resulting in useless
> vectors.
> Under normal conditions, vector values mostly fall between -1.0 and 1.0 when
> numIterations = 1.
> The bug appears in Spark 2.0.X, 2.1.X, 2.2.X, 2.3.X, and 2.4.X.
> An existing issue already reports this bug:
> https://issues.apache.org/jira/browse/SPARK-5261 , but the fix appears to be
> missing.
> Other people's reports:
> [https://stackoverflow.com/questions/49741956/infinity-vectors-in-spark-mllib-word2vec]
> [http://apache-spark-user-list.1001560.n3.nabble.com/word2vec-outputs-Infinity-Infinity-vectors-with-increasing-iterations-td29020.html]
> =======================================================
> Here is the code to reproduce the issue. You can download title.akas.tsv
> from [https://datasets.imdbws.com/] and upload it to HDFS.
>
> {code:java}
> import org.apache.spark.sql.SparkSession
> import org.apache.spark.ml.feature.Word2Vec
> case class Sentences(name: String, words: Array[String])
> import spark.implicits._
> // IMDB raw data title.akas.tsv download from https://datasets.imdbws.com/
> // Keep US/English titles and split each title into words on spaces.
> val dataset = spark.read
>   .option("header", "true").option("sep", "\t")
>   .option("quote", "").option("nullValue", "\\N")
>   .csv("/tmp/word2vec/title.akas.tsv")
>   .filter("region = 'US' or language = 'en'")
>   .select("title")
>   .as[String]
>   .map(s => Sentences(s, s.split(' ')))
>   .persist()
> println("Training model...")
> val word2Vec = new Word2Vec()
>   .setInputCol("words")
>   .setOutputCol("vector")
>   .setVectorSize(64)
>   .setWindowSize(4)
>   .setNumPartitions(50)
>   .setMinCount(5)
>   .setMaxIter(20)
> val model = word2Vec.fit(dataset)
> // Print the learned word vectors.
> model.getVectors.show()
> {code}
> When maxIter is set to 30, you get the following result:
> {code:java}
> scala> model.getVectors.show()
> +-------------+--------------------+
> |         word|              vector|
> +-------------+--------------------+
> |     Unspoken|[-Infinity,-Infin...|
> |       Talent|[Infinity,-Infini...|
> |    Hourglass|[1.09657520526310...|
> |Nickelodeon's|[2.20436549446219...|
> |      Priests|[-1.9625896848389...|
> |    Religion:|[-3.8815759928213...|
> |           Bu|[-7.9722236466752...|
> |      Totoro:|[-4.1829056206528...|
> |     Trouble,|[2.51985378203136...|
> |       Hatter|[8.49108115961009...|
> |          '79|[-5.4560309784650...|
> |         Vile|[-1.2059769646379...|
> |         9/11|[Infinity,-Infini...|
> |      Santino|[6.30405421282099...|
> |      Motives|[1.96207712570869...|
> |          '13|[-1.7641987324084...|
> |       Fierce|[-Infinity,Infini...|
> |       Stover|[5.10057474120744...|
> |          'It|[1.08629989605664...|
> |        Butts|[Infinity,Infinit...|
> +-------------+--------------------+
> only showing top 20 rows
> {code}
> In this case, setting maxIter to 20 may not produce Infinity, but it still
> produces very large absolute values. The outcome depends on the training
> data sample and other configurations.
> {code:java}
> scala> model.getVectors.show(2,false)
> +--------+------------------------------------------------------------+
> |word    |vector                                                      |
> +--------+------------------------------------------------------------+
> |Unspoken|[-8.345756381631837E26,-4.521902763541592E26,-2.3382486258889084E27,-1.0244081299466769E27,-2.0078509112460803E27,-1.6760533100889865E27,-2.582670788770659E27,-3.38100521565687E26,1.7553847873565714E27,-1.170131062449021E27,-1.6565472801835883E27,-1.5594244347657445E27,-2.5150639513558596E26,1.949539129915606E27,-7.580918216717454E26,1.2361994783015613E27,-3.152053008864166E27,-8.185652662597534E26,-5.4443628225426E25,2.245579525466733E26,-1.97655047590181E27,2.8597275293150673E26,-1.1006336920210832E27,1.6166580407985987E27,1.5272882143409825E26,-1.0115330404529906E27,-1.8895683222101184E27,2.6156506156954E27,-1.698058504881491E27,-1.5132098806248563E27,3.7327358519511804E27,1.3356636582642166E27,2.3614379909704805E26,8.96912646624494E26,1.5518857669716535E27,-3.05221863964144E27,4.399680909202177E26,-2.607914789100649E27,-1.4080384994067242E27,2.7666078487221474E27,6.946950108699123E26,-1.1122679059344192E27,-2.3621557537823886E27,9.433206702172274E26,-2.3704690372536228E27,2.5086034219659006E27,2.0173186657484236E27,-1.8448836672357273E27,-1.5081404202054957E27,2.641836064055936E26,-5.613083015733733E26,-2.1296579720982533E26,-1.6550184140347592E27,-1.9152898718506886E27,1.25699596863538E27,-2.0774912070471012E27,-1.5454685136432914E27,-2.479843324641509E27,1.5560216745669318E27,-2.2176656540799786E27,-9.628781296451031E26,1.3663974096305426E27,1.6326327735924786E27,-1.9533865304335714E27]|
> |Talent  |[1.3996313289146157E31,-2.216329024373106E31,1.0729251707928603E31,-4.007120754159977E31,-7.217488429248302E30,3.579654497535965E31,2.7979270365837212E31,4.333613174196825E31,3.2947832174019738E31,-1.770444782887265E31,-1.1996572271408077E31,1.9686960444755403E31,-5.211369239778517E31,4.559579301984929E31,8.789691017490939E30,-3.3896103915518896E31,-2.842517781869879E31,3.653230690058367E31,1.6690004323711066E31,-1.1803405268246773E31,4.577673536512265E31,3.9686553942166427E31,-2.0779652882517364E31,9.553626958941078E29,-1.1967228014988571E31,2.667234660143298E31,-5.082234231802067E29,-5.053934698852727E31,2.911363689445293E31,4.57440169967406E31,2.296044625777839E31,3.4719839372636273E31,-4.753091634806606E30,-2.2139650908254315E31,5.747913246328898E31,-4.027332301367786E31,-3.3981312029599884E30,-3.235915541756495E31,-3.690297564613571E31,3.6645060993927487E31,2.32138854666024E31,-4.79833731565554E31,2.4538652976104142E31,4.91394707312416E30,2.2888500664401483E31,8.433142525511996E30,-2.3447174299865074E31,-3.9894235308718024E31,1.6571656530599892E31,3.743449438983912E31,5.619889452742693E31,2.0932366809902723E31,-2.2306515916821173E30,-4.2788883664425833E30,-8.754273117753689E30,-3.8767150140313846E30,-3.7649840346087072E31,-3.604430948638639E31,5.083292737026576E31,2.92915351645125E31,5.971055806972711E31,1.4773152095869043E31,5.12252479772471E31,3.035571146004139E31]|
> +--------+------------------------------------------------------------+
> only showing top 2 rows
> {code}
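
To make it easier to tell whether a given run is affected, here is a minimal sketch of a check that could be applied to the learned vectors. It is plain Scala with no Spark dependency. The names `VectorCheck`, `classify`, and `summarize` are hypothetical helpers, not part of Spark, and the `1e3` threshold is only an assumption based on the observation above that healthy values stay roughly within -1.0 to 1.0.

```scala
// Hypothetical helper, not part of Spark: flags word vectors that contain
// non-finite components or components with suspiciously large magnitude.
object VectorCheck {
  sealed trait VectorHealth
  case object Healthy extends VectorHealth
  case object Exploding extends VectorHealth // finite, but above the threshold
  case object NonFinite extends VectorHealth // contains Infinity, -Infinity, or NaN

  // The default threshold is an arbitrary assumption; tune it for your data.
  def classify(vector: Array[Double], threshold: Double = 1e3): VectorHealth =
    if (vector.exists(x => x.isNaN || x.isInfinity)) NonFinite
    else if (vector.exists(x => math.abs(x) > threshold)) Exploding
    else Healthy

  // Count how many words fall into each category.
  def summarize(vectors: Seq[(String, Array[Double])],
                threshold: Double = 1e3): Map[VectorHealth, Int] =
    vectors
      .groupBy { case (_, v) => classify(v, threshold) }
      .map { case (health, group) => health -> group.size }

  // First two components of rows reported above, plus one made-up healthy row.
  val sample: Seq[(String, Array[Double])] = Seq(
    "Unspoken" -> Array(-8.345756381631837E26, -4.521902763541592E26),
    "Talent"   -> Array(1.3996313289146157E31, -2.216329024373106E31),
    "Butts"    -> Array(Double.PositiveInfinity, Double.PositiveInfinity),
    "example"  -> Array(0.12, -0.87) // illustrative value, not from the report
  )

  def main(args: Array[String]): Unit =
    summarize(sample).foreach { case (health, n) => println(s"$health: $n") }
}
```

In a spark-shell session, the input could presumably be built from `model.getVectors` by collecting the `word` and `vector` columns and calling `toArray` on each vector. That step is left out here so the sketch stays runnable without a cluster.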