[
https://issues.apache.org/jira/browse/SPARK-9062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14630934#comment-14630934
]
yuhao yang edited comment on SPARK-9062 at 7/17/15 8:03 AM:
------------------------------------------------------------
Currently it seems that Word2Vec can already handle null values.
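{code}
import org.apache.spark.ml.feature.Word2Vec
import org.apache.spark.mllib.linalg.Vector

// Rows 0-2 hold ordinary tokens; row 3 holds only null tokens.
val sentence = "a b " * 100 + "a c " * 10
val df = sqlContext.createDataFrame(Seq(
  (0, sentence.split("\\s+").toSeq),
  (1, sentence.split("\\s+").toSeq),
  (2, Array("a", "b").toSeq),
  (3, Array(null, null).toSeq)
)).toDF("id", "words")

// Fit on the "words" column, nulls included.
val w2vModel = new Word2Vec()
  .setInputCol("words")
  .setOutputCol("features")
  .fit(df)

// Transform the same data and print each row's feature vector.
val output = w2vModel.transform(df).collect()
output.foreach { p =>
  val features = p.getAs[Vector]("features")
  println(features)
}
{code}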
I'll go ahead and add a unit test if necessary.
> Change output type of Tokenizer to Array(String, true)
> ------------------------------------------------------
>
> Key: SPARK-9062
> URL: https://issues.apache.org/jira/browse/SPARK-9062
> Project: Spark
> Issue Type: Improvement
> Components: ML
> Reporter: yuhao yang
> Priority: Minor
>
> Currently the output type of Tokenizer is Array(String, false), which is not
> compatible with Word2Vec and other transformers, since their input type is
> Array(String, true). Seq[String] in a udf will be treated as Array(String,
> true) by default.
> I'm also thinking that for nullable columns, maybe Tokenizer should return
> Array(null) for a null value in the input.
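
A minimal sketch of the type mismatch described in the issue, assuming a spark-shell session with Spark SQL on the classpath. The names tokenizerOutput and word2VecInput are illustrative only and are not taken from the Spark source; they stand for the two array types named in the description.

```scala
import org.apache.spark.sql.types.{ArrayType, StringType}

// Array(String, false): elements are declared non-nullable
// (the Tokenizer output type according to the issue description).
val tokenizerOutput = ArrayType(StringType, containsNull = false)

// Array(String, true): elements may be null
// (the input type expected by Word2Vec according to the issue description).
val word2VecInput = ArrayType(StringType, containsNull = true)

// ArrayType is a case class, so equality also compares the containsNull flag.
println(tokenizerOutput == word2VecInput)

// ArrayType(elementType) without the flag defaults to containsNull = true.
println(ArrayType(StringType) == word2VecInput)
```

A strict equality check on the column's data type would therefore treat the two as different types even though both hold strings, which is the incompatibility the issue proposes to remove.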