[ 
https://issues.apache.org/jira/browse/SPARK-9062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14630934#comment-14630934
 ] 

yuhao yang edited comment on SPARK-9062 at 7/17/15 8:03 AM:
------------------------------------------------------------

Currently it seems Word2Vec is able to handle null values:
{code}
    // Build a small DataFrame whose last row contains only null tokens.
    val sentence = "a b " * 100 + "a c " * 10
    val df = sqlContext.createDataFrame(Seq(
      (0, sentence.split("\\s+").toSeq),
      (1, sentence.split("\\s+").toSeq),
      (2, Array("a", "b").toSeq),
      (3, Array(null, null).toSeq)  // row with null values
    )).toDF("id", "words")

    // Fit a Word2Vec model and verify that transform() succeeds on the null row.
    val w2vModel = new Word2Vec()
      .setInputCol("words")
      .setOutputCol("features")
      .fit(df)
    val output = w2vModel.transform(df).collect()
    output.foreach { p =>
      val features = p.getAs[Vector]("features")
      println(features)
    }
{code}
I'll go ahead and add a unit test if necessary.



> Change output type of Tokenizer to Array(String, true)
> ------------------------------------------------------
>
>                 Key: SPARK-9062
>                 URL: https://issues.apache.org/jira/browse/SPARK-9062
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>            Reporter: yuhao yang
>            Priority: Minor
>
> Currently the output type of Tokenizer is Array(String, false), which is not 
> compatible with Word2Vec and other transformers, since their input type is 
> Array(String, true). A Seq[String] in a udf is treated as Array(String, true) 
> by default. 
> I'm also wondering whether, for nullable columns, the tokenizer should return 
> Array(null) for null values in the input.
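> The mismatch above can be sketched with Spark SQL's type objects alone; this is an illustration of the `containsNull` difference described in the issue, not Spark's actual compatibility check, and the object name here is made up:
> {code}
> import org.apache.spark.sql.types.{ArrayType, StringType}
>
> // Hypothetical illustration of the schema mismatch.
> object NullabilityMismatch {
>   def main(args: Array[String]): Unit = {
>     // What Tokenizer currently declares as its output type.
>     val tokenizerOutput = ArrayType(StringType, containsNull = false)
>     // What Word2Vec and other transformers expect as input.
>     val word2vecInput = ArrayType(StringType, containsNull = true)
>
>     // A strict equality check fails because containsNull differs,
>     // even though a non-nullable array is safe anywhere a nullable
>     // one is accepted.
>     println(tokenizerOutput == word2vecInput)  // false
>   }
> }
> {code}
> Changing Tokenizer's declared output type to Array(String, true) would make the two types equal, which is the fix this issue proposes.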



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
