[ 
https://issues.apache.org/jira/browse/SPARK-35423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17354810#comment-17354810
 ] 

shahid commented on SPARK-35423:
--------------------------------

I would like to analyse this issue.

> The output of PCA is inconsistent
> ---------------------------------
>
>                 Key: SPARK-35423
>                 URL: https://issues.apache.org/jira/browse/SPARK-35423
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib
>    Affects Versions: 3.1.1
>         Environment: Spark Version: 3.1.1 
>            Reporter: cqfrog
>            Priority: Major
>
> 1. The example from the docs
>  
> {code:java}
> import org.apache.spark.ml.feature.PCA
> import org.apache.spark.ml.linalg.Vectors
> val data = Array(
>   Vectors.sparse(5, Seq((1, 1.0), (3, 7.0))),
>   Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),
>   Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0)
> )
> val df = spark.createDataFrame(data.map(Tuple1.apply)).toDF("features")
> val pca = new PCA()
>   .setInputCol("features")
>   .setOutputCol("pcaFeatures")
>   .setK(3)
>   .fit(df)
> val result = pca.transform(df).select("pcaFeatures")
> result.show(false)
> {code}
>  
>  
> The output shows:
> {code:java}
> +-----------------------------------------------------------+
> |pcaFeatures                                                |
> +-----------------------------------------------------------+
> |[1.6485728230883807,-4.013282700516296,-5.524543751369388] |
> |[-4.645104331781534,-1.1167972663619026,-5.524543751369387]|
> |[-6.428880535676489,-5.337951427775355,-5.524543751369389] |
> +-----------------------------------------------------------+
> {code}
> 2. Change the vector format
> I changed "Vectors.sparse(5, Seq((1, 1.0), (3, 7.0)))" to 
> "Vectors.dense(0.0,1.0,0.0,7.0,0.0)", which represents the same vector,
> but the output now shows:
> {code:java}
> +------------------------------------------------------------+
> |pcaFeatures                                                 |
> +------------------------------------------------------------+
> |[1.6485728230883814,-4.0132827005162985,-1.0091435193998504]|
> |[-4.645104331781533,-1.1167972663619048,-1.0091435193998501]|
> |[-6.428880535676488,-5.337951427775359,-1.009143519399851]  |
> +------------------------------------------------------------+
> {code}
> It's strange that the two outputs are inconsistent. Why?
> Thanks.
>  
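
As a starting point for the analysis, the two result tables quoted above can be compared column by column. The sketch below needs no Spark; it only copies the reported numbers into plain arrays. The object name `PcaOutputDiff` and its helpers are made up for illustration and are not part of Spark.

```scala
object PcaOutputDiff {
  // pcaFeatures rows copied from the report: first input given as a sparse vector.
  val sparseRun: Array[Array[Double]] = Array(
    Array(1.6485728230883807, -4.013282700516296, -5.524543751369388),
    Array(-4.645104331781534, -1.1167972663619026, -5.524543751369387),
    Array(-6.428880535676489, -5.337951427775355, -5.524543751369389)
  )

  // pcaFeatures rows copied from the report: first input given as a dense vector.
  val denseRun: Array[Array[Double]] = Array(
    Array(1.6485728230883814, -4.0132827005162985, -1.0091435193998504),
    Array(-4.645104331781533, -1.1167972663619048, -1.0091435193998501),
    Array(-6.428880535676488, -5.337951427775359, -1.009143519399851)
  )

  // Largest absolute difference between the two runs, per output component.
  val maxAbsDiff: Array[Double] = Array.tabulate(3) { j =>
    sparseRun.indices.map(i => math.abs(sparseRun(i)(j) - denseRun(i)(j))).max
  }

  // Spread (max - min) of one component across the three rows of a run.
  def spread(run: Array[Array[Double]], j: Int): Double = {
    val col = run.map(_(j))
    col.max - col.min
  }

  def main(args: Array[String]): Unit = {
    maxAbsDiff.zipWithIndex.foreach { case (d, j) =>
      println(s"component $j: max |sparse - dense| = $d")
    }
    println(s"spread of component 2, sparse run: ${spread(sparseRun, 2)}")
    println(s"spread of component 2, dense run:  ${spread(denseRun, 2)}")
  }
}
```

With the reported numbers, the first two components agree to floating-point rounding, and only the third differs. Within each run the third component is also essentially the same value on every row. One possible explanation, not confirmed in this thread, is that three samples give a covariance matrix of rank at most 2, so the third principal direction has a near-zero eigenvalue and is not uniquely determined.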



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

