cqfrog created SPARK-35423:
------------------------------

             Summary: The output of PCA is inconsistent
                 Key: SPARK-35423
                 URL: https://issues.apache.org/jira/browse/SPARK-35423
             Project: Spark
          Issue Type: Bug
          Components: MLlib
    Affects Versions: 3.1.1
         Environment: Spark Version: 3.1.1 
            Reporter: cqfrog


1. The example from the documentation

 
{code:scala}
import org.apache.spark.ml.feature.PCA
import org.apache.spark.ml.linalg.Vectors

val data = Array(
  Vectors.sparse(5, Seq((1, 1.0), (3, 7.0))),
  Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),
  Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0)
)
val df = spark.createDataFrame(data.map(Tuple1.apply)).toDF("features")

val pca = new PCA()
  .setInputCol("features")
  .setOutputCol("pcaFeatures")
  .setK(3)
  .fit(df)

val result = pca.transform(df).select("pcaFeatures")
result.show(false)
{code}
 

 

The output shows:
{code:java}
+-----------------------------------------------------------+
|pcaFeatures                                                |
+-----------------------------------------------------------+
|[1.6485728230883807,-4.013282700516296,-5.524543751369388] |
|[-4.645104331781534,-1.1167972663619026,-5.524543751369387]|
|[-6.428880535676489,-5.337951427775355,-5.524543751369389] |
+-----------------------------------------------------------+
{code}
2. Change the vector format

I changed the first row from "Vectors.sparse(5, Seq((1, 1.0), (3, 7.0)))" to the equivalent dense form 
"Vectors.dense(0.0,1.0,0.0,7.0,0.0)".

But the output now shows:
{code:java}
+------------------------------------------------------------+
|pcaFeatures                                                 |
+------------------------------------------------------------+
|[1.6485728230883814,-4.0132827005162985,-1.0091435193998504]|
|[-4.645104331781533,-1.1167972663619048,-1.0091435193998501]|
|[-6.428880535676488,-5.337951427775359,-1.009143519399851]  |
+------------------------------------------------------------+
{code}
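It's strange that the two outputs are inconsistent. The first two components agree to within floating-point rounding, but the third component is about -5.52 with the sparse vector and about -1.01 with the dense vector, even though both vectors hold the same values. Why?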
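
A minimal sketch for narrowing this down, assuming a spark-shell session where `spark` is already defined. The helper `fitWith` and the names `sparseRow`, `denseRow`, `sparseModel` and `denseModel` are hypothetical and only used for illustration. It first confirms that the two encodings of the first row hold the same values, then fits one model per encoding and prints each model's `explainedVariance` and `pc` so they can be compared side by side. With only three rows, the sample covariance matrix has rank at most two, so it may be worth checking whether the third entry of `explainedVariance` is close to zero in both cases.

{code:scala}
import org.apache.spark.ml.feature.PCA
import org.apache.spark.ml.linalg.{Vector, Vectors}

// Assumes an active SparkSession named `spark`, as in spark-shell.

// The two encodings of the first row used in steps 1 and 2.
val sparseRow: Vector = Vectors.sparse(5, Seq((1, 1.0), (3, 7.0)))
val denseRow: Vector = Vectors.dense(0.0, 1.0, 0.0, 7.0, 0.0)

// Sanity check: both encodings hold the same five values.
val sameValues = sparseRow.toArray.sameElements(denseRow.toArray)
println(s"same values: $sameValues")

// Hypothetical helper: fit PCA with k = 3, varying only the first row.
def fitWith(firstRow: Vector) = {
  val data = Array(
    firstRow,
    Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),
    Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0)
  )
  val df = spark.createDataFrame(data.map(Tuple1.apply)).toDF("features")
  new PCA()
    .setInputCol("features")
    .setOutputCol("pcaFeatures")
    .setK(3)
    .fit(df)
}

val sparseModel = fitWith(sparseRow)
val denseModel = fitWith(denseRow)

// Proportion of variance explained by each of the three components.
println(sparseModel.explainedVariance)
println(denseModel.explainedVariance)

// Principal-component matrices (5 x 3); column j holds component j.
println(sparseModel.pc)
println(denseModel.pc)
{code}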

Thanks.

 


