sammysheep commented on issue #17907: SPARK-7856 Principal components and variance using computeSVD()
URL: https://github.com/apache/spark/pull/17907#issuecomment-575315424

@ghoto @srowen

- One can raise the executor and driver memory limits and `spark.driver.maxResultSize`, and you will still get these OOM errors when running PCA on very wide matrices. It appears to be related to Java array size limitations: for scale, a dense n x n Gramian of doubles already tops 2 GB once n exceeds roughly 16,000 columns. [This JIRA addressed some 2G limits](https://issues.apache.org/jira/browse/SPARK-6235), but returning large results to the driver was explicitly not addressed.
- R has two routines, `prcomp` and `princomp`, which do PCA via SVD and eigenvalue decomposition, respectively. Incidentally, the R authors also prefer SVD for numerical reasons. Even so, having both methods could be advantageous, as discussed. One could add an "algorithm" argument to the PCA constructor that lets the user select the method, defaulting to the legacy one; see the sketch after this list.
- In certain disciplines, like bioinformatics, wide matrices are very common. They also arise when you only have a distance matrix and cannot obtain a skinny set of observations. Current Spark (2.4.4) PCA simply fails on such matrices without a fix.
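For concreteness, here is a minimal `spark-shell` sketch (Spark 2.4.x `mllib` API) contrasting the two paths: the legacy covariance/eigendecomposition route behind `computePrincipalComponents`, and the SVD route where the principal components are the right singular vectors. The toy data and the explicit centering step are my own additions for illustration; note that `computeSVD` does not center its input, while the covariance-based path does.

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// Toy data for illustration; assumes a spark-shell session where `sc` exists.
val rows = sc.parallelize(Seq(
  Vectors.dense(1.0, 2.0, 3.0),
  Vectors.dense(4.0, 5.0, 6.0),
  Vectors.dense(7.0, 8.0, 10.0)
))

// SVD-based PCA needs mean-centered rows; the covariance/eigen path centers
// implicitly, so center explicitly here to make the two paths comparable.
val mean = new RowMatrix(rows).computeColumnSummaryStatistics().mean
val centered = rows.map { v =>
  Vectors.dense(v.toArray.zip(mean.toArray).map { case (x, mu) => x - mu })
}
val mat = new RowMatrix(centered)

// Legacy path: materializes an n x n covariance matrix on the driver and
// eigendecomposes it locally -- this is what blows up for wide matrices.
val pcLegacy = mat.computePrincipalComponents(2)

// SVD path: the principal components are the right singular vectors V, and
// the variance explained by component i falls out as s(i)^2 / (m - 1).
val svd = mat.computeSVD(2, computeU = false)
val pcSvd = svd.V
val variances = svd.s.toArray.map(s => s * s / (mat.numRows() - 1))
```

Nothing n x n has to be collected on the driver along the SVD path, which is what would make it viable for the wide inputs described above.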
