sammysheep commented on issue #17907: SPARK-7856 Principal components and variance using computeSVD()
URL: https://github.com/apache/spark/pull/17907#issuecomment-575315424

@ghoto @srowen

- One can raise the executor and driver memory limits and `spark.driver.maxResultSize`, and you will still get these OOM errors when running PCA on very wide matrices. It appears to be related to Java array size limitations: for scale, a dense n x n Gramian of doubles already tops 2 GB once n exceeds roughly 16,000 columns. [This JIRA addressed some 2G limits](https://issues.apache.org/jira/browse/SPARK-6235), but returning large results to the driver was explicitly not addressed.
- R has two routines, `prcomp` and `princomp`, which do PCA via SVD and eigenvalue decomposition, respectively. Incidentally, the R authors also prefer SVD for numerical reasons. Even so, having both methods could be advantageous, as discussed. One could add an "algorithm" argument to the PCA constructor that lets the user select the method, defaulting to the legacy one; see the sketch after this list.
- In certain disciplines, like bioinformatics, wide matrices are very common. They also arise when you only have a distance matrix and cannot obtain a skinny set of observations. Current Spark (2.4.4) PCA simply fails on such matrices without a fix.
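For concreteness, here is a minimal `spark-shell` sketch (Spark 2.4.x `mllib` API) contrasting the two paths: the legacy covariance/eigendecomposition route behind `computePrincipalComponents`, and the SVD route where the principal components are the right singular vectors. The toy data and the explicit centering step are my own additions for illustration; note that `computeSVD` does not center its input, while the covariance-based path does.

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// Toy data for illustration; assumes a spark-shell session where `sc` exists.
val rows = sc.parallelize(Seq(
  Vectors.dense(1.0, 2.0, 3.0),
  Vectors.dense(4.0, 5.0, 6.0),
  Vectors.dense(7.0, 8.0, 10.0)
))

// SVD-based PCA needs mean-centered rows; the covariance/eigen path centers
// implicitly, so center explicitly here to make the two paths comparable.
val mean = new RowMatrix(rows).computeColumnSummaryStatistics().mean
val centered = rows.map { v =>
  Vectors.dense(v.toArray.zip(mean.toArray).map { case (x, mu) => x - mu })
}
val mat = new RowMatrix(centered)

// Legacy path: materializes an n x n covariance matrix on the driver and
// eigendecomposes it locally -- this is what blows up for wide matrices.
val pcLegacy = mat.computePrincipalComponents(2)

// SVD path: the principal components are the right singular vectors V, and
// the variance explained by component i falls out as s(i)^2 / (m - 1).
val svd = mat.computeSVD(2, computeU = false)
val pcSvd = svd.V
val variances = svd.s.toArray.map(s => s * s / (mat.numRows() - 1))
```

Nothing n x n has to be collected on the driver along the SVD path, which is what would make it viable for the wide inputs described above.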
