[ https://issues.apache.org/jira/browse/SPARK-3541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148995#comment-14148995 ]
Xiangrui Meng commented on SPARK-3541: -------------------------------------- Wrote a new implementation that gives ~5x speedup and better scalability, while the algorithm doesn't change. 1. Use `(Int, Int, Float)` as input, saving 4 bytes on rating. 2. Out link blocks are stored as `out: Array[Array[Int]]`, where out(dstBlockId) contains the src indices (local to the src block) associated with the dstBlockId. 3. In link blocks are stored in a CSC format: {code} class InBlock( srcIds: Array[Int], dstPtrs: Array[Int], dstEncodedIndices: Array[Int], ratings: Array[Float]) {code} `dstEncodedIndices` contains encoded `dstBlockId` (high bits) and `dstLocalIndex` (low bits). Using this data structure, the subproblems can be solved one after another without allocating many AtA/Atb buffers. 4. The input ratings are stored in small batches to avoid ser/de overhead. 5. LAPACK's dppsv is used instead of dposv. The former only needs the triangular part. Double is used for constructing the normal equation for accuracy. 6. Use `TimSort` to create the `InBlock`. I will share the code soon, which is a little messy at this time. > Improve ALS internal storage > ---------------------------- > > Key: SPARK-3541 > URL: https://issues.apache.org/jira/browse/SPARK-3541 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib > Reporter: Xiangrui Meng > Assignee: Xiangrui Meng > Original Estimate: 96h > Remaining Estimate: 96h > > The internal storage of ALS uses many small objects, which increases the GC > pressure and makes ALS difficult to scale to very large scale, e.g., 50 > billion ratings. In such cases, the full GC may take more than 10 minutes to > finish. That is longer than the default heartbeat timeout and hence executors > will be removed under default settings. > We can use primitive arrays to reduce the number of objects significantly. > This requires big change to the ALS implementation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org