[ https://issues.apache.org/jira/browse/SPARK-31976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17136102#comment-17136102 ]

Xiao Li commented on SPARK-31976:
---------------------------------

I think [~podongfeng] just wants to target this feature for Spark 3.1 instead of 
treating it as a blocker. 

I made the change. Feel free to discuss it here if this is not what you want. 

> use MemoryUsage to control the size of block
> --------------------------------------------
>
>                 Key: SPARK-31976
>                 URL: https://issues.apache.org/jira/browse/SPARK-31976
>             Project: Spark
>          Issue Type: Sub-task
>          Components: ML, PySpark
>    Affects Versions: 3.1.0
>            Reporter: zhengruifeng
>            Priority: Major
>
> According to the performance test in 
> https://issues.apache.org/jira/browse/SPARK-31783, the performance gain is 
> mainly related to the nnz (number of nonzeros) of a block.
> So it may be reasonable to control the size of a block by its memory usage 
> instead of by its number of rows.
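>  
> As a minimal sketch of that idea (a hypothetical helper, not existing Spark 
> code, with an assumed per-entry cost), the number of rows to stack per block 
> could be derived from a memory budget in MB and the average nnz per row:
> {code:scala}
> // Hypothetical sketch: derive the number of rows to stack per block
> // from a memory budget and the average nnz per row. Assumes ~12 bytes
> // per stored nonzero of a sparse vector (8-byte double value + 4-byte
> // int index); a dense vector costs ~8 bytes per element instead.
> def rowsPerBlock(maxMemoryInMB: Int, avgNnzPerRow: Double): Int = {
>   val budgetBytes = maxMemoryInMB.toLong * 1024L * 1024L
>   val bytesPerRow = math.max(1.0, avgNnzPerRow) * 12.0
>   math.max(1, (budgetBytes / bytesPerRow).toInt)
> }
> {code}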
>  
> note1: the param blockSize is already used in ALS and MLP to stack vectors 
> (which are expected to be dense);
> note2: we may refer to {{Strategy.maxMemoryInMB}} in the tree models;
>  
> There may be two ways to implement this:
> 1, compute the sparsity of the input vectors ahead of training (this can be 
> computed together with other statistics, so maybe no extra pass is needed), 
> and infer a reasonable number of vectors to stack;
> 2, stack the input vectors adaptively, by monitoring the memory usage of a 
> block (see the sketch after this list);
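>  
> A sketch of way 2 (hypothetical code with an assumed per-entry cost, not 
> Spark's actual implementation):
> {code:scala}
> import org.apache.spark.ml.linalg.Vector
> import scala.collection.mutable.ArrayBuffer
>
> // Group an iterator of vectors into blocks, closing a block once its
> // estimated footprint reaches the byte budget (it may overshoot by one
> // row). Intended to run per partition, e.g. inside rdd.mapPartitions.
> def adaptiveBlocks(rows: Iterator[Vector], budgetBytes: Long): Iterator[Array[Vector]] =
>   new Iterator[Array[Vector]] {
>     def hasNext: Boolean = rows.hasNext
>     def next(): Array[Vector] = {
>       val buf = ArrayBuffer.empty[Vector]
>       var bytes = 0L
>       while (rows.hasNext && (buf.isEmpty || bytes < budgetBytes)) {
>         val v = rows.next()
>         buf += v
>         bytes += v.numActives * 12L  // rough cost per active entry
>       }
>       buf.toArray
>     }
>   }
> {code}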


