[ 
https://issues.apache.org/jira/browse/SPARK-22039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16169169#comment-16169169
 ] 

wuhaibo commented on SPARK-22039:
---------------------------------

Sorry, could you please tell me which mailing list that is?





-- 
fun coding


> Spark 2.1.1 Driver OOM when use interaction for large scale Sparse Vector
> -------------------------------------------------------------------------
>
>                 Key: SPARK-22039
>                 URL: https://issues.apache.org/jira/browse/SPARK-22039
>             Project: Spark
>          Issue Type: Question
>          Components: ML
>    Affects Versions: 2.1.1
>            Reporter: wuhaibo
>
> I'm working on large-scale logistic regression for CTR prediction, and the 
> driver runs out of memory (OOM) when I use Interaction for feature 
> engineering. In detail, I interact userid (one-hot, 300,000 dimensions, 
> sparse) with the base features (60 dimensions, dense); driver memory is set 
> to 40g.
> So I tried remote debugging, and I found that Spark's Interaction creates a 
> very large schema and that a lot of work is done on the driver.
> I have two questions:
> First, from reading the source, I found that Interaction is implemented with 
> sparse vectors, so it should not need much memory. Why does this work have 
> to be done on the driver? Second, the interaction result is a sparse 
> DataFrame with 18,000,000 dimensions. Why is a schema with 18,000,000 
> StructFields so big? This is the dump file from when the schema begins to be 
> created; because it is too big, I can't dump all of it: 
> https://i.stack.imgur.com/h0XBf.jpg
> So I implemented the interaction with RDDs instead, and the job finishes in 
> 5 minutes, so I am wondering whether something is wrong here.
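
For illustration only, here is a minimal sketch in plain Python (no Spark) of
why the interaction of a one-hot vector with a dense vector stays cheap when
kept sparse: the output has 300,000 * 60 = 18,000,000 dimensions, but at most
60 of them are non-zero per row. The function name and the index layout
(category index times dense length, plus dense position) are assumptions made
for this sketch, not taken from the Spark source or from the reporter's RDD
code.

```python
from typing import Dict, List, Tuple


def interact_one_hot_with_dense(
    hot_index: int, num_categories: int, dense: List[float]
) -> Tuple[int, Dict[int, float]]:
    """Interact a one-hot vector with a dense vector, keeping the result sparse.

    Returns (size, entries), where entries maps output index -> non-zero value.
    Assumed layout (hypothetical): output index = hot_index * len(dense) + j.
    """
    if not 0 <= hot_index < num_categories:
        raise ValueError("hot_index out of range")
    size = num_categories * len(dense)
    offset = hot_index * len(dense)
    # Only the block belonging to the active category can be non-zero.
    entries = {offset + j: v for j, v in enumerate(dense) if v != 0.0}
    return size, entries


# Sizes from the issue: 300,000 one-hot categories and 60 dense features.
size, entries = interact_one_hot_with_dense(2, 300_000, [0.5] * 60)
print(size, len(entries))
```

Per row, only the non-zero entries are stored, so the memory cost of the
vector itself scales with the 60 dense features rather than with the
18,000,000 output dimensions.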



