wuhaibo created SPARK-22039:
-------------------------------

             Summary: Spark 2.1.1 driver OOM when using Interaction for 
large-scale sparse vectors
                 Key: SPARK-22039
                 URL: https://issues.apache.org/jira/browse/SPARK-22039
             Project: Spark
          Issue Type: Question
          Components: ML
    Affects Versions: 2.1.1
            Reporter: wuhaibo


I'm working on large-scale logistic regression for CTR prediction, and when I 
use Interaction for feature engineering, the driver OOMs. In detail, I interact 
userid (one-hot, 300,000 dimensions, sparse) with the base features (60 
dimensions, dense), and driver memory is set to 40g.

So I tried to debug remotely, and I found that Spark's Interaction creates a 
very large schema, and a lot of the work is done on the driver.

There are two questions:

Reading the source, I found that Interaction is implemented with sparse 
vectors, so it should not need so much memory; why does it need to do this on 
the driver? The interaction result is an 18,000,000-dimension sparse 
DataFrame; why is a schema with 18,000,000 StructFields so big? This is the 
dump file taken when the schema begins to be created (it is too big to dump 
completely): 
https://i.stack.imgur.com/h0XBf.jpg
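To illustrate why the result should stay cheap, here is a minimal pure-Python 
sketch (not Spark's actual implementation; the function name and shapes are my 
own, chosen to match the dimensions reported above) of crossing a one-hot 
sparse vector with a dense vector. The output dimension is 300,000 x 60 = 
18,000,000, but only 60 entries are ever non-zero:

```python
def interact_onehot_dense(onehot_dim, active_index, dense_values):
    """Cross a one-hot sparse vector with a dense vector.

    Returns (result_dim, {index: value}): result_dim is the product of
    the two input dimensions, and the dict holds the non-zero entries.
    """
    dense_dim = len(dense_values)
    result_dim = onehot_dim * dense_dim
    # Only the block belonging to the single active one-hot index is non-zero.
    base = active_index * dense_dim
    nonzeros = {base + j: v for j, v in enumerate(dense_values) if v != 0.0}
    return result_dim, nonzeros

# Dimensions from the report: 300,000-dim one-hot x 60 dense features.
dim, nnz = interact_onehot_dense(300_000, active_index=12_345,
                                 dense_values=[1.0] * 60)
print(dim)        # output dimension: 18,000,000
print(len(nnz))   # non-zero values per row: only 60
```

So per row the interaction itself is tiny; the memory pressure comes from 
materializing schema/metadata for all 18,000,000 output columns on the driver.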

So I implemented the interaction with RDDs, and the job finishes in 5 minutes, 
so I am wondering whether something is wrong here.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
