wuhaibo created SPARK-22039:
-------------------------------
Summary: Spark 2.1.1 driver OOM when using Interaction for a
large-scale sparse vector
Key: SPARK-22039
URL: https://issues.apache.org/jira/browse/SPARK-22039
Project: Spark
Issue Type: Question
Components: ML
Affects Versions: 2.1.1
Reporter: wuhaibo
I'm working on large-scale logistic regression for CTR prediction, and when
using Interaction for feature engineering, the driver OOMs. In detail, I
interact a userid feature (one-hot, 300k dimensions, sparse) with base
features (60 dimensions, dense); driver memory is set to 40g.
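For scale, the output dimensionality of the interaction follows directly from the two input sizes; a quick sketch of the arithmetic (values taken from the setup above):

```python
# Dimensionality of interacting a 300k-dim one-hot feature with 60 dense features.
userid_dims = 300_000   # one-hot sparse userid feature
base_dims = 60          # dense base features
output_dims = userid_dims * base_dims
print(output_dims)      # 18000000, i.e. an 18-million-dimension output space
```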
So I attached a remote debugger, and I found that Spark's Interaction creates
a very large schema, and a lot of work is being done on the driver.
There are two questions:
1. Reading the source, I found Interaction is implemented with sparse vectors,
so it should not need so much memory; why does it need to do this on the
driver?
2. The interaction result is an 18-million-dimension sparse DataFrame; why do
18 million StructFields for the schema take so much space? This is the heap
dump taken when the schema begins to be created (it is too big to dump
completely):
https://i.stack.imgur.com/h0XBf.jpg
I reimplemented the interaction with the RDD API, and the job finishes in
5 min, so I am wondering whether something is wrong here.
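For reference, the interaction itself can be computed without materializing the full 18-million-dimension space, since the output of a sparse-by-dense interaction has at most nnz(a) * len(b) non-zeros. This is a minimal pure-Python sketch of that idea (a hypothetical helper, not Spark's Interaction implementation nor the exact RDD rewrite mentioned above):

```python
def interact_sparse(size_a, indices_a, values_a, dense_b):
    """Interact a sparse vector (size_a, indices_a, values_a) with a dense
    vector dense_b: the flattened outer product, returned in sparse form
    as (size, indices, values) with size = size_a * len(dense_b)."""
    out_indices, out_values = [], []
    for i, va in zip(indices_a, values_a):
        for j, vb in enumerate(dense_b):
            v = va * vb
            if v != 0.0:
                out_indices.append(i * len(dense_b) + j)
                out_values.append(v)
    return size_a * len(dense_b), out_indices, out_values

# A one-hot userid (a single active index) interacted with 60 dense features
# yields at most 60 non-zeros out of 18,000,000 output dimensions.
size, idx, vals = interact_sparse(300_000, [12345], [1.0], [0.5] * 60)
```

Because only the non-zero entries are ever stored, the per-row memory cost is tiny regardless of the nominal output dimension; the cost reported in this issue appears to come from the driver-side schema objects, not the vector data.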
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]