[
https://issues.apache.org/jira/browse/SPARK-22039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Hyukjin Kwon resolved SPARK-22039.
----------------------------------
Resolution: Invalid
Questions should go to the mailing list. Let's start this discussion on the
mailing list first rather than filing it as a JIRA.
> Spark 2.1.1 Driver OOM when use interaction for large scale Sparse Vector
> -------------------------------------------------------------------------
>
> Key: SPARK-22039
> URL: https://issues.apache.org/jira/browse/SPARK-22039
> Project: Spark
> Issue Type: Question
> Components: ML
> Affects Versions: 2.1.1
> Reporter: wuhaibo
>
> I'm working on large-scale logistic regression for CTR prediction, and when
> I use Interaction for feature engineering, the driver OOMs. In detail, I
> interact userid (one-hot encoded, ~300,000 ("30w") dimensions, sparse) with
> the base features (60 dimensions, dense); driver memory is set to 40g.
> So I tried debugging remotely, and I found that Spark's Interaction creates a
> very large schema, and a lot of the work is done on the driver.
> There are two questions:
> Reading the source, I found that Interaction is implemented with sparse
> vectors, so it should not need this much memory; why does it need to do this
> work on the driver? The interaction result is an ~18,000,000 ("1800w")
> dimension sparse dataframe; why is a schema with 18,000,000 StructFields so
> big? This is the dump file from when the schema begins to be created;
> because it is too big, I can't dump all of it:
> https://i.stack.imgur.com/h0XBf.jpg
> So I implemented the interaction method with RDDs, and the job finishes in
> 5 min, so I am wondering whether something is wrong here.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]