[
https://issues.apache.org/jira/browse/SPARK-10935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15006824#comment-15006824
]
Kristina Plazonic commented on SPARK-10935:
-------------------------------------------
[~xusen] Thanks for pinging. Yes, I think I resolved it: I disabled Tungsten
and the error went away (at least on a smaller case).
However, I think I'm being very inefficient when generating the features for
this problem because of all the joins. Do you have any pointers on that?
[~mengxr], I think it would really help data scientists to have a short
guide to feature assembly in Spark: what to do and what not to do when using
joins, especially with ML (i.e. with DataFrames). I spent an inordinate
amount of time on that, and I'm still confused!
For example, should I use DataFrames at all when doing joins? Is it better to
use RDDs, since RDDs can be partitioned by key but DataFrames cannot? In this
example every join is on UserID and there are 4 million users, so if the
DataFrames were partitioned by UserID, every join would be local.
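To make the partitioning idea concrete, here is a minimal sketch in plain
Python (deliberately not Spark API) of why co-partitioning by key keeps a join
local: if both sides send a given UserID to the same partition number, each
partition pair can be joined on its own without looking at any other
partition. All names here (partition_by_key, local_join, copartitioned_join,
NUM_PARTITIONS) and the sample rows are hypothetical and only for
illustration.

```python
from collections import defaultdict

NUM_PARTITIONS = 4  # hypothetical partition count, chosen only for illustration


def partition_by_key(rows, num_partitions=NUM_PARTITIONS):
    """Place each (key, value) row in the partition chosen by hashing its key."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in rows:
        partitions[hash(key) % num_partitions].append((key, value))
    return partitions


def local_join(left_part, right_part):
    """Inner-join two partitions that were built with the same partitioner."""
    right_by_key = defaultdict(list)
    for key, value in right_part:
        right_by_key[key].append(value)
    return [
        (key, (left_value, right_value))
        for key, left_value in left_part
        for right_value in right_by_key.get(key, [])
    ]


def copartitioned_join(left_rows, right_rows, num_partitions=NUM_PARTITIONS):
    """Join partition i of the left side only with partition i of the right."""
    left = partition_by_key(left_rows, num_partitions)
    right = partition_by_key(right_rows, num_partitions)
    joined = []
    for left_part, right_part in zip(left, right):
        joined.extend(local_join(left_part, right_part))
    return joined


# Made-up sample data keyed by an integer user id.
clicks = [(1, "ad_a"), (2, "ad_b"), (1, "ad_c"), (7, "ad_d")]
profiles = [(1, "user_one"), (2, "user_two"), (3, "user_three")]

result = sorted(copartitioned_join(clicks, profiles))
```

Because the same hash function and partition count are used on both sides,
rows with equal keys always land in the same partition index, which is the
property that would make each join local in a distributed setting.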
Another example: when I started seeing memory errors with the joins, I began
to wonder whether a whole DataFrame passed into a function is captured in the
function's closure, with a copy shipped off with every task. Or does Spark
take into account that the argument is a distributed object, so that only a
reference to each partition of the object is passed? I still don't know for
sure. All the examples on the Spark website, in the docs, and even in books
are scripts, not functions that take RDD or DataFrame arguments.
Thanks for any insights...
> Avito Context Ad Clicks
> -----------------------
>
> Key: SPARK-10935
> URL: https://issues.apache.org/jira/browse/SPARK-10935
> Project: Spark
> Issue Type: Sub-task
> Components: ML
> Reporter: Xiangrui Meng
>
> From [[email protected]]:
> I would love to do Avito Context Ad Clicks -
> https://www.kaggle.com/c/avito-context-ad-clicks - but it involves a lot of
> feature engineering and preprocessing. I would love to split this with
> somebody else if anybody is interested in working on this.