[jira] [Commented] (SPARK-10935) Avito Context Ad Clicks

Kristina Plazonic (JIRA) Mon, 16 Nov 2015 07:58:05 -0800

    [ 
https://issues.apache.org/jira/browse/SPARK-10935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15006824#comment-15006824
 ]


Kristina Plazonic commented on SPARK-10935:
-------------------------------------------

[~xusen] Thanks for pinging. Yes, I think I resolved it: I disabled Tungsten 
and that made the error go away (on a smaller case). 

However, I think I'm being super inefficient when generating the features for 
this problem - because of all the joins. Do you have any pointers on that?

[~mengxr], I think it would really help data scientists to have a short 
guide to feature assembly in Spark: what to do and what not to do when using 
joins, especially when using ML, i.e. DataFrames. I spent an inordinate 
amount of time on that, and I'm still confused!!! 

For example, should I use DataFrames at all when doing joins? Is it better to 
use RDDs, because you can partition RDDs by key, but not DataFrames? (E.g. in 
this example every join is by UserID, and there are 4 million users, so if 
the DataFrames were partitioned by UserID, every join would be local.) 
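
To make the "every join would be local" idea concrete, here is a minimal 
sketch in plain Python. It is not Spark code: the function names and the toy 
`users`/`clicks` data are hypothetical, and the buckets are ordinary lists 
standing in for partitions. It only shows the principle that, once both sides 
are hash-partitioned by the same key with the same number of partitions, 
bucket i of one side only ever needs bucket i of the other. On Spark RDDs the 
corresponding step would be `partitionBy` with the same partitioner on both 
sides before the join.

```python
def partition_by_key(rows, num_partitions):
    """Place each (key, value) row in the bucket chosen by hash(key)."""
    buckets = [[] for _ in range(num_partitions)]
    for key, value in rows:
        buckets[hash(key) % num_partitions].append((key, value))
    return buckets


def local_join(left_part, right_part):
    """Inner-join two buckets with the same index; no other bucket is read."""
    index = {}
    for key, value in left_part:
        index.setdefault(key, []).append(value)
    return [
        (key, (left_value, right_value))
        for key, right_value in right_part
        for left_value in index.get(key, [])
    ]


def copartitioned_join(left, right, num_partitions=4):
    """Partition both sides the same way, then join bucket i with bucket i."""
    left_parts = partition_by_key(left, num_partitions)
    right_parts = partition_by_key(right, num_partitions)
    joined = []
    for left_part, right_part in zip(left_parts, right_parts):
        joined.extend(local_join(left_part, right_part))
    return joined


# Hypothetical toy data keyed by UserID.
users = [(1, "alice"), (2, "bob"), (3, "carol")]
clicks = [(1, "ad-17"), (3, "ad-42"), (3, "ad-99"), (4, "ad-5")]

result = sorted(copartitioned_join(users, clicks))
print(result)
```

The joined rows do not depend on the number of partitions; what changes is 
that each pair of buckets can be joined independently, without moving rows 
between buckets at join time.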

Another example: when I started seeing the memory errors with joins, I started 
asking myself whether a whole DataFrame passed into a function is included in 
the function's closure, with a copy shipped off with every task, or whether 
Spark takes account of the fact that whatever is passed as an argument of a 
function is a distributed object, so that only a reference to every partition 
of the object is passed in. I still don't really know for sure. All examples 
on the Spark website and in the docs, and even in books, are for scripts, not 
functions with RDD or DataFrame arguments. 

Thanks for any insights... 



> Avito Context Ad Clicks
> -----------------------
>
>                 Key: SPARK-10935
>                 URL: https://issues.apache.org/jira/browse/SPARK-10935
>             Project: Spark
>          Issue Type: Sub-task
>          Components: ML
>            Reporter: Xiangrui Meng
>
> From [[email protected]]:
> I would love to do Avito Context Ad Clicks - 
> https://www.kaggle.com/c/avito-context-ad-clicks - but it involves a lot of 
> feature engineering and preprocessing. I would love to split this with 
> somebody else if anybody is interested in working on this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
