[ 
https://issues.apache.org/jira/browse/SPARK-16611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15408513#comment-15408513
 ] 

Alok Singh commented on SPARK-16611:
------------------------------------

Hi [~shivaram]

 Thanks for the reply.

1) To illustrate what I meant by the broadcast variable issue, please refer to the
following example:

    # for example
    randomMatBr <- broadcast(sc, randomMat)

    worker <- function(r) { list(r[[1]] + 1) }   # R lists are 1-indexed
    o1 <- dapply(df, worker, out_sch)            # case 1

    o2 <- SparkR:::lapply(df, worker)            # case 2

    useBroadcast <- function(x) { sum(value(randomMatBr) * x) }
    o3 <- SparkR:::lapply(SparkR:::toRDD(df), useBroadcast)   # case 3

 Notes:
  - The user intends to use case 3, so he created the broadcast variable, but he
also wants to compute either o1 or o2 (for other use cases). In case 1 and
case 2 he knows he will never use the broadcast elements, yet in case 1 the
framework will nevertheless ship every element in ls(broadcastArr) to each
node; in case 2 it won't.
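For reference, here is a minimal end-to-end sketch of the case-3 pattern. It assumes the private pre-2.0 SparkR API (sparkR.init, SparkR:::broadcast, SparkR:::value, SparkR:::parallelize, SparkR:::lapply, SparkR:::collect), so the exact names may differ between releases:

```r
library(SparkR)
sc <- sparkR.init(master = "local[2]")        # pre-2.0 entry point

randomMat   <- matrix(1:4, nrow = 2)
randomMatBr <- SparkR:::broadcast(sc, randomMat)

# the worker closure pulls the broadcast value on the executors
useBroadcast <- function(x) { sum(SparkR:::value(randomMatBr) * x) }

rdd <- SparkR:::parallelize(sc, list(1, 2, 3))
o3  <- SparkR:::collect(SparkR:::lapply(rdd, useBroadcast))
```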

2) If there is a single way of getting the RDD from a DataFrame, i.e. toRDD as
you suggested, that would be great :)
  But is it also going to work with a pipelined RDD/DataFrame?

 Here is one example to illustrate the point:
 # custom read.csv
 
 parseFields <- function(record) {
   Sys.setlocale("LC_ALL", "C")  # necessary for strsplit() to work correctly
   nrecord <- as.character(record)
   parts <- strsplit(nrecord, ",")[[1]]
   list(id = parts[1], title = parts[2], modified = parts[3],
        text = parts[4], username = parts[5])
 }

  pr <- SparkR:::lapply(f, parseFields)  # f is an RDD of raw CSV lines
  cache(pr)
  sch <- structType(structField("id", "string"), structField("title", "string"),
                    structField("modified", "string"), structField("text", "string"),
                    structField("username", "string"))
  air_df <- createDataFrame(sqlContext, pr, sch)


  # now we pass air_df's RDD to SystemML
  The current air_df is a pipelined DataFrame, and getJRDD returns the proper
RDD, but when I used toRDD in my last experiment it didn't work properly.
 # please note that in 2.0 we will have read.csv, but the point is that the
user can have an arbitrary pipelined RDD/DataFrame. Will toRDD also work with a
pipelined RDD/DataFrame?
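To make the question concrete, the round trip I have in mind looks like the following sketch (again using private SparkR helpers, so treat the exact names as assumptions):

```r
# air_df was built above from the pipelined RDD `pr`
air_rdd <- SparkR:::toRDD(air_df)   # DataFrame -> RDD of R lists

# does this materialize correctly when the lineage is a pipelined RDD?
head1 <- SparkR:::take(air_rdd, 1L)

# getJRDD reaches the underlying Java RDD directly -- this is what we
# currently hand off to SystemML, and it works today
jrdd <- SparkR:::getJRDD(air_rdd)
```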




Thanks for the confirmation that we are not removing the RDD API yet and that
only a rename is the goal :)

Alok

> Expose several hidden DataFrame/RDD functions
> ---------------------------------------------
>
>                 Key: SPARK-16611
>                 URL: https://issues.apache.org/jira/browse/SPARK-16611
>             Project: Spark
>          Issue Type: Improvement
>          Components: SparkR
>            Reporter: Oscar D. Lara Yejas
>
> Expose the following functions:
> - lapply or map
> - lapplyPartition or mapPartition
> - flatMap
> - RDD
> - toRDD
> - getJRDD
> - cleanup.jobj
> cc:
> [~javierluraschi] [~j...@rstudio.com] [~shivaram]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
