[ https://issues.apache.org/jira/browse/SPARK-16611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15408513#comment-15408513 ]
Alok Singh commented on SPARK-16611:
------------------------------------

Hi [~shivaram], thanks for the reply.

1) To illustrate what I meant by the broadcast variable issue, consider the following example (note: R is 1-indexed, so the worker uses r[[1]]):

    randomMatBr <- broadcast(sc, randomMat)
    worker <- function(r) { list(r[[1]] + 1) }
    o1 <- dapply(df, worker, out_sch)                  # case 1
    o2 <- lapply(df, worker)                           # case 2
    useBroadcast <- function(x) { sum(value(randomMatBr) * x) }
    o3 <- lapply(SparkR:::toRDD(df), useBroadcast)     # case 3

Notes: the user intends case 3, which is why he created the broadcast variable in the first place, but he also wants to compute o1 or o2 for other use cases. In case 1 and case 2 he knows the broadcast elements will never be used; yet in case 1 the framework still ships every element in ls(broadcastArr) to each node, while in case 2 it does not.

2) If there is a single way of getting the RDD from a DataFrame, i.e. toRDD as you had suggested, that would be great :) But will it also work with a pipelined RDD/DataFrame? Here is one example to illustrate the point:

    # custom read.csv-style parser
    parseFields <- function(record) {
      Sys.setlocale("LC_ALL", "C")   # necessary for strsplit() to work correctly
      nrecord <- as.character(record)
      parts <- strsplit(nrecord, ",")[[1]]
      list(id = parts[1], title = parts[2], modified = parts[3],
           text = parts[4], username = parts[5])
    }
    pr <- SparkR:::lapply(f, parseFields)
    cache(pr)
    sch <- structType(structField("id", "string"),
                      structField("title", "string"),
                      structField("modified", "string"),
                      structField("text", "string"),
                      structField("username", "string"))
    air_df <- createDataFrame(sqlContext, pr, sch)
    # now we pass air_df's RDD to SystemML

Here air_df is a pipelined DataFrame. getJRDD returns the proper RDD for it, but when I used toRDD instead, my last experiment did not work properly. Please note that in 2.0 we will have read.csv, but the point is that a user can have an arbitrary pipelined RDD/DataFrame. Will toRDD also work with a pipelined RDD/DataFrame?
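To make the shipping cost in point 1 concrete, here is a small Python sketch. It is only an analogy (plain pickle of a dict, not SparkR's actual serialization path, and the variable names mirror the R example but are otherwise made up): a worker that never touches the broadcast matrix serializes to a few bytes when only its referenced variables are shipped (case 2), but the payload balloons when every broadcast variable is shipped unconditionally (case 1).

```python
import pickle

# Stand-in for the broadcast matrix; the size is only illustrative.
random_mat = list(range(100_000))

def worker(r):
    # Never references random_mat.
    return r + 1

# Case-2 analogue: ship only the variables the closure actually uses.
payload_lazy = pickle.dumps({})  # worker needs nothing extra

# Case-1 analogue: ship every broadcast variable unconditionally,
# as shipping everything in ls(broadcastArr) would.
payload_eager = pickle.dumps({"randomMatBr": random_mat})

print(len(payload_lazy), len(payload_eager))
# payload_eager is orders of magnitude larger, even though
# worker never reads the broadcast value.
```

This is the asymmetry the comment describes: the per-task cost in case 1 is paid whether or not the function reads the broadcast value.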
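For readers unfamiliar with what "pipelined" means in point 2, the toy Python model below may help. It is not Spark's implementation (the class and method names here are invented for illustration): transformations compose lazily instead of materializing intermediate results, which is why "the" underlying RDD of a pipelined chain is not a single materialized dataset.

```python
class ToyRDD:
    """Toy model of a pipelined RDD: map() composes functions lazily."""

    def __init__(self, data, fn=None):
        self.data = data
        self.fn = fn if fn is not None else (lambda x: x)

    def map(self, f):
        # No work happens here; we just stack f on top of the pipeline.
        prev = self.fn
        return ToyRDD(self.data, lambda x: f(prev(x)))

    def collect(self):
        # Only now does the composed pipeline run over the data.
        return [self.fn(x) for x in self.data]

rdd = ToyRDD([1, 2, 3]).map(lambda x: x * 10).map(lambda x: x + 1)
print(rdd.collect())  # [11, 21, 31]
```

In this model, each map() returns a new pipelined object whose function is the composition of everything before it; extracting an RDD from the middle of such a chain has to carry the pending transformations along, which is the subtlety the toRDD question is probing.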
Thanks for the confirmation that we are not removing the RDD API yet and that only the rename is the goal :)

Alok

> Expose several hidden DataFrame/RDD functions
> ---------------------------------------------
>
>                 Key: SPARK-16611
>                 URL: https://issues.apache.org/jira/browse/SPARK-16611
>             Project: Spark
>          Issue Type: Improvement
>          Components: SparkR
>            Reporter: Oscar D. Lara Yejas
>
> Expose the following functions:
> - lapply or map
> - lapplyPartition or mapPartition
> - flatMap
> - RDD
> - toRDD
> - getJRDD
> - cleanup.jobj
>
> cc: [~javierluraschi] [~j...@rstudio.com] [~shivaram]

--
This message was sent by Atlassian JIRA (v6.3.4#6332)