Never mind :) I found my answer in the docs for PipedRDD:

    /**
     * An RDD that pipes the contents of each parent partition through an external command
     * (printing them one per line) and returns the output as a collection of strings.
     */
    private[spark] class PipedRDD[T: ClassTag](
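So yes, the standard out of the process is captured. A minimal sketch from the spark-shell (untested; it assumes sc is the usual SparkContext and that the cat command is available on every worker node):

    val data = sc.parallelize(Seq("foo", "bar", "baz"))
    // pipe() writes each element to the child process's stdin, one per
    // line, and returns the lines the child writes to stdout as a new
    // RDD[String].
    val piped = data.pipe("cat")
    piped.collect().foreach(println)   // cat echoes foo, bar, baz back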
So, this is essentially an implementation of something analogous to Hadoop's streaming API.

On Sun, Jul 20, 2014 at 4:09 PM, jay vyas <jayunit100.apa...@gmail.com> wrote:

> According to the API docs for the pipe operator,
>
>     def pipe(command: String): RDD[String]
>     <http://spark.apache.org/docs/1.0.0/api/scala/org/apache/spark/rdd/RDD.html>
>
> Return an RDD created by piping elements to a forked external process.
>
> However, it's not clear to me:
>
> Will the resulting RDD capture the standard out from the process as its
> output (I assume that is the most common implementation)?
>
> Incidentally, I have not been able to use the pipe command to run an
> external process yet, so any hints on that would be appreciated.
>
> --
> jay vyas

--
jay vyas