Exactly! As a final note, `foreach` is also defined on RDDs. This means that you don't need to `collect()` the results into an array before printing them (collecting could give you an OutOfMemoryError on the driver if the RDD is very large).
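A minimal sketch of the options, assuming local mode and a few made-up sample lines standing in for the real files. The object name `PrintMatchesSketch`, the helper `sedLines`, and the sample data are illustrative and not from this thread. Keep in mind that `foreach` on an RDD runs on the executors, so on a cluster the output ends up in the executors' stdout rather than on the driver console; `take(n)` is a bounded alternative when you only want to look at a sample on the driver.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object PrintMatchesSketch {
  // Same predicate as in the thread: keep only lines containing "sed".
  def sedLines(lines: RDD[String]): RDD[String] =
    lines.filter(line => line.contains("sed"))

  def main(args: Array[String]): Unit = {
    // Assumption: local mode, so executor output appears in this console.
    val conf = new SparkConf().setAppName("print-matches-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // Hypothetical sample data in place of sc.textFile("/unix_files/*.ksh").
      val errlog = sc.parallelize(Seq(
        "PROGNAME=$(basename $0 | sed -e s/.ksh//)",
        "exec sp_spaceused",
        "cat file.sql | sed s/a/b/ > file.tmpsql"
      ))
      val matches = sedLines(errlog)

      // 1. foreach directly on the RDD: nothing is gathered on the driver.
      matches.foreach(println)

      // 2. take(n): at most n elements are brought to the driver, then printed.
      matches.take(2).foreach(println)

      // 3. collect(): the whole RDD is materialised on the driver first;
      //    only reasonable when the result is known to be small.
      matches.collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

In `spark-shell` the context `sc` already exists, so only the lines operating on `errlog` and `matches` are needed there.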
Personally, when I am learning to use a new library, I like to look at its Scaladoc (http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD for Spark) and try things out in the REPL or a worksheet (for Spark you already have `spark-shell`).

best,
--Jakob

On Wed, Feb 10, 2016 at 3:52 PM, Mich Talebzadeh <mich.talebza...@cloudtechnologypartners.co.uk> wrote:
> Many thanks Jakob.
>
> So it basically boils down to this demarcation as suggested which looks
> clearer
>
> val errlog = sc.textFile("/unix_files/*.ksh")
> errlog.filter(line => line.contains("sed")).collect().foreach(line => println(line))
>
> Regards,
>
> Mich
>
> On 10/02/2016 23:21, Jakob Odersky wrote:
>
> Hi Mich,
> your assumptions 1 to 3 are all correct (nitpick: they're method
> *calls*, the methods being the part before the parentheses, but I
> assume that's what you meant). The last one is also a method call but
> uses syntactic sugar on top: `foreach(println)` boils down to
> `foreach(line => println(line))`.
>
> On an unrelated side-note, I would suggest you add a period between
> every method call, it makes things easier to read and is actually
> required in certain circumstances. Specifically I would add a period
> before collect() and foreach().
>
> best,
> --Jakob
>
> On Wed, Feb 10, 2016 at 2:35 PM, Mich Talebzadeh
> <mich.talebza...@cloudtechnologypartners.co.uk> wrote:
>
> Hi Chandeep
>
> Many thanks for your help
>
> In the line below
>
> errlog.filter(line => line.contains("sed"))collect()foreach(println)
>
> Can you please clarify the components with the correct naming as I am new to Scala
>
> errlog --> is the RDD?
> filter(line => line.contains("sed")) is a method
> collect() is another method ?
> foreach (println) ?
>
> Thanks
>
> On 10/02/2016 21:28, Chandeep Singh wrote:
>
> Hi Mich,
>
> If you would like to print everything to the console you could -
> errlog.filter(line => line.contains("sed"))collect()foreach(println)
> or you could always save to a file using any of the saveAs methods.
> Thanks,
> Chandeep
>
> On Wed, Feb 10, 2016 at 8:14 PM, <mich.talebza...@cloudtechnologypartners.co.uk> wrote:
>
> Hi,
>
> I have a bunch of files stored in hdfs /unit_files directory, in total 319 files
>
> scala> val errlog = sc.textFile("/unix_files/*.ksh")
>
> scala> errlog.filter(line => line.contains("sed"))count()
> res104: Long = 1113
>
> So it returns 1113 instances of the word "sed"
>
> If I want to see the collection I can do
>
> scala> errlog.filter(line => line.contains("sed"))collect()
> res105: Array[String] = Array(" DSQUERY=${1} ; DBNAME=${2} ; ERROR=0 ; PROGNAME=$(basename $0 | sed -e s/.ksh//)", # . in environment based on argument for script., " exec sp_spaceused", " exec sp_spaceused", PROGNAME=$(basename $0 | sed -e s/.ksh//), " BACKUPSERVER=$5 # Server that is used to load the transaction dump", " BACKUPSERVER=$5 # Server that is used to load the transaction dump", " BACKUPSERVER=$5 # Server that is used to load the transaction dump", " cat $TMPDIR/${DBNAME}_trandump.sql | sed s/${DSQUERY}/${REMOTESERVER}/ > $TMPDIR/${DBNAME}_trandump.tmpsql", cat $TMPDIR/${DBNAME}_tran_transfer.sql | sed s/${DSQUERY}/${REMOTESERVER}/ > $TMPDIR/${DBNAME}_tran_transfer.tmpsql, PROGNAME=$(basename $0 | sed -e s/.ksh//), " B...
>
> scala>
>
> Now is there any way I can retrieve all these instances, or perhaps they are all wrapped up and I only see a few lines?
>
> Thanks,
>
> Mich
>
> --
> Dr Mich Talebzadeh
>
> LinkedIn https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> http://talebzadehmich.wordpress.com
>
> NOTE: The information in this email is proprietary and confidential. This message is for the designated recipient only, if you are not the intended recipient, you should destroy it immediately. Any information in this message shall not be understood as given or endorsed by Cloud Technology Partners Ltd, its subsidiaries or their employees, unless expressly so stated.
> It is the responsibility of the recipient to ensure that this email is virus free, therefore neither Cloud Technology partners Ltd, its subsidiaries nor their employees accept any responsibility.