I think you're looking for RDD.foreach()<http://spark.apache.org/docs/latest/api/pyspark/pyspark.rdd.RDD-class.html#foreach> .
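For what it's worth, one way around the pickling error in your example below might be to build the logger inside the function itself, so the logger object (and the lock it holds) is never captured in the closure and serialized. A rough, untested sketch (the "task" logger name is arbitrary; the output should end up in the worker/executor logs rather than on the driver console):

    import logging

    def log_element(x):
        # Create the logger on the worker, inside the function, so nothing
        # holding a thread lock has to be pickled along with the closure.
        logger = logging.getLogger("task")
        if not logger.handlers:
            logger.addHandler(logging.StreamHandler())
        logger.setLevel(logging.INFO)
        logger.info("processing element: %s", x)

    myrdd.foreach(log_element)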
According to the programming guide<http://spark.apache.org/docs/latest/scala-programming-guide.html>:

> Run a function func on each element of the dataset. This is usually done
> for side effects such as updating an accumulator variable (see below) or
> interacting with external storage systems.

Do you really want to log something for each element of your RDD?

Nick


On Tue, May 6, 2014 at 3:31 PM, Diana Carroll <dcarr...@cloudera.com> wrote:

> What should I do if I want to log something as part of a task?
>
> This is what I tried. To set up a logger, I followed the advice here:
> http://py4j.sourceforge.net/faq.html#how-to-turn-logging-on-off
>
> logger = logging.getLogger("py4j")
> logger.setLevel(logging.INFO)
> logger.addHandler(logging.StreamHandler())
>
> This works fine when I call it from my driver (i.e. pyspark):
> logger.info("this works fine")
>
> But I want to try logging within a distributed task, so I did this:
>
> def logTestMap(a):
>     logger.info("test")
>     return a
>
> myrdd.map(logTestMap).count()
>
> and got:
> PicklingError: Can't pickle 'lock' object
>
> So it's trying to serialize my function and can't because of a lock object
> used in logger, presumably for thread-safety. But then...how would I do it?
> Or is this just a really bad idea?
>
> Thanks,
> Diana