I think you're looking for RDD.foreach()<http://spark.apache.org/docs/latest/api/pyspark/pyspark.rdd.RDD-class.html#foreach> .
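For what it's worth, one way around the pickling error in your example below might be to build the logger inside the function itself, so the logger object (and the lock it holds) is never captured in the closure and serialized. A rough, untested sketch (the "task" logger name is arbitrary; the output should end up in the worker/executor logs rather than on the driver console):

    import logging

    def log_element(x):
        # Create the logger on the worker, inside the function, so nothing
        # holding a thread lock has to be pickled along with the closure.
        logger = logging.getLogger("task")
        if not logger.handlers:
            logger.addHandler(logging.StreamHandler())
        logger.setLevel(logging.INFO)
        logger.info("processing element: %s", x)

    myrdd.foreach(log_element)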
According to the programming guide<http://spark.apache.org/docs/latest/scala-programming-guide.html>:

> Run a function func on each element of the dataset. This is usually done
> for side effects such as updating an accumulator variable (see below) or
> interacting with external storage systems.

Do you really want to log something for each element of your RDD?

Nick


On Tue, May 6, 2014 at 3:31 PM, Diana Carroll <dcarr...@cloudera.com> wrote:

> What should I do if I want to log something as part of a task?
>
> This is what I tried. To set up a logger, I followed the advice here:
> http://py4j.sourceforge.net/faq.html#how-to-turn-logging-on-off
>
> logger = logging.getLogger("py4j")
> logger.setLevel(logging.INFO)
> logger.addHandler(logging.StreamHandler())
>
> This works fine when I call it from my driver (i.e. pyspark):
> logger.info("this works fine")
>
> But I want to try logging within a distributed task, so I did this:
>
> def logTestMap(a):
>     logger.info("test")
>     return a
>
> myrdd.map(logTestMap).count()
>
> and got:
> PicklingError: Can't pickle 'lock' object
>
> So it's trying to serialize my function and can't because of a lock object
> used in logger, presumably for thread-safety. But then...how would I do it?
> Or is this just a really bad idea?
>
> Thanks,
> Diana