RE: How to get progress information of an RDD operation

2016-02-24 Thread Wang, Ningjun (LNG-NPV)
Yes, I am looking for a programmatic way of tracking progress.
SparkListener.scala does not track at the RDD item level, so it will not tell
how many items have been processed.

I wonder whether there is any way to track the accumulator value, since it
reflects the correct number of items processed so far?
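
To make the accumulator idea concrete, here is a minimal sketch, not a tested
solution. It assumes the accumulator can be read on the driver while the job
is still running, and that a plain java.util.concurrent timer thread on the
driver is acceptable. The names AccumulatorProgress, processWithProgress and
periodSeconds are made up for illustration, and handleLine is only a stand-in
for the real per-line work.

```scala
import java.util.concurrent.{Executors, TimeUnit}
import org.apache.spark.SparkContext

object AccumulatorProgress {
  // Hypothetical stand-in for the real per-line work from the question.
  def handleLine(line: String): Unit = ()

  // Processes every line of `path` and, from a driver-side timer thread,
  // prints the accumulator value every `periodSeconds` seconds.
  // Returns the final count once the job has finished.
  def processWithProgress(sc: SparkContext, path: String, periodSeconds: Long): Long = {
    val lines = sc.textFile(path)
    val acCount = sc.accumulator(0L)

    // The timer runs only on the driver; it is never shipped to executors.
    val reporter = Executors.newSingleThreadScheduledExecutor()
    reporter.scheduleAtFixedRate(new Runnable {
      override def run(): Unit =
        println(s"Lines processed so far: ${acCount.value}")
    }, periodSeconds, periodSeconds, TimeUnit.SECONDS)

    try {
      // foreach blocks the calling thread until the whole job is done.
      lines.foreach { line =>
        handleLine(line)
        acCount += 1
      }
    } finally {
      // Stop the timer thread so it does not keep the JVM alive.
      reporter.shutdownNow()
    }
    acCount.value
  }
}
```

For the one-minute interval this would be called as
AccumulatorProgress.processWithProgress(sc, "c:/temp/input.txt", 60). As far as
I understand, the driver only sees accumulator updates when a task completes,
so the printed number would advance one partition at a time rather than line
by line. Later Spark versions offer sc.longAccumulator for the same purpose.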

Ningjun

From: Ted Yu [mailto:yuzhih...@gmail.com]
Sent: Tuesday, February 23, 2016 2:30 PM
To: Kevin Mellott
Cc: Wang, Ningjun (LNG-NPV); user@spark.apache.org
Subject: Re: How to get progress information of an RDD operation

I think Ningjun was looking for a programmatic way of tracking progress.

I took a look at:
./core/src/main/scala/org/apache/spark/scheduler/SparkListener.scala

but there don't seem to be any fine-grained events that directly reflect what
Ningjun is looking for.

On Tue, Feb 23, 2016 at 11:24 AM, Kevin Mellott
<kevin.r.mell...@gmail.com> wrote:
Have you considered using the Spark Web UI to view progress on your job? It
does a very good job of showing the progress of the overall job, and it also
lets you drill into the individual tasks and server activity.

On Tue, Feb 23, 2016 at 12:53 PM, Wang, Ningjun (LNG-NPV)
<ningjun.w...@lexisnexis.com> wrote:
How can I get progress information of an RDD operation? For example

val lines = sc.textFile("c:/temp/input.txt")  // an RDD of millions of lines
lines.foreach(line => {
  handleLine(line)
})

The input.txt contains millions of lines. The entire operation takes 6 hours.
I want to print out how many lines have been processed every 1 minute so the
user knows the progress. How can I do that?

One way I am thinking of is to use an accumulator, e.g.

val lines = sc.textFile("c:/temp/input.txt")
val acCount = sc.accumulator(0L)
lines.foreach(line => {
  handleLine(line)
  acCount += 1  // counts one processed line
})

However, how can I print out acCount every 1 minute?


Ningjun





Re: How to get progress information of an RDD operation

2016-02-23 Thread Ted Yu
I think Ningjun was looking for a programmatic way of tracking progress.

I took a look at:
./core/src/main/scala/org/apache/spark/scheduler/SparkListener.scala

but there don't seem to be any fine-grained events that directly reflect
what Ningjun is looking for.
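
For reference, a minimal sketch of the granularity SparkListener does expose,
which is per task rather than per RDD item. The class name
TaskProgressListener and its total/finished accessors are made up for
illustration; SparkListener, onStageSubmitted, onTaskEnd and
sc.addSparkListener are the existing hooks. Treat it as an untested outline.

```scala
import java.util.concurrent.atomic.AtomicInteger
import org.apache.spark.Success
import org.apache.spark.scheduler.{SparkListener, SparkListenerStageSubmitted, SparkListenerTaskEnd}

// Counts submitted and successfully finished tasks on the driver.
class TaskProgressListener extends SparkListener {
  private val totalTasks = new AtomicInteger(0)
  private val finishedTasks = new AtomicInteger(0)

  def total: Int = totalTasks.get
  def finished: Int = finishedTasks.get

  // Each submitted stage announces how many tasks it will run.
  override def onStageSubmitted(stageSubmitted: SparkListenerStageSubmitted): Unit = {
    totalTasks.addAndGet(stageSubmitted.stageInfo.numTasks)
  }

  // Fired once per task attempt; only successful attempts are counted.
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    if (taskEnd.reason == Success) {
      val done = finishedTasks.incrementAndGet()
      println(s"Tasks finished: $done of ${totalTasks.get}")
    }
  }
}
```

It would be registered with sc.addSparkListener(new TaskProgressListener)
before the action is started. With one task per partition, this reports
progress in partition-sized steps, not in lines.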



Re: How to get progress information of an RDD operation

2016-02-23 Thread Kevin Mellott
Have you considered using the Spark Web UI to view progress on your job? It
does a very good job of showing the progress of the overall job, and it also
lets you drill into the individual tasks and server activity.
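
The same task-level numbers should also be readable in code through
sc.statusTracker. A rough sketch, assuming it is called from a separate
driver-side thread while the job runs; StageProgress and activeStageProgress
are names made up for illustration.

```scala
import org.apache.spark.SparkContext

object StageProgress {
  // Returns one summary line per currently active stage.
  // getStageInfo yields None if the stage is no longer tracked,
  // in which case that stage is simply skipped.
  def activeStageProgress(sc: SparkContext): Seq[String] = {
    val tracker = sc.statusTracker
    for {
      stageId <- tracker.getActiveStageIds().toSeq
      info <- tracker.getStageInfo(stageId)
    } yield s"Stage $stageId (${info.name()}): " +
      s"${info.numCompletedTasks()} of ${info.numTasks()} tasks completed"
  }
}
```

Calling StageProgress.activeStageProgress(sc).foreach(println) once a minute
would give a console version of the stage progress bar, again at task level
rather than per line.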
