That is not supposed to happen :/ That is probably a bug. If you have the log4j logs, it would be good to file a JIRA. This may be worth debugging.
On Wed, May 13, 2015 at 12:13 PM, Du Li <l...@yahoo-inc.com> wrote:

> Actually I tried that before asking. However, it killed the spark context. :-)
>
> Du
>
>
> On Wednesday, May 13, 2015 12:02 PM, Tathagata Das <t...@databricks.com> wrote:
>
> That is a good question. I don't see a direct way to do that.
>
> You could try the following:
>
>   val jobGroupId = "countApprox-" + System.currentTimeMillis()
>   rdd.sparkContext.setJobGroup(jobGroupId, "approx count", interruptOnCancel = true)
>   val approxCount = rdd.countApprox(5000).initialValue // job launched with the set job group
>   rdd.sparkContext.cancelJobGroup(jobGroupId) // cancel the job
>
>
> On Wed, May 13, 2015 at 11:24 AM, Du Li <l...@yahoo-inc.com> wrote:
>
> Hi TD,
>
> Do you know how to cancel the rdd.countApprox(5000) tasks after the timeout? Otherwise it keeps running until completion, producing results that are never used while still consuming resources.
>
> Thanks,
> Du
>
>
> On Wednesday, May 13, 2015 10:33 AM, Du Li <l...@yahoo-inc.com.INVALID> wrote:
>
> Hi TD,
>
> Thanks a lot. rdd.countApprox(5000).initialValue worked! Now my streaming app stands a much better chance of processing each batch within the batch interval.
>
> Du
>
>
> On Tuesday, May 12, 2015 10:31 PM, Tathagata Das <t...@databricks.com> wrote:
>
> From the code it seems that as soon as rdd.countApprox(5000) returns, you can call pResult.initialValue to get the approximate count at that point in time (that is, after the timeout). Calling pResult.getFinalValue() will block further until the job is over, and give the final correct value that you would have received from rdd.count().
>
> On Tue, May 12, 2015 at 5:03 PM, Du Li <l...@yahoo-inc.com.invalid> wrote:
>
> Hi,
>
> I tested the following in my streaming app and hoped to get an approximate count within 5 seconds. However, rdd.countApprox(5000).getFinalValue() seemed to always return only after the job finished completely, just like rdd.count(), which often exceeded 5 seconds. The values for low, mean, and high were the same.
>
>   val pResult = rdd.countApprox(5000)
>   val bDouble = pResult.getFinalValue()
>   logInfo(s"countApprox().getFinalValue(): low=${bDouble.low.toLong}, mean=${bDouble.mean.toLong}, high=${bDouble.high.toLong}")
>
> Can any expert here explain the right way to use it?
>
> Thanks,
> Du
>
>
> On Wednesday, May 6, 2015 7:55 AM, Du Li <l...@yahoo-inc.com.INVALID> wrote:
>
> I have to count RDDs in a Spark streaming app. When the data gets large, count() becomes expensive. Does anybody have experience using countApprox()? How accurate/reliable is it?
>
> The documentation is pretty sparse. I suppose the timeout parameter is in milliseconds. Can I retrieve the count value by calling getFinalValue()? Does it block and return only after the timeout? Or do I need to define onComplete/onFail handlers to extract the count value from the partial results?
>
> Thanks,
> Du
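
Putting the thread's advice together, here is a minimal, untested sketch of the pattern TD describes: run countApprox inside a job group, take whatever estimate is available after the timeout, then cancel the leftover job. It assumes the Spark 1.x SparkContext job-group API (setJobGroup/cancelJobGroup/clearJobGroup) and PartialResult; the helper name, the job-group naming scheme, and the description string are illustrative, not from the thread. Note that, per Du's report above, cancelJobGroup killed his SparkContext in his Spark version, which TD flags as a probable bug.

  import org.apache.spark.rdd.RDD

  // Approximate count with a hard timeout; cancel the still-running job
  // afterwards so it does not keep consuming cluster resources.
  def timedApproxCount(rdd: RDD[_], timeoutMs: Long): Long = {
    val sc = rdd.sparkContext
    val jobGroupId = "countApprox-" + System.currentTimeMillis() // illustrative naming scheme
    // interruptOnCancel = true asks Spark to interrupt the running tasks on cancel
    sc.setJobGroup(jobGroupId, "approximate count", interruptOnCancel = true)
    try {
      // countApprox blocks for at most timeoutMs, then returns a PartialResult
      val estimate = rdd.countApprox(timeoutMs).initialValue
      estimate.mean.toLong // best estimate available at the timeout
    } finally {
      sc.cancelJobGroup(jobGroupId) // stop the count job if it is still running
      sc.clearJobGroup()
    }
  }

In a streaming app this would be called from inside foreachRDD on the same thread that sets the job group, since job groups are tracked per thread.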
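For the onComplete/onFail question in the original post: PartialResult also supports callback-style access, roughly as below. This is again a sketch against the Spark 1.x PartialResult API; logInfo/logError assume the enclosing class mixes in Spark's Logging trait, as Du's own snippet does.

  val pResult = rdd.countApprox(5000)
  logInfo(s"estimate after timeout: ${pResult.initialValue.mean.toLong}")

  // Fires once the exact count is known, possibly well after the timeout.
  pResult.onComplete { bounded =>
    logInfo(s"final count: ${bounded.mean.toLong}")
  }
  pResult.onFail { e =>
    logError("approximate count failed", e)
  }

So getFinalValue() blocking until the exact answer is available is the documented behavior Du observed, and initialValue (or an onComplete handler) is the way to consume the estimate without waiting for the full job.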