From the code it seems that as soon as "rdd.countApprox(5000)" returns, you
can call "pResult.initialValue" to get the approximate count at that point
in time (that is, after the timeout). Calling "pResult.getFinalValue()" will
block further until the job is over, and give the final correct value that
you would have received from "rdd.count()".
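
For example, here is a minimal sketch of that usage (assuming an existing
SparkContext "sc"; the RDD and the numbers are made up for illustration):

import org.apache.spark.partial.{BoundedDouble, PartialResult}

val rdd = sc.parallelize(1 to 10000000, numSlices = 100)

// Returns within roughly 5000 ms with whatever has been computed so far;
// the second parameter, confidence, defaults to 0.95.
val pResult: PartialResult[BoundedDouble] = rdd.countApprox(5000)

// Approximate count at the moment the timeout fired; does not block.
val approx = pResult.initialValue
println(s"approx: low=${approx.low.toLong} mean=${approx.mean.toLong} high=${approx.high.toLong}")

// Blocks until the job is done; low, mean, and high then all equal the exact count.
val exact = pResult.getFinalValue()
println(s"exact count = ${exact.mean.toLong}")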

On Tue, May 12, 2015 at 5:03 PM, Du Li <l...@yahoo-inc.com.invalid> wrote:

> Hi,
>
> I tested the following in my streaming app, hoping to get an approximate
> count within 5 seconds. However, rdd.countApprox(5000).getFinalValue()
> seemed to always return only after the count had finished completely, just
> like rdd.count(), and often took more than 5 seconds. The values for low,
> mean, and high were all the same.
>
> val pResult = rdd.countApprox(5000)
> val bDouble = pResult.getFinalValue()
> logInfo(s"countApprox().getFinalValue(): low=${bDouble.low.toLong},
> mean=${bDouble.mean.toLong}, high=${bDouble.high.toLong}")
>
> Can any expert here explain the right way to use it?
>
> Thanks,
> Du
>
>   On Wednesday, May 6, 2015 7:55 AM, Du Li <l...@yahoo-inc.com.INVALID>
> wrote:
>
>
> I have to count RDDs in a Spark Streaming app. When the data gets large,
> count() becomes expensive. Does anybody have experience using countApprox()?
> How accurate/reliable is it?
>
> The documentation is pretty sparse. I suppose the timeout parameter is in
> milliseconds. Can I retrieve the count value by calling getFinalValue()?
> Does it block and return only after the timeout? Or do I need to define
> onComplete/onFail handlers to extract the count value from the partial
> result?
>
> Thanks,
> Du
>
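
On the onComplete/onFail part of the question: you do not need the handlers
just to read the partial result ("initialValue" covers that), but they are
available if you would rather be notified asynchronously than block on
"getFinalValue()". A small sketch of the callback style, reusing the "rdd"
from the example above:

val p = rdd.countApprox(5000)
// Called once the job has completed and the exact count is known.
p.onComplete(bd => println(s"final count = ${bd.mean.toLong}"))
// Called if the job fails with an exception.
p.onFail(e => println(s"count failed: ${e.getMessage}"))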
