That is not supposed to happen :/ That is probably a bug. If you have the log4j logs, it would be good to file a JIRA. This may be worth debugging.
On Wed, May 13, 2015 at 12:13 PM, Du Li <l...@yahoo-inc.com> wrote:

> Actually I tried that before asking. However, it killed the spark context. :-)
>
> Du
>
>
> On Wednesday, May 13, 2015 12:02 PM, Tathagata Das <t...@databricks.com> wrote:
>
> That is a good question. I don't see a direct way to do that.
>
> You could try the following:
>
>   val jobGroupId = "countApprox-" + System.currentTimeMillis()
>   rdd.sparkContext.setJobGroup(jobGroupId, "approx count", interruptOnCancel = true)
>   val approxCount = rdd.countApprox(5000).initialValue // job launched with the set job group
>   rdd.sparkContext.cancelJobGroup(jobGroupId) // cancel the job
>
>
> On Wed, May 13, 2015 at 11:24 AM, Du Li <l...@yahoo-inc.com> wrote:
>
> Hi TD,
>
> Do you know how to cancel the rdd.countApprox(5000) tasks after the timeout? Otherwise it keeps running until completion, producing results that are never used while still consuming resources.
>
> Thanks,
> Du
>
>
> On Wednesday, May 13, 2015 10:33 AM, Du Li <l...@yahoo-inc.com.INVALID> wrote:
>
> Hi TD,
>
> Thanks a lot. rdd.countApprox(5000).initialValue worked! Now my streaming app stands a much better chance of processing each batch within the batch interval.
>
> Du
>
>
> On Tuesday, May 12, 2015 10:31 PM, Tathagata Das <t...@databricks.com> wrote:
>
> From the code it seems that as soon as rdd.countApprox(5000) returns, you can call pResult.initialValue to get the approximate count at that point in time (that is, after the timeout). Calling pResult.getFinalValue() will block further until the job is over, and give the final correct value that you would have received from rdd.count().
>
> On Tue, May 12, 2015 at 5:03 PM, Du Li <l...@yahoo-inc.com.invalid> wrote:
>
> Hi,
>
> I tested the following in my streaming app and hoped to get an approximate count within 5 seconds. However, rdd.countApprox(5000).getFinalValue() seemed to always return only after the job finished completely, just like rdd.count(), which often exceeded 5 seconds. The values for low, mean, and high were the same.
>
>   val pResult = rdd.countApprox(5000)
>   val bDouble = pResult.getFinalValue()
>   logInfo(s"countApprox().getFinalValue(): low=${bDouble.low.toLong}, mean=${bDouble.mean.toLong}, high=${bDouble.high.toLong}")
>
> Can any expert here explain the right way to use it?
>
> Thanks,
> Du
>
>
> On Wednesday, May 6, 2015 7:55 AM, Du Li <l...@yahoo-inc.com.INVALID> wrote:
>
> I have to count RDDs in a Spark streaming app. When the data gets large, count() becomes expensive. Does anybody have experience using countApprox()? How accurate/reliable is it?
>
> The documentation is pretty sparse. I suppose the timeout parameter is in milliseconds. Can I retrieve the count value by calling getFinalValue()? Does it block and return only after the timeout? Or do I need to define onComplete/onFail handlers to extract the count value from the partial results?
>
> Thanks,
> Du
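
Putting the thread's advice together, here is a minimal, untested sketch of the pattern TD describes: run countApprox inside a job group, take whatever estimate is available after the timeout, then cancel the leftover job. It assumes the Spark 1.x SparkContext job-group API (setJobGroup/cancelJobGroup/clearJobGroup) and PartialResult; the helper name, the job-group naming scheme, and the description string are illustrative, not from the thread. Note that, per Du's report above, cancelJobGroup killed his SparkContext in his Spark version, which TD flags as a probable bug.

  import org.apache.spark.rdd.RDD

  // Approximate count with a hard timeout; cancel the still-running job
  // afterwards so it does not keep consuming cluster resources.
  def timedApproxCount(rdd: RDD[_], timeoutMs: Long): Long = {
    val sc = rdd.sparkContext
    val jobGroupId = "countApprox-" + System.currentTimeMillis() // illustrative naming scheme
    // interruptOnCancel = true asks Spark to interrupt the running tasks on cancel
    sc.setJobGroup(jobGroupId, "approximate count", interruptOnCancel = true)
    try {
      // countApprox blocks for at most timeoutMs, then returns a PartialResult
      val estimate = rdd.countApprox(timeoutMs).initialValue
      estimate.mean.toLong // best estimate available at the timeout
    } finally {
      sc.cancelJobGroup(jobGroupId) // stop the count job if it is still running
      sc.clearJobGroup()
    }
  }

In a streaming app this would be called from inside foreachRDD on the same thread that sets the job group, since job groups are tracked per thread.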
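For the onComplete/onFail question in the original post: PartialResult also supports callback-style access, roughly as below. This is again a sketch against the Spark 1.x PartialResult API; logInfo/logError assume the enclosing class mixes in Spark's Logging trait, as Du's own snippet does.

  val pResult = rdd.countApprox(5000)
  logInfo(s"estimate after timeout: ${pResult.initialValue.mean.toLong}")

  // Fires once the exact count is known, possibly well after the timeout.
  pResult.onComplete { bounded =>
    logInfo(s"final count: ${bounded.mean.toLong}")
  }
  pResult.onFail { e =>
    logError("approximate count failed", e)
  }

So getFinalValue() blocking until the exact answer is available is the documented behavior Du observed, and initialValue (or an onComplete handler) is the way to consume the estimate without waiting for the full job.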