Now add these jars to the lib folder of the streaming project, as well as to the StreamingContext's jar list: https://www.dropbox.com/sh/00sy9mv8qsefwc1/vsEXF0aHsJ

These are the Algebird jars.
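For orientation, the core of counting uniques with Algebird is its HyperLogLog monoid: each element becomes a small sketch, and the monoid's sum merges sketches. Below is a minimal standalone sketch of that idea (the object name, the sample ids, and the 12-bit precision are illustrative assumptions, not part of the linked code; the actual DStream wiring lives in the Algebird.scala file linked below):

```scala
import com.twitter.algebird.{HLL, HyperLogLogMonoid}

object UniquesSketch {
  def main(args: Array[String]): Unit = {
    // 12 bits of precision gives roughly 1.6% standard error.
    val hllMonoid = new HyperLogLogMonoid(12)

    val userIds = Seq("u1", "u2", "u1", "u3", "u2", "u4")

    // Turn each id into a small HLL sketch, then merge them with the monoid.
    val merged: HLL = hllMonoid.sum(userIds.map(id => hllMonoid(id.getBytes("UTF-8"))))

    // estimatedSize is approximate; the exact answer here is 4.
    println(s"approx uniques = ${merged.estimatedSize}")
  }
}
```

In a streaming job the same monoid works per batch: map each record to a sketch, reduce with the monoid, and fold the per-batch sketches into a running total for uniques over the whole stream.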
This also contains the Algebird Scala code for streaming uniques: https://www.dropbox.com/s/ydyn7kd75hhnnpo/Algebird.scala

Mayur Rustagi
Ph: +919632149971
http://www.sigmoidanalytics.com
https://twitter.com/mayur_rustagi


On Mon, Jan 27, 2014 at 11:00 PM, Mayur Rustagi <mayur.rust...@gmail.com> wrote:

> I can help you set up streaming with Algebird for uniques. I suppose you
> can extend that to top-K using Algebird functions.
> First, why don't you set up Spark Streaming on your machine using this guide:
> http://docs.sigmoidanalytics.com/index.php/Running_A_Simple_Streaming_Job_in_Local_Machine
> Then let me rummage around for my Algebird codebase.
> Regards,
> Mayur
>
> Mayur Rustagi
> Ph: +919632149971
> http://www.sigmoidanalytics.com
> https://twitter.com/mayur_rustagi
>
>
> On Mon, Jan 27, 2014 at 10:52 PM, dachuan <hdc1...@gmail.com> wrote:
>
>> This email, which includes my questions about Spark Streaming, is
>> forwarded from the user@ mailing list. Sorry about this, but I haven't
>> received any reply yet.
>>
>> thanks,
>> dachuan.
>>
>>
>> ---------- Forwarded message ----------
>> From: dachuan <hdc1...@gmail.com>
>> Date: Fri, Jan 24, 2014 at 10:28 PM
>> Subject: real world streaming code
>> To: u...@spark.incubator.apache.org
>>
>>
>> Hello, community,
>>
>> I have three questions about Spark Streaming.
>>
>> 1.
>> I noticed an interesting phenomenon in one streaming example
>> (StatefulNetworkWordCount): since this workload only prints the first 10
>> rows of the final RDD, if the data influx rate is fast enough (much
>> faster than typing on a keyboard), the final RDD will have more than one
>> partition. Assume it has 2 partitions; the second partition won't be
>> computed at all, because the first partition suffices to serve the first
>> 10 rows. However, these workloads must still checkpoint that RDD.
>> This would lead to a very time-consuming checkpoint process, because the
>> checkpoint of the second partition can only start after it is computed.
>> So, is this workload designed only for demonstration purposes, e.g. only
>> for a single-partition RDD?
>>
>> (I have attached a figure to illustrate what I've said; please tell me
>> if the mailing list doesn't welcome attachments.
>> A short description of the experiment:
>> Hardware specs: 4 cores
>> Software specs: Spark local cluster, 5 executors (workers), each with
>> one core and 1 GB of memory
>> Data influx speed: 3 MB/s
>> Data source: one ServerSocket serving a local file
>> Streaming app's name: StatefulNetworkWordCount
>> Job generation frequency: one job per second
>> Checkpoint interval: once per 10 s
>> JobManager.numThreads = 2)
>>
>> (Another workload might have the same problem: PageViewStream's
>> slidingPageCounts.)
>>
>> 2.
>> Does anybody have Top-K word-count streaming source code?
>>
>> 3.
>> Can anybody share a real-world streaming example, including source code
>> and cluster configuration details?
>>
>> thanks,
>> dachuan.
>>
>> --
>> Dachuan Huang
>> Cellphone: 614-390-7234
>> 2015 Neil Avenue
>> Ohio State University
>> Columbus, Ohio
>> U.S.A.
>> 43210
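On question 2, a simple exact top-K word count can be built from reduceByKeyAndWindow plus a per-batch top(). This is only a sketch of the idea, not a tested program from the thread: the object name, host/port, window sizes, and checkpoint directory are all placeholder assumptions. For very large key spaces, Algebird's Count-Min Sketch (TopCMS) is the usual way to get an approximate top-K in bounded memory instead of keeping the full count map.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object TopKWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[4]").setAppName("TopKWordCount")
    val ssc  = new StreamingContext(conf, Seconds(1))
    // Required because the inverse-function form of reduceByKeyAndWindow
    // keeps state across batches.
    ssc.checkpoint("/tmp/topk-checkpoint")

    val lines = ssc.socketTextStream("localhost", 9999)

    // Word counts over a 30 s window, sliding every 10 s; the inverse
    // function (_ - _) subtracts the batch that falls out of the window.
    val counts = lines.flatMap(_.split("\\s+"))
      .map((_, 1))
      .reduceByKeyAndWindow(_ + _, _ - _, Seconds(30), Seconds(10))

    // Exact top-10 per window, ordered by count.
    counts.foreachRDD { rdd =>
      val top10 = rdd.top(10)(Ordering.by[(String, Int), Int](_._2))
      println(top10.mkString(", "))
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```

To try it locally, run `nc -lk 9999` in another terminal and type words; each window's top 10 is printed on the driver.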