I can help you setup streaming with algebird for Uniques.. I suppose you can extend that to top K using algebird functions. First why dont you setup spark streaming on your machine using this guide: http://docs.sigmoidanalytics.com/index.php/Running_A_Simple_Streaming_Job_in_Local_Machine Then lemme rummage around for my algebird codebase. Regards Mayur
Mayur Rustagi Ph: +919632149971 h <https://twitter.com/mayur_rustagi>ttp://www.sigmoidanalytics.com https://twitter.com/mayur_rustagi On Mon, Jan 27, 2014 at 10:52 PM, dachuan <hdc1...@gmail.com> wrote: > This email, which includes my questions about spark streaming, is > forwarded from user@mailing-list. Sorry about this, because I haven't got > any reply yet. > > thanks, > dachuan. > > > ---------- Forwarded message ---------- > From: dachuan <hdc1...@gmail.com> > Date: Fri, Jan 24, 2014 at 10:28 PM > Subject: real world streaming code > To: u...@spark.incubator.apache.org > > > Hello, community, > > I have three questions about spark streaming. > > 1, > I noticed that one streaming example (StatefulNetworkWordCount) has one > interesting phenomenon: > since this workload only prints the first 10 rows of the final RDD, this > means if the data influx rate is fast enough (much faster than hand typing > in keyboard), then the final RDD would have more than one partition, assume > it's 2 partitions, but the second partition won't be computed at all > because the first partition suffice to serve the first 10 rows. However, > these two workloads must make checkpoint to that RDD. This would lead to a > very time consuming checkpoint process because the checkpoint to the second > partition can only start before it is computed. So, is this workload only > designed for demonstration purpose, for example, only designed for one > partition RDD? > > (I have attached a figure to illustrate what I've said, please tell me if > mailing list doesn't welcome attachment. > A short description about the experiment > Hardware specs: 4 cores > Software specs: spark local cluster, 5 executors (workers), each one has > one core, each executor has 1G memory > Data influx speed: 3MB/s > Data source: one ServerSocket in local file > Streaming App's name: StatefulNetworkWordCount > Job generation frequency: one job per second > Checkpoint time: once per 10s > JobManager.numThreads = 2) > > > > (And another workload might have the same problem: > PageViewStream's slidingPageCounts) > > 2, > Does anybody have a Top-K wordcount streaming source code? > > 3, > Can anybody share your real world streaming example? for example, > including source code, and cluster configuration details? > > thanks, > dachuan. > > -- > Dachuan Huang > Cellphone: 614-390-7234 > 2015 Neil Avenue > Ohio State University > Columbus, Ohio > U.S.A. > 43210 > > > > -- > Dachuan Huang > Cellphone: 614-390-7234 > 2015 Neil Avenue > Ohio State University > Columbus, Ohio > U.S.A. > 43210 >