Thanks, Mayur. I will read this workload's code first.
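For reference, the Top-K logic asked about in the quoted thread below can be sketched in a few lines of plain Scala. This is a hypothetical, exact per-batch version (the function name `topK` is made up for illustration); Algebird's `TopKMonoid` is the approximate, mergeable counterpart you would use to combine results across micro-batches:

```scala
// Exact top-K word counts for one batch of words (hypothetical helper).
// In a Spark Streaming job this logic would typically run inside
// transform/foreachRDD over each micro-batch.
def topK(words: Seq[String], k: Int): List[(String, Int)] =
  words
    .groupBy(identity)                    // word -> all occurrences
    .map { case (w, ws) => (w, ws.size) } // word -> count
    .toList
    .sortBy { case (w, c) => (-c, w) }    // highest count first, ties alphabetical
    .take(k)
```

The key design point for streaming is that exact top-K is not mergeable across partitions or batches without keeping all counts, which is why Algebird's approximate monoid structures are the usual fit here.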
On Mon, Jan 27, 2014 at 12:37 PM, Mayur Rustagi <mayur.rust...@gmail.com> wrote:

> Now add these jars to the lib folder of the streaming project, as well as
> to the jar list of the StreamingContext object:
> https://www.dropbox.com/sh/00sy9mv8qsefwc1/vsEXF0aHsJ
> These are the Algebird jars.
>
> This also contains the Algebird Scala code for streaming uniques:
> https://www.dropbox.com/s/ydyn7kd75hhnnpo/Algebird.scala
>
>
> Mayur Rustagi
> Ph: +919632149971
> http://www.sigmoidanalytics.com
> https://twitter.com/mayur_rustagi
>
>
> On Mon, Jan 27, 2014 at 11:00 PM, Mayur Rustagi <mayur.rust...@gmail.com> wrote:
>
> > I can help you set up streaming with Algebird for uniques. I suppose you
> > can extend that to top-K using Algebird functions.
> > First, why don't you set up Spark Streaming on your machine using this
> > guide:
> > http://docs.sigmoidanalytics.com/index.php/Running_A_Simple_Streaming_Job_in_Local_Machine
> > Then let me rummage around for my Algebird codebase.
> > Regards,
> > Mayur
> >
> > Mayur Rustagi
> > Ph: +919632149971
> > http://www.sigmoidanalytics.com
> > https://twitter.com/mayur_rustagi
> >
> >
> > On Mon, Jan 27, 2014 at 10:52 PM, dachuan <hdc1...@gmail.com> wrote:
> >
> >> This email, which includes my questions about Spark Streaming, is
> >> forwarded from the user@ mailing list. Sorry about this, but I haven't
> >> gotten any reply yet.
> >>
> >> thanks,
> >> dachuan.
> >>
> >>
> >> ---------- Forwarded message ----------
> >> From: dachuan <hdc1...@gmail.com>
> >> Date: Fri, Jan 24, 2014 at 10:28 PM
> >> Subject: real world streaming code
> >> To: u...@spark.incubator.apache.org
> >>
> >>
> >> Hello, community,
> >>
> >> I have three questions about Spark Streaming.
> >>
> >> 1,
> >> I noticed that one streaming example (StatefulNetworkWordCount) has an
> >> interesting phenomenon:
> >> since this workload only prints the first 10 rows of the final RDD, if
> >> the data influx rate is fast enough (much faster than typing by hand),
> >> the final RDD will have more than one partition. Assume it has 2
> >> partitions: the second partition won't be computed at all, because the
> >> first partition suffices to serve the first 10 rows. However, the
> >> workload must still checkpoint that RDD, which leads to a very
> >> time-consuming checkpoint, because checkpointing the second partition
> >> can only start after that partition has been computed. So, is this
> >> workload designed for demonstration purposes only, for example, only
> >> for single-partition RDDs?
> >>
> >> (I have attached a figure to illustrate this; please tell me if the
> >> mailing list doesn't welcome attachments.
> >> A short description of the experiment:
> >> Hardware specs: 4 cores
> >> Software specs: Spark local cluster, 5 executors (workers), each with
> >> one core and 1 GB of memory
> >> Data influx speed: 3 MB/s
> >> Data source: one ServerSocket serving a local file
> >> Streaming app's name: StatefulNetworkWordCount
> >> Job generation frequency: one job per second
> >> Checkpoint interval: once per 10 s
> >> JobManager.numThreads = 2)
> >>
> >> (Another workload might have the same problem:
> >> PageViewStream's slidingPageCounts.)
> >>
> >> 2,
> >> Does anybody have Top-K word count streaming source code?
> >>
> >> 3,
> >> Can anybody share a real-world streaming example, for example,
> >> including source code and cluster configuration details?
> >>
> >> thanks,
> >> dachuan.
> >>
> >> --
> >> Dachuan Huang
> >> Cellphone: 614-390-7234
> >> 2015 Neil Avenue
> >> Ohio State University
> >> Columbus, Ohio
> >> U.S.A.
> >> 43210
> >>
> >

--
Dachuan Huang
Cellphone: 614-390-7234
2015 Neil Avenue
Ohio State University
Columbus, Ohio
U.S.A.
43210
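As background on question 1: the state that StatefulNetworkWordCount checkpoints comes from an update function passed to `updateStateByKey`, which folds each micro-batch's new counts for a key into that key's running total. A pure-Scala sketch of that per-key logic, with a hypothetical `step` driver standing in for what `updateStateByKey` does across keys each batch:

```scala
// The per-key update: fold this batch's new counts into the running total.
// This mirrors the update function StatefulNetworkWordCount supplies to
// updateStateByKey (values.sum + previous state, wrapped in Some to keep the key).
def updateRunningCount(newValues: Seq[Int], state: Option[Int]): Option[Int] =
  Some(newValues.sum + state.getOrElse(0))

// Hypothetical stand-in for one micro-batch step: apply the update function
// to every key seen so far or arriving in this batch, the way
// updateStateByKey does per interval.
def step(state: Map[String, Int], batch: Seq[String]): Map[String, Int] = {
  val perKey = batch.groupBy(identity).map { case (w, ws) => (w, ws.map(_ => 1)) }
  val keys = state.keySet ++ perKey.keySet
  keys.map { w =>
    w -> updateRunningCount(perKey.getOrElse(w, Seq.empty), state.get(w)).get
  }.toMap
}
```

Because the state RDD produced this way is checkpointed in full, every partition must be materialized at checkpoint time even if `print()` only ever pulled rows from the first one, which is the cost the question is pointing at.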
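And for the Algebird-based streaming uniques Mayur mentions: the essential operation is a mergeable distinct-count sketch. The version below is an exact stand-in using a `Set` (the `UniqueSketch` name is made up for illustration); Algebird's `HyperLogLogMonoid` replaces the `Set` with a constant-size approximate structure that supports the same merge operation, which is what lets it combine across micro-batches and window reductions:

```scala
// Exact running-uniques "sketch": add batches of ids, merge sketches,
// read off the distinct count. HyperLogLog offers the same add/merge/
// estimate interface in constant space, at the cost of a small error.
final case class UniqueSketch(ids: Set[String]) {
  def add(batch: Seq[String]): UniqueSketch = UniqueSketch(ids ++ batch)
  def merge(other: UniqueSketch): UniqueSketch = UniqueSketch(ids ++ other.ids)
  def estimate: Int = ids.size
}

object UniqueSketch {
  val empty: UniqueSketch = UniqueSketch(Set.empty)
}
```

The merge operation being associative with an identity (`empty`) is exactly the monoid property the Algebird jars provide for HyperLogLog, so per-partition sketches can be reduced in any order.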