Re: real world streaming code

Mayur Rustagi Mon, 27 Jan 2014 09:48:37 -0800

I can help you setup streaming with algebird for Uniques.. I suppose you
can extend that to top K using algebird functions.
First why dont you setup spark streaming on your machine using this guide:
http://docs.sigmoidanalytics.com/index.php/Running_A_Simple_Streaming_Job_in_Local_Machine
Then lemme rummage around for my algebird codebase.
Regards
Mayur



Mayur Rustagi
Ph: +919632149971
h <https://twitter.com/mayur_rustagi>ttp://www.sigmoidanalytics.com
https://twitter.com/mayur_rustagi



On Mon, Jan 27, 2014 at 10:52 PM, dachuan <hdc1...@gmail.com> wrote:

> This email, which includes my questions about spark streaming, is
> forwarded from user@mailing-list. Sorry about this, because I haven't got
> any reply yet.
>
> thanks,
> dachuan.
>
>
> ---------- Forwarded message ----------
> From: dachuan <hdc1...@gmail.com>
> Date: Fri, Jan 24, 2014 at 10:28 PM
> Subject: real world streaming code
> To: u...@spark.incubator.apache.org
>
>
> Hello, community,
>
> I have three questions about spark streaming.
>
> 1,
> I noticed that one streaming example (StatefulNetworkWordCount) has one
> interesting phenomenon:
> since this workload only prints the first 10 rows of the final RDD, this
> means if the data influx rate is fast enough (much faster than hand typing
> in keyboard), then the final RDD would have more than one partition, assume
> it's 2 partitions, but the second partition won't be computed at all
> because the first partition suffice to serve the first 10 rows. However,
> these two workloads must make checkpoint to that RDD. This would lead to a
> very time consuming checkpoint process because the checkpoint to the second
> partition can only start before it is computed. So, is this workload only
> designed for demonstration purpose, for example, only designed for one
> partition RDD?
>
> (I have attached a figure to illustrate what I've said, please tell me if
> mailing list doesn't welcome attachment.
> A short description about the experiment
> Hardware specs: 4 cores
> Software specs: spark local cluster, 5 executors (workers), each one has
> one core, each executor has 1G memory
> Data influx speed: 3MB/s
> Data source: one ServerSocket in local file
> Streaming App's name: StatefulNetworkWordCount
> Job generation frequency: one job per second
> Checkpoint time: once per 10s
> JobManager.numThreads = 2)
>
>
>
> (And another workload might have the same problem:
> PageViewStream's slidingPageCounts)
>
> 2,
> Does anybody have a Top-K wordcount streaming source code?
>
> 3,
> Can anybody share your real world streaming example? for example,
> including source code, and cluster configuration details?
>
> thanks,
> dachuan.
>
> --
> Dachuan Huang
> Cellphone: 614-390-7234
> 2015 Neil Avenue
> Ohio State University
> Columbus, Ohio
> U.S.A.
> 43210
>
>
>
> --
> Dachuan Huang
> Cellphone: 614-390-7234
> 2015 Neil Avenue
> Ohio State University
> Columbus, Ohio
> U.S.A.
> 43210
>

Re: real world streaming code

Reply via email to