Now add these jars to the lib folder of the streaming project, as well as to
the jar list passed to the StreamingContext object:
https://www.dropbox.com/sh/00sy9mv8qsefwc1/vsEXF0aHsJ
These are the Algebird jars.

This link contains the Algebird Scala code for streaming uniques:
https://www.dropbox.com/s/ydyn7kd75hhnnpo/Algebird.scala
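As a rough illustration of the pattern the linked Algebird code relies on: streaming uniques work because the sketch type forms a monoid (an empty value plus an associative merge), so per-record summaries can be reduced within and across batches. The sketch below is a simplification, not the linked code: an exact Set stands in where Algebird would use a HyperLogLog sketch, and all names are illustrative.

```scala
// Simplified model of Algebird-style streaming uniques.
// An exact Set[String] stands in for the HyperLogLog sketch; both are
// monoids, which is what lets summaries be merged with reduce().

// Summarize one record into a one-element "sketch".
def create(word: String): Set[String] = Set(word)

// Monoid merge; with HyperLogLog this is an approximate union.
def merge(a: Set[String], b: Set[String]): Set[String] = a ++ b

// Collapse a batch to one summary and read off the unique count,
// as a mapPartitions + reduce would do over a DStream batch.
def uniquesIn(batch: Seq[String]): Int =
  batch.map(create).foldLeft(Set.empty[String])(merge).size
```

With Algebird, the summaries would come from a `HyperLogLogMonoid` instead, trading exactness for constant memory per batch.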


Mayur Rustagi
Ph: +919632149971
http://www.sigmoidanalytics.com
https://twitter.com/mayur_rustagi



On Mon, Jan 27, 2014 at 11:00 PM, Mayur Rustagi <mayur.rust...@gmail.com> wrote:

> I can help you set up streaming with Algebird for uniques. I suppose you
> can extend that to top-K using Algebird functions.
> First, why don't you set up Spark Streaming on your machine using this guide:
>
> http://docs.sigmoidanalytics.com/index.php/Running_A_Simple_Streaming_Job_in_Local_Machine
> Then lemme rummage around for my algebird codebase.
> Regards
> Mayur
>
>
> Mayur Rustagi
> Ph: +919632149971
> http://www.sigmoidanalytics.com
> https://twitter.com/mayur_rustagi
>
>
>
> On Mon, Jan 27, 2014 at 10:52 PM, dachuan <hdc1...@gmail.com> wrote:
>
>> This email, which includes my questions about Spark Streaming, is
>> forwarded from the user@ mailing list. Sorry about this; I haven't
>> gotten any reply yet.
>>
>> thanks,
>> dachuan.
>>
>>
>> ---------- Forwarded message ----------
>> From: dachuan <hdc1...@gmail.com>
>> Date: Fri, Jan 24, 2014 at 10:28 PM
>> Subject: real world streaming code
>> To: u...@spark.incubator.apache.org
>>
>>
>> Hello, community,
>>
>> I have three questions about Spark Streaming.
>>
>> 1,
>> I noticed an interesting phenomenon in one streaming example
>> (StatefulNetworkWordCount): this workload only prints the first 10 rows of
>> the final RDD. This means that if the data influx rate is fast enough
>> (much faster than typing at a keyboard), the final RDD will have more than
>> one partition; assume it has 2 partitions. The second partition won't be
>> computed at all, because the first partition suffices to serve the first
>> 10 rows. However, the workload must still checkpoint that RDD. This leads
>> to a very time-consuming checkpoint process, because checkpointing the
>> second partition can only start after it is computed. So, is this workload
>> designed only for demonstration purposes, e.g. only for a one-partition
>> RDD?
>>
>> (I have attached a figure to illustrate what I've said; please tell me if
>> the mailing list doesn't welcome attachments.
>> A short description about the experiment
>> Hardware specs: 4 cores
>> Software specs: spark local cluster, 5 executors (workers), each one has
>> one core, each executor has 1G memory
>> Data influx speed: 3MB/s
>> Data source: one ServerSocket serving a local file
>> Streaming App's name: StatefulNetworkWordCount
>> Job generation frequency: one job per second
>> Checkpoint time: once per 10s
>> JobManager.numThreads = 2)
>>
>>
>>
>> (And another workload might have the same problem:
>> PageViewStream's slidingPageCounts)
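To make the scenario in question 1 concrete: the reason only the first partition is computed is that take(10) pulls rows lazily and stops once it has 10. The snippet below is a plain-Scala imitation of that behavior, not Spark's actual code; all names are illustrative.

```scala
// Plain-Scala imitation of take(10) over a 2-partition RDD.
// Partitions are modeled as lazy iterators; take() stops pulling once
// it has enough rows, so the second partition is never forced.
var forced = Set.empty[Int] // records which partitions were computed

def partition(id: Int): Iterator[String] =
  Iterator.tabulate(20) { i => forced += id; s"p$id-row$i" } // lazy per element

// Two partitions concatenated lazily, like an RDD's partition order.
val first10 = Iterator(0, 1).flatMap(partition).take(10).toList
// forced now contains only partition 0. A checkpoint, by contrast,
// must write every partition, which forces the second one to be computed.
```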
>>
>> 2,
>> Does anybody have the source code for a Top-K wordcount streaming job?
>>
>> 3,
>> Can anybody share a real-world streaming example, for instance with
>> source code and cluster configuration details?
>>
>> thanks,
>> dachuan.
>>
>> --
>> Dachuan Huang
>> Cellphone: 614-390-7234
>> 2015 Neil Avenue
>> Ohio State University
>> Columbus, Ohio
>> U.S.A.
>> 43210
>>
>>
>>
>>
>
>
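On question 2 above (top-K wordcount): the counting and top-K steps can be sketched in plain Scala with exact counts. In an Algebird version the count map would be replaced by a mergeable approximate structure (e.g. a SketchMap) to keep state bounded; the names below are illustrative, not from any real codebase.

```scala
// Count words in one batch (the kind of state a stateful word count holds).
def wordCounts(words: Seq[String]): Map[String, Int] =
  words.groupBy(identity).map { case (w, ws) => (w, ws.size) }

// Monoid-style merge of two count maps, as when combining batches
// or partitions.
def mergeCounts(a: Map[String, Int], b: Map[String, Int]): Map[String, Int] =
  b.foldLeft(a) { case (acc, (w, n)) => acc.updated(w, acc.getOrElse(w, 0) + n) }

// Keep the K most frequent words.
def topK(counts: Map[String, Int], k: Int): Seq[(String, Int)] =
  counts.toSeq.sortBy { case (_, n) => -n }.take(k)
```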
