On 23 Aug 2016, at 17:58, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

In general, depending on what you are doing, you can tighten the above parameters. For
example, if you are using Spark Streaming for anti-fraud detection, you may
stream data in at a 2-second batch interval, keep your window length at 4
seconds and your sliding interval at 2 seconds, which gives you a fairly tight
streaming setup. You are aggregating the data that you collect over the batch
window.
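
A minimal sketch of that configuration, assuming a socket source and per-key
counts as a stand-in for a real fraud signal (source, host and port are
hypothetical):

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  val conf = new SparkConf().setAppName("FraudWindowSketch")
  val ssc = new StreamingContext(conf, Seconds(2))      // 2s batch interval

  // Placeholder source; key events by the first CSV field (e.g. account id).
  val events = ssc.socketTextStream("localhost", 9999)
  val perAccount = events.map(line => (line.split(",")(0), 1L))

  // Aggregate over a 4-second window, recomputed every 2 seconds.
  val counts = perAccount.reduceByKeyAndWindow(
    (a: Long, b: Long) => a + b, Seconds(4), Seconds(2))
  counts.print()

  ssc.start()
  ssc.awaitTermination()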

I should warn that in https://github.com/apache/spark/pull/14731 I've been
trying to speed up input scanning against object stores, collecting numbers
along the way:

*if you are using the FileInputDStream to scan s3, azure (and presumably gcs)
object stores for data, the time to scan a moderately complex directory tree is
going to be measurable in seconds*

It's going to depend on your distance from the object store and the number of
files, but you'll probably need to use a bigger window.
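
For illustration, a sketch of a file-based stream against an object store with
a more generous batch interval; the bucket, path and interval here are
hypothetical, the point is giving each directory scan room to finish well
within one batch:

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  val conf = new SparkConf().setAppName("S3FileStreamSketch")
  val ssc = new StreamingContext(conf, Seconds(30))   // generous batch interval

  // textFileStream uses FileInputDStream underneath to poll for new files;
  // against s3a:// that poll is where the per-batch scan cost shows up.
  val lines = ssc.textFileStream("s3a://my-bucket/incoming/")
  lines.count().print()

  ssc.start()
  ssc.awaitTermination()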

(That patch for SPARK-17159 should improve things. I'd love some people to
help by testing it, or by emailing me directly with an (anonymised) description
of the directory structures they use in object store FileInputDStream streams,
so that I could regenerate them for inclusion in some performance tests.)

