Another alternative could be to use Spark Streaming's textFileStream with its 
windowing capabilities.
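
A minimal sketch, assuming the log files land in a directory that Spark 
Streaming can monitor (the path, batch interval, and window length below are 
placeholders):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Minutes, StreamingContext}

    object WindowedLogStats {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("WindowedLogStats")
        // One batch per minute, matching the per-minute granularity.
        val ssc = new StreamingContext(conf, Minutes(1))

        // Pick up files as they appear in the monitored directory.
        val lines = ssc.textFileStream("hdfs:///logs/incoming")

        // Recompute the stats every minute over the last 10 minutes of data.
        val windowed = lines.window(Minutes(10), Minutes(1))
        windowed.count().print()

        ssc.start()
        ssc.awaitTermination()
      }
    }

Note that textFileStream only processes files created after the context 
starts, which is why already-collected data needs a different approach, as 
Gianluca points out below.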



On Friday, July 4, 2014 9:52 AM, Gianluca Privitera 
<gianluca.privite...@studio.unibo.it> wrote:
 


You should think about a custom receiver in order to solve the problem of the 
“already collected” data:
http://spark.apache.org/docs/latest/streaming-custom-receivers.html
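
A rough sketch of such a receiver (the class name and replay logic are 
hypothetical; it simply feeds lines from previously collected files into the 
stream):

    import java.io.File
    import scala.io.Source
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.receiver.Receiver

    // Replays lines from files collected before the streaming job started.
    class CollectedFilesReceiver(dir: String)
      extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

      def onStart(): Unit = {
        new Thread("collected-files-reader") {
          override def run(): Unit = {
            for (f <- new File(dir).listFiles() if f.isFile) {
              val src = Source.fromFile(f)
              try src.getLines().foreach(line => store(line))
              finally src.close()
            }
          }
        }.start()
      }

      def onStop(): Unit = {} // the reader thread exits after replaying
    }

It would then be plugged in with ssc.receiverStream(new 
CollectedFilesReceiver("/data/logs")) and windowed like any other DStream.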

Gianluca



On 04 Jul 2014, at 15:46, alessandro finamore <alessandro.finam...@polito.it> 
wrote:

Hi,
>
>I have a large dataset of text log files on which I need to implement
>"window analysis":
>say, extract per-minute data and compute aggregate stats over the last X minutes.
>
>I have to implement the windowing analysis with Spark.
>This is the workflow I'm currently using:
>- read a file and create a new RDD with per-minute info
>- loop on each new minute and merge its data into another data structure
>containing the last X minutes of data
>- apply the analysis on the updated window of data (see the sketch below)
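>
>A rough sketch of this loop (sc, perMinuteRdds, and the pair-RDD key type
>are placeholders, not my actual code):
>
>    import scala.collection.mutable
>    import org.apache.spark.SparkContext._  // pair-RDD operations
>    import org.apache.spark.rdd.RDD
>
>    val X = 10                                        // window size in minutes
>    val lastMinutes = mutable.Queue.empty[RDD[(String, Long)]]
>    for (minuteRdd <- perMinuteRdds) {                // one RDD per minute
>      lastMinutes.enqueue(minuteRdd)
>      if (lastMinutes.size > X) lastMinutes.dequeue() // drop the oldest minute
>      val window = sc.union(lastMinutes.toSeq)        // last X minutes
>      val stats = window.reduceByKey(_ + _)           // aggregate stats
>      stats.collect().foreach(println)
>    }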
>
>This works, but it suffers from limited parallelism.
>Do you have any recommendations/suggestions about a better implementation?
>Also, are there any recommended data collections for managing the window?
>(I'm simply using Arrays for managing data.)
>
>While working on this I started to investigate Spark Streaming.
>The problem is that I don't know if it is really possible to use it on
>already collected data.
>This post seems to indicate that it should be, but it is not clear to me how:
>http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-windowing-Driven-by-absolutely-time-td1733.html
>
>Thanks