15, 2015 8:38 AM
To: 'Ilove Data'; 'Tathagata Das'
Cc: 'Akhil Das'; 'user'
Subject: RE: Join between DStream and Periodically-Changing-RDD
Then go for the second option I suggested: simply turn (keep turning) your
HDFS file (Batch RDD) into a stream of messages (outside Spark Streaming).
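That approach can be sketched with Spark Streaming's file-monitoring input source; the directory path, socket source, and batch interval below are illustrative assumptions, not part of the original suggestion:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("join-with-file-stream")
val ssc = new StreamingContext(conf, Seconds(20))

// Main event stream, keyed on the first comma-separated field for the join.
val kvDstream = ssc.socketTextStream("localhost", 9999)
  .map(line => (line.split(",")(0), line))

// textFileStream monitors the folder and emits each newly added file as it
// arrives, so the HDFS data becomes just another stream of messages.
val refDstream = ssc.textFileStream("/data/reference")
  .map(line => (line.split(",")(0), line))

// Per-batch join: only reference records arriving in the current batch are
// visible, unless they are retained via stateful operations.
val joined = kvDstream.join(refDstream)

ssc.start()
ssc.awaitTermination()
```

Note the caveat in the last comment: without state, files that arrived in earlier batches are no longer part of the join, which is the limitation raised later in this thread.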
Subject: Re: Join between DStream and Periodically-Changing-RDD
@Akhil Das
Joining two DStreams might not be an option, since I want to join the stream
with historical data in an HDFS folder.
@Tathagata Das @Evo Eftimov
The batch RDD to be reloaded is considerably large compared to the DStream data.
RDDs are immutable, so why not join two DStreams?
Not sure, but you can also try something like this:

kvDstream.foreachRDD(rdd => {
  val file = ssc.sparkContext.textFile("/sigmoid/")
  val kvFile = file.map(x => (x.split(",")(0), x))
  rdd.join(kvFile)
})
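Re-reading a large HDFS dataset on every 20-second batch is expensive. One common variant (a sketch only; the refresh period, path, and key format are assumed here) is to reload the batch RDD inside `transform` only when it has gone stale, and reuse a cached copy in between:

```scala
import org.apache.spark.rdd.RDD

// Illustrative sketch: refresh the reference RDD at most every 10 minutes,
// matching the rate at which new files land in the folder.
var refRdd: RDD[(String, String)] = null
var lastLoad = 0L
val refreshMs = 10 * 60 * 1000L  // assumed refresh period

val joined = kvDstream.transform { rdd =>
  // transform's closure runs on the driver once per batch, so a plain
  // mutable reference is safe here.
  val now = System.currentTimeMillis()
  if (refRdd == null || now - lastLoad > refreshMs) {
    if (refRdd != null) refRdd.unpersist()
    refRdd = rdd.sparkContext.textFile("/sigmoid/")
      .map(x => (x.split(",")(0), x))
      .cache()
    lastLoad = now
  }
  rdd.join(refRdd)
}
```

This keeps the join against the full historical data (unlike a DStream-to-DStream join) while amortizing the reload cost over many batches.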
Thanks
Best Regards
Hi,
I'm trying to join a DStream with a batch interval of, say, 20s against an RDD
loaded from an HDFS folder that changes periodically, say a new file arriving
in the folder every 10 minutes.
How should this be done, considering that the files in the HDFS folder are
periodically changing and new files keep being added?