Relevant: https://databricks.com/blog/2018/03/13/introducing-stream-stream-joins-in-apache-spark-2-3.html
This is a true stream-stream join, which will automatically buffer delayed data and join it with proper SQL join semantics. Please check it out :)

TD

On Wed, Mar 14, 2018 at 12:07 PM, Dylan Guedes <djmggue...@gmail.com> wrote:

> I misread it, and thought that your question was whether PySpark supports
> Kafka, lol. Sorry!
>
> On Wed, Mar 14, 2018 at 3:58 PM, Aakash Basu <aakash.spark....@gmail.com> wrote:
>
>> Hey Dylan,
>>
>> Great!
>>
>> Can you revert back to my initial and also the latest mail?
>>
>> Thanks,
>> Aakash.
>>
>> On 15-Mar-2018 12:27 AM, "Dylan Guedes" <djmggue...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I've been using Kafka with PySpark since 2.1.
>>>
>>> On Wed, Mar 14, 2018 at 3:49 PM, Aakash Basu <aakash.spark....@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> I'm yet to.
>>>>
>>>> I just want to know: since when does the Spark 2.3 Kafka 0.10 Spark
>>>> package allow Python? I read somewhere that, as of now, Scala and Java
>>>> are the languages to be used.
>>>>
>>>> Please correct me if I am wrong.
>>>>
>>>> Thanks,
>>>> Aakash.
>>>>
>>>> On 14-Mar-2018 8:24 PM, "Georg Heiler" <georg.kf.hei...@gmail.com> wrote:
>>>>
>>>>> Did you try Spark 2.3 with Structured Streaming? Its watermarking
>>>>> and plain SQL might be really interesting for you.
>>>>>
>>>>> Aakash Basu <aakash.spark....@gmail.com> wrote on Wed, 14 Mar 2018 at 14:57:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> *Info (using):*
>>>>>> *Spark Streaming Kafka 0.8 package*
>>>>>> *Spark 2.2.1*
>>>>>> *Kafka 1.0.1*
>>>>>>
>>>>>> As of now, I am feeding paragraphs into the Kafka console producer,
>>>>>> and my Spark job, acting as a receiver, is printing the flattened
>>>>>> words, which is a pure RDD operation.
>>>>>>
>>>>>> *My motive is to read two tables that are continuously being updated,
>>>>>> as two distinct Kafka topics consumed into two Spark DataFrames, join
>>>>>> them on a key, and produce the output.* (I am from a Spark-SQL
>>>>>> background; pardon my Spark-SQL-ish writing.)
>>>>>>
>>>>>> *It may happen that the first topic receives new data 15 minutes
>>>>>> before the second topic. How should I proceed in that scenario? I
>>>>>> should not lose any data.*
>>>>>>
>>>>>> As of now, I simply want to pass paragraphs, read them as an RDD,
>>>>>> convert to a DataFrame, and then join to get the common keys as the
>>>>>> output (just for R&D).
>>>>>>
>>>>>> I started using Spark Streaming and Kafka only today.
>>>>>>
>>>>>> Please help!
>>>>>>
>>>>>> Thanks,
>>>>>> Aakash.
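For reference, the stream-stream join TD points to could be sketched in PySpark (Spark 2.3+) roughly as below. The topic names (`topic1`, `topic2`), the bootstrap server, the join key `id`, and the event-time column `ts` are all illustrative assumptions, not details from the thread; the 20-minute watermark is just one way to cover the 15-minute skew Aakash describes while still letting Spark drop old state:

```python
# Sketch of a stream-stream join in Structured Streaming (Spark 2.3+).
# Topic names, schema, key column, and watermark are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, expr, from_json
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("stream-stream-join").getOrCreate()

# Assumed message layout: a key, a payload, and an event-time timestamp.
schema = StructType([
    StructField("id", StringType()),
    StructField("payload", StringType()),
    StructField("ts", TimestampType()),
])

def read_topic(topic):
    # Each Kafka topic becomes an unbounded streaming DataFrame.
    raw = (spark.readStream
           .format("kafka")
           .option("kafka.bootstrap.servers", "localhost:9092")
           .option("subscribe", topic)
           .load())
    return (raw.select(from_json(col("value").cast("string"), schema).alias("v"))
               .select("v.*"))

# Watermarks bound how long late rows are buffered; 20 minutes comfortably
# covers a 15-minute skew between the two topics.
left = (read_topic("topic1").withWatermark("ts", "20 minutes")
        .selectExpr("id AS l_id", "payload AS l_payload", "ts AS l_ts"))
right = (read_topic("topic2").withWatermark("ts", "20 minutes")
         .selectExpr("id AS r_id", "payload AS r_payload", "ts AS r_ts"))

# Inner join on the key, with a time-range condition so state can be expired.
joined = left.join(
    right,
    expr("""l_id = r_id AND
            r_ts BETWEEN l_ts - INTERVAL 20 minutes
                     AND l_ts + INTERVAL 20 minutes"""),
)

query = (joined.writeStream
         .format("console")
         .outputMode("append")
         .start())
query.awaitTermination()
```

Without the watermarks and the time-range condition, Spark would have to buffer both streams forever to honor the join; with them, a row that arrives 15 minutes late is still matched, and state older than the watermark is discarded.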