[pyspark] structured streaming deployment & monitoring recommendation

2018-02-12 Thread Bram
Hi,

I have questions regarding spark structured streaming deployment
recommendation

I have +- 100 kafka topics that can be processed using similar code block.
I am using pyspark 2.2.1

Here is the idea:


TOPIC_LIST = ["topic_a","topic_b"."topic_c"]

stream = {}
for t in TOPIC_LIST:
stream[t] = spark \
.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers",
"kafka-broker1:9092,kafka-broker2:9092") \
.option("subscribe", "{0}".format(t)) \
.load().writeStream \
.trigger(processingTime="10 seconds") \
.format("console") \
.start()


So far, I have these options:
1. do a single spark-submit that runs all topics as queries in one app
2. do one spark-submit per topic (see the sketch below)
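
For option 2, the same driver script could be reused for every topic by
taking the topic name as an argument, so each supervisor program just
invokes it with a different topic. Untested sketch; the checkpoint path
is made up:

import sys
from pyspark.sql import SparkSession

# Usage: spark-submit per_topic_stream.py <topic_name>
topic = sys.argv[1]

spark = SparkSession.builder \
    .appName("stream-{0}".format(topic)) \
    .getOrCreate()

query = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers",
            "kafka-broker1:9092,kafka-broker2:9092") \
    .option("subscribe", topic) \
    .load() \
    .writeStream \
    .trigger(processingTime="10 seconds") \
    .format("console") \
    .option("checkpointLocation", "/tmp/checkpoints/{0}".format(topic)) \
    .start()

# Block until this one query stops; supervisor restarts the process on exit.
query.awaitTermination()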

I am thinking of deploying this with supervisor or upstart so that I get
notified when a Spark app goes down.

With a single Spark app I have more control over server resources, but it
is harder to tell which query is having a problem. With multiple
spark-submits I am afraid I would need to reserve +- 100 CPU cores, which
I doubt would be fully utilized since some of the topics do not have high
throughput; on the other hand, with +- 100 scripts maintained by
supervisor I would get an alert for each of them.
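
For the single-app option, one idea I have is a watchdog loop in the
driver that polls every query and alerts on failure. Rough sketch on the
PySpark 2.2 StreamingQuery API; the alert function is just a placeholder:

import time

def alert(topic, error):
    # Placeholder: hook this up to email/Slack/PagerDuty etc.
    print("ALERT: query for {0} stopped: {1}".format(topic, error))

while True:
    for t, q in stream.items():
        if not q.isActive:
            # exception() returns the failure cause, or None on a clean stop
            alert(t, q.exception())
    time.sleep(30)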

For monitoring I am considering the approach described here:
https://argus-sec.com/monitoring-spark-prometheus/
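
That article wires Spark's metrics sink through Graphite into Prometheus.
An alternative I am toying with is pushing each query's progress numbers
to a Prometheus Pushgateway straight from the driver. Sketch only: it
assumes the prometheus_client package and a pushgateway at
pushgateway:9091, neither of which is part of Spark:

from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

registry = CollectorRegistry()
rows_per_sec = Gauge("stream_input_rows_per_second",
                     "Input rows per second per topic",
                     ["topic"], registry=registry)

def push_metrics(streams):
    for t, q in streams.items():
        progress = q.lastProgress  # dict of the latest progress, or None
        if progress:
            rows_per_sec.labels(topic=t).set(
                progress.get("inputRowsPerSecond", 0.0))
    push_to_gateway("pushgateway:9091", job="structured_streaming",
                    registry=registry)

This could be called from the same watchdog loop above.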

Any recommendations or advice, guys?

Thanks

Regards,

Abraham


[Structured Streaming] Deserializing avro messages from kafka source using schema registry

2018-02-09 Thread Bram
Hi,

I couldn't find any documentation on Avro message deserialization in
pyspark structured streaming. My aim is to use the Confluent Schema
Registry to fetch the per-topic schema and then parse the Avro messages
with it.

I found one example, but it uses the DirectStream approach:
https://stackoverflow.com/questions/30339636/spark-python-avro-kafka-deserialiser
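
For what it's worth, this is roughly the direction I am imagining: fetch
the schema from the registry over REST, strip the 5-byte Confluent
wire-format header (1 magic byte + 4-byte schema id), and decode the rest
with fastavro inside a UDF. Untested sketch; the registry URL and the
requests/fastavro libraries are my assumptions, not anything from Spark:

import io
import json
import requests
import fastavro
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

REGISTRY_URL = "http://schema-registry:8081"

def fetch_schema(topic):
    # Confluent convention: value schemas are registered under "<topic>-value"
    resp = requests.get("{0}/subjects/{1}-value/versions/latest"
                        .format(REGISTRY_URL, topic))
    return fastavro.parse_schema(json.loads(resp.json()["schema"]))

schema = fetch_schema("topic_a")

def decode_avro(value):
    # Confluent wire format: skip magic byte + schema id, then Avro payload
    payload = io.BytesIO(value[5:])
    record = fastavro.schemaless_reader(payload, schema)
    return json.dumps(record)

decode_udf = udf(decode_avro, StringType())

# stream_df is the raw Kafka readStream DataFrame from my first e-mail
df = stream_df.select(decode_udf("value").alias("json_value"))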

Can anyone show me how, or confirm whether this is the right direction?

Thanks

Regards,

Abraham