I can see your point that you don't really want an external process being
used for the streaming data source. Okay, so on the CSV/TSV front, I have
two follow-up questions:

1) Parsing data/schema creation: The Bro IDS logs have an 8-line header
that contains the 'schema' for the data, and each log type (http/dns/etc.)
has different columns with different data types. So would I create a
specific CSV reader inherited from the general one? Also, I'm assuming
this would need to be in Scala/Java? (I suck at both of those :)
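
Or is plain PySpark enough? Roughly what I'm picturing (the field names
and types below are my guesses from a conn.log '#fields' line, with the
dots swapped for underscores):

    from pyspark.sql.types import StructType, StructField, StringType, LongType

    # One schema per Bro log type; this one guesses at conn.log
    bro_conn_schema = StructType([
        StructField('ts', StringType()),
        StructField('uid', StringType()),
        StructField('id_orig_h', StringType()),
        StructField('id_orig_p', LongType()),
        StructField('id_resp_h', StringType()),
        StructField('id_resp_p', LongType()),
        StructField('proto', StringType()),
    ])

    df = (spark.read
          .option('sep', '\t')      # Bro logs are tab-separated
          .option('comment', '#')   # skip the '#'-prefixed header lines
          .schema(bro_conn_schema)
          .csv('bro_logs/conn.log'))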

2) Dynamic tailing: Do the CSV/TSV data sources support dynamic tailing,
and do they handle log rotations?
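
i.e., if I point readStream at the log directory like this (schema as
above; my understanding is that a schema is required for streaming file
sources), will it notice new lines appended to an open log, or only
brand-new files dropped in by rotation?

    streaming_df = (spark.readStream
                    .option('sep', '\t')
                    .option('comment', '#')
                    .schema(bro_conn_schema)
                    .csv('bro_logs/'))   # watch the directory for new files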

Thanks, and BTW your Spark Summit talks are really well done and
informative. You're an excellent speaker.

-Brian

On Tue, Aug 8, 2017 at 2:09 PM, Michael Armbrust <mich...@databricks.com>
wrote:

> Cool stuff! A pattern I have seen is to use our CSV/TSV or JSON support to
> read Bro logs, rather than a Python library. This is likely to have much
> better performance, since we can do all of the parsing on the JVM without
> having to flow it through an external Python process.
>
> On Tue, Aug 8, 2017 at 9:35 AM, Brian Wylie <briford.wy...@gmail.com>
> wrote:
>
>> Hi All,
>>
>> I've read the new information about Structured Streaming in Spark; it
>> looks super great.
>>
>> Resources that I've looked at:
>> - https://spark.apache.org/docs/latest/streaming-programming-guide.html
>> - https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html
>> - https://spark.apache.org/docs/latest/streaming-custom-receivers.html
>> - http://cdn2.hubspot.net/hubfs/438089/notebooks/spark2.0/Structured%20Streaming%20using%20Python%20DataFrames%20API.html
>>
>> + YouTube videos from Spark Summit 2016/2017
>>
>> So finally getting to my question:
>>
>> I have Python code that uses Python generators... this is a great
>> streaming approach within Python. I've used it for network packet
>> processing and a bunch of other stuff. I'd love to simply hook up this
>> generator (which yields Python dictionaries), along with a schema
>> definition, to create an 'unbounded DataFrame' as discussed in
>> https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html
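>>
>> For concreteness, my generators look roughly like this (a simplified
>> sketch; the real version keeps tailing the file as it grows):
>>
>>     def bro_log_entries(log_path):
>>         """Yield each row of a Bro log as a Python dictionary."""
>>         with open(log_path) as log:
>>             fields = []
>>             for line in log:
>>                 if line.startswith('#fields'):
>>                     # the '#fields' header line names the columns
>>                     fields = line.rstrip('\n').split('\t')[1:]
>>                 elif fields and not line.startswith('#'):
>>                     yield dict(zip(fields, line.rstrip('\n').split('\t')))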
>>
>> Possible approaches:
>> - Make a custom receiver in Python:
>>   https://spark.apache.org/docs/latest/streaming-custom-receivers.html
>> - Use Kafka (definitely possible and good, but overkill for my use case)
>> - Send data out over a socket and use socketTextStream to pull it back in
>>   (seems a bit silly to me)
>> - Other???
>>
>> Since Python generators fit so naturally into streaming pipelines, I'd
>> think it would be straightforward to 'couple' a generator into a Spark
>> Structured Streaming pipeline.
>>
>> I've put together a small notebook just to give a concrete example
>> (streaming Bro IDS network data):
>> https://github.com/Kitware/BroThon/blob/master/notebooks/Bro_IDS_to_Spark.ipynb
>>
>> Any thoughts/suggestions/pointers are greatly appreciated.
>>
>> -Brian
>>
>>
>
