I can see your point that you don't really want an external process being used for the streaming data source. Okay, so on the CSV/TSV front, I have two follow-up questions:

1) Parsing data/schema creation: The Bro IDS logs have an 8-line header that contains the 'schema' for the data, and each log type (http/dns/etc.) will have different columns with different data types. So would I create a specific CSV reader inherited from the general one? Also, I'm assuming this would need to be in Scala/Java? (I suck at both of those :) (See the sketch after these questions for the kind of thing I'm imagining.)

2) Dynamic tailing: Do the CSV/TSV data sources support dynamic tailing, and do they handle log rotations?
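To make question 1 concrete, here's a rough, untested PySpark sketch of what I'm imagining: read the #fields/#types lines from the 8-line Bro header to build a schema, then point the built-in CSV reader at the log directory. The paths and the Bro-to-Spark type mapping are just placeholder assumptions on my part:

from pyspark.sql import SparkSession
from pyspark.sql.types import (StructType, StructField, StringType,
                               LongType, DoubleType)

# Assumed mapping from Bro types to Spark types; anything not listed
# (addr, string, bool, ...) falls back to StringType.
BRO_TO_SPARK = {'time': DoubleType(),      # epoch seconds with decimals
                'interval': DoubleType(),
                'count': LongType(),
                'port': LongType()}

def bro_schema(log_path):
    """Build a StructType from the #fields/#types header lines."""
    fields, types = None, None
    with open(log_path) as f:
        for line in f:
            if line.startswith('#fields'):
                fields = line.rstrip('\n').split('\t')[1:]
            elif line.startswith('#types'):
                types = line.rstrip('\n').split('\t')[1:]
            if fields and types:
                break
    return StructType([StructField(name, BRO_TO_SPARK.get(t, StringType()), True)
                       for name, t in zip(fields, types)])

spark = SparkSession.builder.appName('bro').getOrCreate()
schema = bro_schema('/bro/logs/dns.log')   # hypothetical path

# The 'comment' option makes the CSV reader skip the #-prefixed header
# (and the trailing #close line), so no custom data source is needed.
dns = (spark.readStream
            .option('sep', '\t')
            .option('comment', '#')
            .schema(schema)
            .csv('/bro/logs/dns*.log'))

If that's roughly right, it also partially answers question 2: as I understand it, the file source picks up new files that appear under the path, so complete rotated-out logs would work naturally, but I don't think it tails appends to a single open file. Please correct me if I'm wrong.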
Thanks, and BTW your Spark Summit talks are really well done and informative. You're an excellent speaker.

-Brian

On Tue, Aug 8, 2017 at 2:09 PM, Michael Armbrust <mich...@databricks.com> wrote:

> Cool stuff! A pattern I have seen is to use our CSV/TSV or JSON support to
> read bro logs, rather than a python library. This is likely to have much
> better performance since we can do all of the parsing on the JVM without
> having to flow it through an external python process.
>
> On Tue, Aug 8, 2017 at 9:35 AM, Brian Wylie <briford.wy...@gmail.com> wrote:
>
>> Hi All,
>>
>> I've read the new information about Structured Streaming in Spark; it
>> looks super great.
>>
>> Resources that I've looked at:
>> - https://spark.apache.org/docs/latest/streaming-programming-guide.html
>> - https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html
>> - https://spark.apache.org/docs/latest/streaming-custom-receivers.html
>> - http://cdn2.hubspot.net/hubfs/438089/notebooks/spark2.0/Structured%20Streaming%20using%20Python%20DataFrames%20API.html
>> + YouTube videos from Spark Summit 2016/2017
>>
>> So, finally getting to my question:
>>
>> I have Python code that yields a Python generator... this is a great
>> streaming approach within Python. I've used it for network packet
>> processing and a bunch of other stuff. I'd love to simply hook up this
>> generator (which yields Python dictionaries), along with a schema
>> definition, to create an 'unbounded DataFrame' as discussed in
>> https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html
>>
>> Possible approaches:
>> - Make a custom receiver in Python:
>>   https://spark.apache.org/docs/latest/streaming-custom-receivers.html
>> - Use Kafka (definitely possible and good, but overkill for my use case)
>> - Send the data out a socket and use socketTextStream to pull it back in
>>   (seems a bit silly to me)
>> - Other???
>>
>> Since Python generators fit so naturally into streaming pipelines, I'd
>> think it would be straightforward to 'couple' a Python generator into a
>> Spark Structured Streaming pipeline.
>>
>> I've put together a small notebook just to give a concrete example
>> (streaming Bro IDS network data):
>> https://github.com/Kitware/BroThon/blob/master/notebooks/Bro_IDS_to_Spark.ipynb
>>
>> Any thoughts/suggestions/pointers are greatly appreciated.
>>
>> -Brian
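P.S. Re: the generator question quoted above, the simplest coupling I can think of that still keeps all the parsing on the JVM side is to have the generator spool micro-batches of JSON lines into a directory and let the file source treat that directory as the unbounded input. An untested sketch, with all paths and names made up:

import itertools
import json
import os
import tempfile

def spool(generator, out_dir, batch_size=1000):
    """Write dicts from `generator` into JSON-lines files, one per batch."""
    batches = iter(lambda: list(itertools.islice(generator, batch_size)), [])
    for i, batch in enumerate(batches):
        # Write to a dot-prefixed temp file first (Spark's file source
        # ignores names starting with '.'), then rename it into place so
        # Spark never reads a half-written file.
        fd, tmp = tempfile.mkstemp(dir=out_dir, prefix='.')
        with os.fdopen(fd, 'w') as f:
            for record in batch:
                f.write(json.dumps(record) + '\n')
        os.rename(tmp, os.path.join(out_dir, 'batch-%06d.json' % i))

# Spark side: the JSON file source turns the directory into an unbounded
# DataFrame (schema defined elsewhere):
#   records = spark.readStream.schema(schema).json('/tmp/spool')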