Re: Apache Spark - Question about Structured Streaming Sink addBatch dataframe size

2018-01-05 Thread M Singh
Hi Jacek: The javadoc mentions that we can only consume data from the DataFrame in the addBatch method.  So, if I would like to save the data to a new sink, then I believe that I will need to collect the data and then save it.  This is the reason I am asking about how to control the size of
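
A minimal sketch of one way to push a batch to an external system without collecting it on the driver: the public ForeachWriter API runs on the executors, one writer per partition per trigger. This is an alternative to implementing Sink directly, not what the thread settles on; the println call is a stand-in for a real client, and `df` is assumed to be a streaming DataFrame.

  import org.apache.spark.sql.{ForeachWriter, Row}

  // Sends each micro-batch's rows to an external system from the executors,
  // instead of collect()-ing the whole batch on the driver.
  class ExternalStoreWriter extends ForeachWriter[Row] {
    override def open(partitionId: Long, version: Long): Boolean = {
      // Open a connection to the external store here (called once per
      // partition per trigger); returning true means "process this partition".
      true
    }
    override def process(row: Row): Unit = {
      // Replace println with a real write to the external store.
      println(row.mkString(","))
    }
    override def close(errorOrNull: Throwable): Unit = {
      // Close the connection / flush buffers here.
    }
  }

  // Usage, assuming a streaming DataFrame `df`:
  // df.writeStream.foreach(new ExternalStoreWriter).start()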

Re: Apache Spark - Question about Structured Streaming Sink addBatch dataframe size

2018-01-04 Thread Jacek Laskowski
Hi,
> If the data is very large then a collect may result in OOM.
That's a general concern in any part of Spark, incl. Spark Structured Streaming. Why would you collect in addBatch? It runs on the driver side and, like anything on the driver, in a single JVM (and is usually not fault tolerant).
> Do

Re: Apache Spark - Question about Structured Streaming Sink addBatch dataframe size

2018-01-04 Thread M Singh
Thanks Tathagata for your answer. The reason I was asking about controlling data size is that the javadoc indicates you can use foreach or collect on the DataFrame.  If the data is very large then a collect may result in OOM. From your answer it appears that the only way to control the size (in

Re: Apache Spark - Question about Structured Streaming Sink addBatch dataframe size

2018-01-03 Thread Tathagata Das
1. It is all the result data in that trigger. Note that it takes a DataFrame, which is a purely logical representation of data and has no association with partitions, etc., which are physical representations.
2. If you want to limit the amount of data that is processed in a trigger, then you should
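
The preview cuts off here; one common way to bound how much data a trigger processes (an assumption about where the answer is heading) is to set source-side rate limits such as maxFilesPerTrigger for file sources or maxOffsetsPerTrigger for the Kafka source. Paths, topic names, and the schema below are placeholders.

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.types._

  val spark = SparkSession.builder().appName("rate-limit-sketch").getOrCreate()

  // File source: pick up at most 10 new files per trigger.
  // File streams require an explicit schema; this one is a placeholder.
  val fileSchema = new StructType().add("id", StringType).add("value", DoubleType)
  val fileStream = spark.readStream
    .schema(fileSchema)
    .option("maxFilesPerTrigger", "10")
    .json("/path/to/input")

  // Kafka source: read at most 100000 offsets (records) per trigger,
  // spread across the topic's partitions.
  // (Requires the spark-sql-kafka package on the classpath.)
  val kafkaStream = spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "host1:9092")
    .option("subscribe", "events")
    .option("maxOffsetsPerTrigger", "100000")
    .load()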

Apache Spark - Question about Structured Streaming Sink addBatch dataframe size

2018-01-03 Thread M Singh
Hi: The documentation for Sink.addBatch is as follows:

  /**
   * Adds a batch of data to this sink. The data for a given `batchId` is deterministic and if
   * this method is called more than once with the same batchId (which will happen in the case of
   * failures), then `data` should only be
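
For context, the (internal) Spark 2.x trait that this javadoc is attached to looks roughly like this:

  package org.apache.spark.sql.execution.streaming

  import org.apache.spark.sql.DataFrame

  trait Sink {
    // Called once per trigger with that trigger's result data.
    def addBatch(batchId: Long, data: DataFrame): Unit
  }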

Re: Question about 'Structured Streaming'

2017-08-08 Thread Michael Armbrust
> 1) Parsing data/Schema creation: The Bro IDS logs have an 8-line header
> that contains the 'schema' for the data; each log (http/dns/etc.) will have
> different columns with different data types. So would I create a specific
> CSV reader inherited from the general one? Also I'm assuming this
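
The preview ends before the reply, but a sketch of the pattern suggested elsewhere in this thread is to skip a custom reader entirely: declare the schema yourself and let Spark's CSV source parse the tab-separated Bro output, skipping the '#'-prefixed header lines. The column list and path below are illustrative, not the full http.log schema.

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.types._

  val spark = SparkSession.builder().appName("bro-tsv-sketch").getOrCreate()

  // A handful of illustrative http.log columns; a real job would list every
  // column from the '#fields' header line with matching types.
  val httpSchema = new StructType()
    .add("ts", DoubleType)
    .add("uid", StringType)
    .add("id_orig_h", StringType)
    .add("id_orig_p", IntegerType)
    .add("id_resp_h", StringType)
    .add("id_resp_p", IntegerType)
    .add("method", StringType)
    .add("host", StringType)
    .add("uri", StringType)

  val httpLogs = spark.readStream
    .schema(httpSchema)
    .option("sep", "\t")        // Bro logs are tab-separated
    .option("comment", "#")     // drop the '#'-prefixed header/footer lines
    .csv("/path/to/bro/http")   // placeholder directory watched for new files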

Re: Question about 'Structured Streaming'

2017-08-08 Thread Brian Wylie
I can see your point that you don't really want an external process being used for the streaming data source. Okay, so on the CSV/TSV front, I have two follow-up questions: 1) Parsing data/Schema creation: The Bro IDS logs have an 8-line header that contains the 'schema' for the data; each log

Re: Question about 'Structured Streaming'

2017-08-08 Thread Michael Armbrust
Cool stuff! A pattern I have seen is to use our CSV/TSV or JSON support to read Bro logs, rather than a Python library. This is likely to have much better performance since we can do all of the parsing on the JVM without having to flow it through an external Python process. On Tue, Aug 8, 2017 at
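
If Bro is instead configured to emit JSON logs (it has a JSON writer), the same idea applies with Spark's JSON source; the schema and path below are again placeholders.

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.types._

  val spark = SparkSession.builder().appName("bro-json-sketch").getOrCreate()

  // Streaming JSON reads also need an explicit schema; only a couple of
  // illustrative dns.log fields are shown.
  val dnsSchema = new StructType()
    .add("ts", DoubleType)
    .add("uid", StringType)
    .add("query", StringType)
    .add("qtype_name", StringType)

  val dnsLogs = spark.readStream
    .schema(dnsSchema)
    .json("/path/to/bro/dns_json")   // placeholder directory of JSON logs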

Fwd: Python question about 'Structured Streaming'

2017-08-08 Thread Brian Wylie
Hi All, I've read the new information about Structured Streaming in Spark, looks super great. Resources that I've looked at:
- https://spark.apache.org/docs/latest/streaming-programming-guide.html
- https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html
-

Question about 'Structured Streaming'

2017-08-08 Thread Brian Wylie
Hi All, I've read the new information about Structured Streaming in Spark, looks super great. Resources that I've looked at:
- https://spark.apache.org/docs/latest/streaming-programming-guide.html
- https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html
-