Hi Jacek:
The javadoc mentions that we can only consume the data from the DataFrame in
the addBatch method. So if I would like to save the data to a new sink, then I
believe I will need to collect the data and then save it. This is the
reason I am asking about how to control the size of the data in each batch.
Hi,
> If the data is very large then a collect may result in OOM.
That's a general concern in any part of Spark, incl. Spark Structured
Streaming. Why would you collect in addBatch? collect runs on the driver
side, and like anything on the driver, it's a single JVM (and usually not
fault tolerant).
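
To make the distinction concrete, here is a minimal sketch of the two ways of
consuming a batch (not from this thread; saveToExternalSink is a hypothetical
placeholder for whatever the external sink's write call is):

import org.apache.spark.sql.{DataFrame, Row}

object ConsumePatterns {
  // Risky: every row of the batch is materialized in the driver's one JVM.
  def collectAndSave(data: DataFrame): Unit = {
    val rows: Array[Row] = data.collect()
    rows.foreach(saveToExternalSink) // runs entirely on the driver
  }

  // Safer for large batches: each executor writes its own partition,
  // so the driver never holds the data.
  def saveDistributed(data: DataFrame): Unit = {
    data.foreachPartition { (rows: Iterator[Row]) =>
      rows.foreach(saveToExternalSink) // runs on an executor
    }
  }

  // Hypothetical stand-in for a real external write.
  private def saveToExternalSink(row: Row): Unit = ()
}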
Thanks Tathagata for your answer.
The reason I was asking about controlling data size is that the javadoc
indicates you can use foreach or collect on the DataFrame. If the data is very
large then a collect may result in OOM.
From your answer it appears that the only way to control the size (in a
trigger) is at the source.
1. It is all the result data in that trigger. Note that it takes a
DataFrame, which is a purely logical representation of data and has no
association with partitions, etc., which are physical representations.
2. If you want to limit the amount of data that is processed in a trigger,
then you should limit it at the source, e.g. with rate-limiting options such
as maxFilesPerTrigger on the file source or maxOffsetsPerTrigger on the
Kafka source.
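
For example, a minimal sketch of point 2 with the file source (the schema,
paths, and the cap of 10 files are illustrative):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StringType, StructType}

object TriggerSizeExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("trigger-size-example").getOrCreate()

    // Hypothetical two-column schema, just for illustration.
    val schema = new StructType()
      .add("ts", StringType)
      .add("msg", StringType)

    // Cap how many new files each micro-batch picks up, so the DataFrame
    // handed to the sink is bounded per trigger.
    val df = spark.readStream
      .schema(schema)
      .option("maxFilesPerTrigger", 10)
      .csv("/tmp/input")

    val query = df.writeStream
      .format("parquet")
      .option("path", "/tmp/output")
      .option("checkpointLocation", "/tmp/checkpoint")
      .start()

    query.awaitTermination()
  }
}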
Hi:
The documentation for Sink.addBatch is as follows:

/**
 * Adds a batch of data to this sink. The data for a given `batchId` is
 * deterministic and if this method is called more than once with the same
 * batchId (which will happen in the case of failures), then `data` should
 * only be added once.
 */
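
For reference, a bare-bones sink honoring that contract might look like the
sketch below. Note this is my own illustration, not code from the thread:
Sink is the internal trait in org.apache.spark.sql.execution.streaming (not a
stable public API), and the println is a stand-in for a real write.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.execution.streaming.Sink

class LoggingSink extends Sink {
  // Remember the highest batch committed so that a replayed batchId
  // (re-delivered after a failure) is added only once, per the contract.
  @volatile private var lastCommittedBatchId: Long = -1L

  override def addBatch(batchId: Long, data: DataFrame): Unit = {
    if (batchId > lastCommittedBatchId) {
      // Consume the batch; collect/foreach is what the javadoc permits.
      // (As discussed above, collect pulls every row into the driver JVM.)
      data.collect().foreach(println) // stand-in for a real write
      lastCommittedBatchId = batchId
    }
    // else: same batchId replayed after a failure; already added, so skip.
  }
}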
>
> 1) Parsing data/schema creation: The Bro IDS logs have an 8-line header
> that contains the 'schema' for the data; each log (http/dns/etc.) will have
> different columns with different data types. So would I create a specific
> CSV reader inherited from the general one? Also, I'm assuming this
I can see your point that you don't really want an external process being
used for the streaming data source. Okay, so on the CSV/TSV front, I have
two follow-up questions:
1) Parsing data/schema creation: The Bro IDS logs have an 8-line header that
contains the 'schema' for the data; each log
Cool stuff! A pattern I have seen is to use our CSV/TSV or JSON support to
read bro logs, rather than a Python library. This is likely to have much
better performance, since we can do all of the parsing on the JVM without
having to flow it through an external Python process.
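
To make that pattern concrete, here is a minimal sketch of deriving a schema
from a Bro TSV header and letting Spark's built-in CSV reader do the parsing
on the JVM. The paths, the 8-line-header assumption, and the type mapping are
illustrative, not from the thread:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._
import scala.io.Source

object BroLogReader {
  // Map Bro's '#types' tokens to Spark SQL types (deliberately simplified).
  private def broType(t: String): DataType = t match {
    case "count" | "int"       => LongType
    case "double" | "interval" => DoubleType
    case "bool"                => BooleanType
    case _                     => StringType // time, addr, port, enum, ...
  }

  // Build a StructType from the '#fields' and '#types' header lines.
  def schemaFor(logFile: String): StructType = {
    val header = Source.fromFile(logFile).getLines().take(8).toList
    val fields = header.find(_.startsWith("#fields")).get.split("\t").tail
    val types  = header.find(_.startsWith("#types")).get.split("\t").tail
    StructType(fields.zip(types).map { case (f, t) => StructField(f, broType(t)) })
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("bro-logs").getOrCreate()
    val schema = schemaFor("/bro/current/http.log") // hypothetical sample file

    // Stream the directory with the built-in CSV reader: tab-separated,
    // with '#' header/footer lines skipped as comments.
    val http = spark.readStream
      .schema(schema)
      .option("sep", "\t")
      .option("comment", "#")
      .csv("/bro/current") // hypothetical log directory

    val query = http.writeStream
      .format("console")
      .option("checkpointLocation", "/tmp/bro-checkpoint")
      .start()
    query.awaitTermination()
  }
}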
Hi All,
I've read the new information about Structured Streaming in Spark; it looks
super great.
Resources that I've looked at:
- https://spark.apache.org/docs/latest/streaming-programming-guide.html
- https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html
- ...