I am using structured streaming to evaluate multiple rules on the same
running stream.
I have two options to do that. One is to use foreach and evaluate all the
rules on each row.
The other option is to express the rules in the Spark SQL DSL and run
multiple queries.
I was wondering if option 1 will result in
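A minimal sketch of the two options, assuming a Kafka source; the topic, rule
predicates, and sinks below are hypothetical and not from this message:

import org.apache.spark.sql.{ForeachWriter, Row, SparkSession}

val spark = SparkSession.builder.appName("rule-evaluation").getOrCreate()
import spark.implicits._

// Hypothetical streaming source.
val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "events")
  .load()
  .selectExpr("CAST(value AS STRING) AS body")

// Option 1: a single query; every rule is evaluated per row in a ForeachWriter.
val q1 = events.writeStream.foreach(new ForeachWriter[Row] {
  def open(partitionId: Long, version: Long): Boolean = true
  def process(row: Row): Unit = {
    val body = row.getString(0)
    // evaluate rule1(body), rule2(body), ... and emit alerts here
  }
  def close(errorOrNull: Throwable): Unit = ()
}).start()

// Option 2: one query per rule, expressed in the DataFrame/SQL DSL.
val rule1Hits = events.filter($"body".contains("ERROR"))   // hypothetical rule
val q2 = rule1Hits.writeStream.format("console").start()

Option 1 keeps a single query but moves rule evaluation into opaque user code;
option 2 keeps the rules visible to the optimizer, at the cost of one query
(and one independent read of the source) per rule.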
From: Sumit Saraswat [mailto:sumitxapa...@gmail.com]
Sent: Tuesday, August 8, 2017 10:33 PM
To: user@spark.apache.org
Subject: Unsubscribe
Unsubscribe
Hi,
We've built a batch application on Spark 1.6.1. I'm looking into how to run
the same code as a streaming (DStream-based) application, using pyspark.
In the batch application, we have a sequence of transforms that read from
file, do dataframe operations, then write to file. I was
>
> 1) Parsing data/Schema creation: The Bro IDS logs have an 8-line header
> that contains the 'schema' for the data; each log (http/dns/etc.) will have
> different columns with different data types. So would I create a specific
> CSV reader inherited from the general one? Also I'm assuming this
Hi,
we have a structured streaming app consuming data from Kafka and writing to
S3.
I keep getting this timeout exception whenever the executors are configured
to run with more than one core per executor. If anyone can share any info
related to this, it would be great.
17/08/08
I have data streaming into my Spark Scala application in this format:
id     mark1  mark2  mark3  time
uuid1  100    200    300    Tue Aug 8 14:06:02 PDT 2017
uuid1  100    200    300    Tue Aug 8 14:06:22 PDT 2017
uuid2  150    250    350    Tue Aug 8 14:06:32 PDT 2017
uuid2  150    250    350    Tue Aug 8
Considering the @transient annotations and the work done in the instance
initializer, not much state is really being broadcast to the executors. It
might be simpler to just create these instances on the executors, rather
than trying to broadcast them?
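A minimal sketch of that alternative, assuming a hypothetical, non-serializable
RuleEngine that is expensive to build (the names are illustrative only):
construct it lazily once per executor JVM instead of broadcasting it.

import org.apache.spark.sql.SparkSession

// Hypothetical heavyweight, non-serializable helper.
class RuleEngine {
  def evaluate(s: String): Boolean = s.nonEmpty
}

// One instance per executor JVM: the object is initialized lazily on first
// use on each executor, so nothing has to be serialized from the driver.
object RuleEngineHolder {
  lazy val engine: RuleEngine = new RuleEngine
}

val spark = SparkSession.builder.appName("executor-side-init").getOrCreate()
val result = spark.sparkContext
  .parallelize(Seq("a", "b", ""))
  .mapPartitions { rows =>
    val engine = RuleEngineHolder.engine  // created on the executor, not broadcast
    rows.filter(engine.evaluate)
  }
  .collect()

The lazy val in the holder object is initialized at most once per executor JVM,
which avoids both serialization and repeated construction.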
I can see your point that you don't really want an external process being
used for the streaming data source. Okay, so on the CSV/TSV front, I have
two follow-up questions:
1) Parsing data/Schema creation: The Bro IDS logs have an 8-line header that
contains the 'schema' for the data, each log
Hello list,
We have a Spark application that performs a set of ETLs: reading messages
from a Kafka topic, categorizing them, and writing the contents out as
Parquet files on HDFS. After writing, we are querying the data from HDFS
using Presto's Hive integration. We are having problems because the
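For context, a minimal sketch of the write side of such a pipeline; the
category column, path, and partitioning are hypothetical, and the Kafka read
and categorization logic are stubbed out:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("etl-to-parquet").getOrCreate()
import spark.implicits._

// 'categorized' stands in for the DataFrame built from the Kafka messages.
val categorized = Seq(
  ("error", """{"msg":"boom"}"""),
  ("info",  """{"msg":"ok"}""")
).toDF("category", "body")

// Hive-style partition directories (category=error/, category=info/) are the
// layout Presto's Hive integration expects under the table location.
categorized.write
  .mode("append")
  .partitionBy("category")
  .parquet("hdfs:///warehouse/events")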
Cool stuff! A pattern I have seen is to use our CSV/TSV or JSON support to
read Bro logs, rather than a Python library. This is likely to have much
better performance since we can do all of the parsing on the JVM without
having to flow it through an external Python process.
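A minimal sketch of that pattern, assuming Bro's default tab-separated format
and a hypothetical, hand-written schema for one log type (the column subset
here is illustrative only):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder.appName("bro-logs").getOrCreate()

// Hypothetical subset of a Bro conn.log schema; real logs have more fields.
val connSchema = StructType(Seq(
  StructField("ts", DoubleType),
  StructField("uid", StringType),
  StructField("id_orig_h", StringType),
  StructField("id_orig_p", IntegerType),
  StructField("id_resp_h", StringType),
  StructField("id_resp_p", IntegerType),
  StructField("proto", StringType)
))

// Bro logs are tab-separated; the '#'-prefixed header and footer lines are
// skipped here by treating '#' as a comment character.
val conn = spark.read
  .option("sep", "\t")
  .option("comment", "#")
  .schema(connSchema)
  .csv("hdfs:///bro/conn.log")

The comment option only skips the '#' lines; the field names still have to be
supplied by hand, or parsed once from the #fields header line on the driver.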
On Tue, Aug 8, 2017 at
From: john_test_test
Sent: Wednesday, August 9, 2017 3:09:44 AM
To: user@spark.apache.org
Subject: speculative execution in spark
Is it somehow possible to take advantage of the already processed portion
of the failed task so I can
No. Will take a look now.
On Tue, Aug 8, 2017 at 1:47 PM, Riccardo Ferrari wrote:
> Hi,
>
> Have you tried to check the "Streaming" tab menu?
>
> Best,
>
> On Tue, Aug 8, 2017 at 4:15 PM, KhajaAsmath Mohammed <
> mdkhajaasm...@gmail.com> wrote:
>
>> Hi,
>>
>> I am running
Hi,
Have you tried to check the "Streaming" tab menu?
Best,
On Tue, Aug 8, 2017 at 4:15 PM, KhajaAsmath Mohammed <
mdkhajaasm...@gmail.com> wrote:
> Hi,
>
> I am running a Spark streaming job which receives data from Azure IoT Hub. I
> am not sure if the connection was successful and receiving
Scala doesn't support ranges with more than Int.MaxValue elements:
https://github.com/scala/scala/blob/2.12.x/src/library/scala/collection/immutable/Range.scala?utf8=✓#L89
You can create two RDDs and union them:
scala> val rdd = sc.parallelize(1L to Int.MaxValue.toLong).union(sc.parallelize(1L to
Int.MaxValue.toLong))
Hello,
I'd like to count more than Int.MaxValue elements, but I encountered the
following error.
scala> val rdd = sc.parallelize(1L to Int.MaxValue*2.toLong)
rdd: org.apache.spark.rdd.RDD[Long] = ParallelCollectionRDD[28] at
parallelize at <console>:24
scala> rdd.count
java.lang.IllegalArgumentException: More
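As an aside (not from this thread), SparkContext.range takes Long bounds
directly, so the large RDD can also be generated without building a Scala
range at all:

// sc.range produces an RDD[Long] lazily, so more than Int.MaxValue
// elements can be generated and counted.
val rdd = sc.range(1L, 2L * Int.MaxValue + 1L)
rdd.count()   // 4294967294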
Hi All,
I've read the new information about Structured Streaming in Spark, looks
super great.
Resources that I've looked at
- https://spark.apache.org/docs/latest/streaming-programming-guide.html
- https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html
-
Is it somehow possible to take advantage of the already processed portion
of the failed task, so I can use speculative execution to reassign only
what is left from the original task to another node? If yes, then how can
I read it from memory?
Unsubscribe
Hi All,
I've read the new information about Structured Streaming in Spark, looks
super great.
Resources that I've looked at
- https://spark.apache.org/docs/latest/streaming-programming-guide.html
-
https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html
-
unsubscribe
Hi,
I am doing a large number of aggregations on a DataFrame (without groupBy) to
get some statistics. As part of this I am doing an approx_count_distinct(c,
0.01).
Everything works fine, but when I do the same aggregation a second time (for
each column) I get the following error:
[Stage 2:>
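For reference, a minimal sketch of that style of aggregation without groupBy;
the data and column names are hypothetical, not the original job:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.appName("agg-stats").getOrCreate()
import spark.implicits._

val df = Seq(("a", 1.0), ("b", 2.0), ("a", 3.0)).toDF("c", "v")

// Several aggregations in a single pass, no groupBy.
df.agg(
  approx_count_distinct($"c", 0.01).as("c_distinct"),
  mean($"v").as("v_mean"),
  stddev($"v").as("v_stddev")).show()

// The same aggregate repeated a second time in another agg() call.
df.agg(approx_count_distinct($"c", 0.01).as("c_distinct_again")).show()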
Hi,
I am running a Spark streaming job which receives data from Azure IoT Hub. I
am not sure if the connection was successful and receiving any data. Does
the input column show how much data it has read if the connection was
successful?
[image: Inline image 1]
unsubscribe
On Mon, Aug 7, 2017 at 2:57 PM, Sumit Saraswat
wrote:
> Unsubscribe
>
Hi Michael,
That reflects my sentiments so well. Thanks for having confirmed my thoughts!
https://issues.apache.org/jira/browse/SPARK-21667
Regards,
Jacek Laskowski
https://medium.com/@jaceklaskowski/
Mastering Apache Spark 2 https://bit.ly/mastering-apache-spark
Follow me at
Hi, John.
I've had similar problems. IIRC, the driver was GCing madly. I don't know
why the driver was doing so much work but I quickly implemented an
alternative approach. The code I wrote belongs to my client but I wrote
something that should be equivalent. It can be found at:
For DataFrame (and Dataset) cache(), neither Java nor Kryo serialization
is used. There is no way to use Java or Kryo serialization for the
in-memory data cached by DataFrame.cache() or Dataset.cache().
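For illustration, a small sketch of those calls (the session and data here are
hypothetical); as stated above, cache() keeps the data in Spark SQL's internal
columnar format regardless of the configured serializer:

import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder.appName("cache-demo").getOrCreate()

val df = spark.range(1000000L).toDF("id")

// Dataset.cache() is persist(StorageLevel.MEMORY_AND_DISK); the rows are
// stored in the internal columnar cache, not via spark.serializer.
df.cache()
df.count()   // materializes the cache

// An explicit storage level can still be requested on another Dataset:
val df2 = spark.range(1000000L).toDF("id")
df2.persist(StorageLevel.MEMORY_ONLY)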
Are you talking about serialization to disk? In the previous mail, I talked
only about in-memory caching.
Regards,