We would like to maintain arbitrary state between the micro-batch iterations
of Structured Streaming in Python.
How can we maintain a static DataFrame that acts as state between the
iterations?
Several options that may be relevant (a sketch of option 1 follows below):
1. in Spark memory (distributed across the workers)
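A minimal sketch of option 1, written in Scala for illustration (this API is
not exposed in PySpark as of 2.3); the SensorEvent/RunningSum types and the
events Dataset are hypothetical, and a SparkSession named spark is assumed:

import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout}
import spark.implicits._

case class SensorEvent(key: String, value: Double)
case class RunningSum(key: String, total: Double)

// events: Dataset[SensorEvent] read from a streaming source
val updated = events
  .groupByKey(_.key)
  .mapGroupsWithState(GroupStateTimeout.NoTimeout) {
    (key: String, rows: Iterator[SensorEvent], state: GroupState[Double]) =>
      // state is held in Spark's state store and survives across micro-batches
      val total = state.getOption.getOrElse(0.0) + rows.map(_.value).sum
      state.update(total)
      RunningSum(key, total)
  }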
How about using Livy to submit jobs?
On Thu, 17 May 2018 at 7:24 am, Marcelo Vanzin wrote:
> You can either:
>
> - set spark.yarn.submit.waitAppCompletion=false, which will make
> spark-submit go away once the app starts in cluster mode.
> - use the (new in 2.3) InProcessLauncher class + some custom Java code
> to submit all the apps from the same "launcher" process.
You can either:
- set spark.yarn.submit.waitAppCompletion=false, which will make
spark-submit go away once the app starts in cluster mode.
- use the (new in 2.3) InProcessLauncher class + some custom Java code
to submit all the apps from the same "launcher" process.
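A rough sketch of the second option, assuming Spark 2.3's InProcessLauncher;
the jar path and main class below are placeholders:

import org.apache.spark.launcher.{InProcessLauncher, SparkAppHandle}

// submits the app from the current JVM instead of forking a spark-submit JVM
val handle: SparkAppHandle = new InProcessLauncher()
  .setAppResource("/path/to/app.jar")   // placeholder
  .setMainClass("com.example.Main")     // placeholder
  .setMaster("yarn")
  .setDeployMode("cluster")
  .setConf("spark.yarn.submit.waitAppCompletion", "false")
  .startApplication()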
On Wed, May 16, 2018 at 1:45 PM, Varma Dantuluri wrote:
Hi Spark-users,
I want to submit as many Spark applications as the resources permit. I am
using cluster mode on a YARN cluster. YARN can queue and launch these
applications without problems. The problem lies with spark-submit itself.
Spark-submit starts a JVM which could fail due to insufficient memory.
--
Regards,
Varma Dantuluri
Hi
I would go for a regular MySQL bulk load, i.e. writing output files that
mysql is able to load in one process. I'd say Spark JDBC is fine for
small fetches/loads; for large RDBMS transfers, the database's own
optimized bulk-load API turns out to be better than JDBC.
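A hedged sketch of that approach in Scala (the staging path and table name
are placeholders): dump the DataFrame as CSV, then bulk-load outside Spark:

// 1) write the rows out as CSV to a staging directory
df.write
  .option("header", "false")
  .csv("/staging/mysql_load")           // placeholder path

// 2) then, on the MySQL side (not in Spark), load each part file:
//    LOAD DATA LOCAL INFILE '/staging/mysql_load/part-00000-....csv'
//    INTO TABLE my_table FIELDS TERMINATED BY ',';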
2018-05-16 16:18 GMT+02:00 Vadim Semenov
Yes, the workaround is to create multiple StringIndexers as you described.
OneHotEncoderEstimator is only available from Spark 2.3.0; on earlier
versions you will have to use plain OneHotEncoder.
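A minimal sketch of that workaround for pre-2.3, with placeholder column
names, chaining one StringIndexer plus one OneHotEncoder per column:

import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}

val cols = Seq("colA", "colB")          // placeholder column names
val stages: Seq[PipelineStage] = cols.flatMap { c =>
  Seq(
    new StringIndexer().setInputCol(c).setOutputCol(s"${c}_idx"),
    new OneHotEncoder().setInputCol(s"${c}_idx").setOutputCol(s"${c}_vec"))
}
val model = new Pipeline().setStages(stages.toArray).fit(df)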
On Tue, May 15, 2018, 8:40 AM Mina Aslani wrote:
> Hi,
>
> So, what is the workaround? Should I create multiple indexers (one for each
Hello,
I have a structured streaming job that consumes messages from Kafka and does
some stateful associations using flatMapGroupsWithState. Every time I submit
the job, it runs fine for around 2 hours and then stops abruptly without any
error messages. All I can see in the debug logs is the message below
Upon downsizing to 20 partitions some of your partitions become too big,
and since you're doing caching, executors try to write the big partitions
to disk but fail, because a single cached block cannot exceed 2GiB:
> Caused by: java.lang.IllegalArgumentException: Size exceeds
Integer.MAX_VALUE
at sun.nio.ch.FileChannelImpl.map
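If that's the cause, a sketch of the usual workaround, in Scala for
illustration (the partition count is a placeholder to be tuned, and
connectionProperties is assumed defined): keep each partition well under
2GiB by repartitioning rather than coalescing so far down:

import org.apache.spark.sql.SaveMode

df.repartition(200)                     // placeholder count; full shuffle evens out skew
  .write
  .mode(SaveMode.Append)                // assumption: append semantics
  .jdbc(url, table, connectionProperties)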
Hi,
I am using Spark to read XML / JSON files into a DataFrame and save it
as a Hive table
Sample XML file (the tag markup was stripped by the mail archive; only the
values survive): 101, 45, COMMAND
Note field 'validation-timeout' under testexecutioncontroller.
Below is the schema the DataFrame gets after reading the XML file:
|-- id:
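For reference, a hedged sketch of reading XML like this in Scala, assuming
the external spark-xml package and guessing 'testexecutioncontroller' as the
row tag (the path and Hive table name are placeholders):

// requires the com.databricks:spark-xml package on the classpath
val df = spark.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "testexecutioncontroller")  // assumption based on the note above
  .load("/path/to/sample.xml")                  // placeholder path

df.write.saveAsTable("my_table")                // placeholder Hive table name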
I have been testing aggregations, but I seem to hit a wall on two
issues.
example:
val avg = areaStateDf.groupBy($"plantKey").avg("sensor")
1) How can I use the result from an aggregation within the same stream, to do
further calculations?
2) It seems to be very slow. If I want a moving window
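On (2), a minimal sketch of a sliding event-time window with a watermark,
assuming areaStateDf carries a timestamp column named eventTime (an
assumption) and a SparkSession named spark:

import org.apache.spark.sql.functions.window
import spark.implicits._

val movingAvg = areaStateDf
  .withWatermark("eventTime", "10 minutes")          // lets old state be dropped
  .groupBy(window($"eventTime", "10 minutes", "1 minute"), $"plantKey")
  .avg("sensor")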
Hi,
we have 2 million rows, on an EMR cluster of 8 m4.4xlarge machines with
100GB EBS storage.
Davide B.
Davide Brambilla
ContentWise R&D
ContentWise
davide.brambi...@contentwise.tv
We implemented a streaming query with aggregation on event-time with a
watermark. I'm wondering why the aggregation state is not cleaned up.
According to the documentation, old aggregation state should be cleared when
watermarks are used. We also don't see any condition [1] under which state
should not be cleaned up.
How many rows do you have in total?
> On 16. May 2018, at 11:36, Davide Brambilla wrote:
>
> Hi all,
>    we have a dataframe with 1000 partitions and we need to write the
> dataframe into a MySQL database using this command:
>
> df = df.coalesce(20)
> df.write.jdbc(url=url,
>               table=table,
>               mode=mode,
>               properties=properties)
Hi all,
we have a dataframe with 1000 partitions and we need to write the
dataframe into a MySQL database using this command:
df = df.coalesce(20)
df.write.jdbc(url=url,
              table=table,
              mode=mode,
              properties=properties)
and we get this error randomly:
java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
The first requirement would be that Scala supports them. Beyond that, someone
might need to redesign the Spark source code to take advantage of modules.
This could be a rather handy feature: a small but very well designed core
(core, ml, graph, etc.) around which others write useful modules.