Hi all,
I have IoT time-series data in Kafka and I am reading it into a static DataFrame as:
df = spark.read\
.format("kafka")\
.option("zookeeper.connect", "localhost:2181")\
.option("kafka.bootstrap.servers", "localhost:9092")\
.option("subscribe", "test_topic")\
.option("failOnDataLoss", "false")\
.optio
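For reference, a minimal batch read from Kafka in Scala looks roughly like this (the broker address and topic are placeholders; note that the Spark Kafka source only needs kafka.bootstrap.servers, not zookeeper.connect):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("kafka-batch-read").getOrCreate()

// Batch read of a Kafka topic; startingOffsets/endingOffsets bound the range that is read.
val df = spark.read
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "test_topic")
  .option("startingOffsets", "earliest")
  .option("endingOffsets", "latest")
  .option("failOnDataLoss", "false")
  .load()

// key and value come back as binary columns; cast them to strings to work with them.
val parsed = df.selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value", "timestamp")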
Hi Akash,
Glad to know that repartition helped!
The overall number of tasks actually depends on the kind of operations you are
performing and also on how the DataFrame is partitioned.
I can't comment on the former, but I can provide some pointers on the latter.
The default value of spark.sql.shuffle.partitions is 200.
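For example, that setting can be lowered, or a specific DataFrame can be repartitioned explicitly; a minimal sketch (the value 64 and the name df are only illustrative):

// Lower the number of partitions used for subsequent shuffles (the default is 200).
spark.conf.set("spark.sql.shuffle.partitions", "64")

// Or control the partitioning of a particular DataFrame directly.
val repartitioned = df.repartition(64)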
I don't know if this is the best way or not, but:
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.sql.functions.udf

// Index the variable-name column so each variable gets a numeric position.
val indexer = new StringIndexer().setInputCol("vr").setOutputCol("vrIdx")
val indexModel = indexer.fit(data)
val indexedData = indexModel.transform(data)
val variables = indexModel.labels.length
// UDF that packs an index/value pair into a sequence.
val toSeq = udf((a: Double, b: Double) => Seq(a, b))
I work with a lot of data in a long format, cases in which an ID column is
repeated, followed by a variable and a value column like so:
+----+-----+-------+
| ID | var | value |
+----+-----+-------+
| A  | v1  |   1.0 |
| A  | v2  |   2.0 |
| B  | v1  |   1.5 |
| B  | v3  |  -1.0 |
+----+-----+-------+
I
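The message is cut off above, but a common operation on data in this long format is pivoting it to one row per ID. Assuming that is the intent, a minimal sketch (the DataFrame name data and the column names are taken from the example):

import org.apache.spark.sql.functions.first

// Turn each distinct value of "var" into its own column, keeping one row per ID.
val wide = data.groupBy("ID").pivot("var").agg(first("value"))
wide.show()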
Is there a way to write rows directly into off-heap memory in the Tungsten
format, bypassing object creation?
I have a lot of rows; right now I'm creating objects and they get encoded,
but because of the number of rows this creates significant pressure on the
GC. I'd like to avoid creating objects
We use spark-testing-base for unit testing. These tests execute on a very
small amount of data that covers all the paths the code can take (or most
paths, anyway).
https://github.com/holdenk/spark-testing-base
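For illustration, a minimal unit test with spark-testing-base's DataFrameSuiteBase might look roughly like this (the class, columns, and data are made up, and the FunSuite import depends on your ScalaTest version):

import com.holdenkarau.spark.testing.DataFrameSuiteBase
import org.scalatest.funsuite.AnyFunSuite

class DoublingSpec extends AnyFunSuite with DataFrameSuiteBase {
  test("doubling a column") {
    import spark.implicits._
    // Tiny, hand-written input that exercises the code path under test.
    val input    = Seq(("A", 1.0), ("B", 2.5)).toDF("id", "value")
    val expected = Seq(("A", 2.0), ("B", 5.0)).toDF("id", "value")
    val result   = input.withColumn("value", input("value") * 2)
    assertDataFrameEquals(expected, result)
  }
}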
For integration testing we use automated routines to ensure that aggregate
values match an
Hi,
I wrote this answer to the same question a couple of years ago:
https://www.mail-archive.com/user%40spark.apache.org/msg48032.html
I have made a couple of presentations on the subject. Slides and video
are linked on this page: http://www.mapflat.com/presentations/
You can find more material
Hi Srinath,
Thanks for such an elaborate reply. How can I reduce the overall number of
tasks?
I found that simply repartitioning the CSV file into 8 parts and converting
it to Parquet with Snappy compression not only helped in distributing the
tasks evenly across all nodes, but also helped in bringi
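For reference, the repartition-and-convert step described above can be done roughly like this (paths are placeholders):

// Read the CSV, split it into 8 partitions, and write it out as Snappy-compressed Parquet.
spark.read
  .option("header", "true")
  .csv("/data/input.csv")
  .repartition(8)
  .write
  .option("compression", "snappy")
  .parquet("/data/input_parquet")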
Hello, can I do complex data manipulations inside a groupBy? I.e., I want to
group my whole DataFrame by a column and then do some processing for each
group.
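One way to run arbitrary per-group logic is the Dataset API's groupByKey/mapGroups; a minimal sketch (the case class, the readings DataFrame, and the column names are made up for illustration):

import spark.implicits._

case class Reading(device: String, value: Double)

// Group by device and run arbitrary Scala code over each group's rows.
val perGroup = readings.as[Reading]
  .groupByKey(_.device)
  .mapGroups { (device, rows) =>
    val values = rows.map(_.value).toSeq
    (device, values.sum / values.size)  // e.g. the mean value per device
  }
  .toDF("device", "mean_value")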
Hi Aakash,
Can you check the logs for Executor ID 0? It was restarted on worker
192.168.49.39 perhaps due to OOM or something.
I also observed that the number of tasks is high and unevenly distributed
across the workers.
Check if there are too many partitions in the RDD and tune it using
spark.sql
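If it helps, the current partition count and the per-partition row distribution can be inspected directly; a small sketch (df is a placeholder for your DataFrame):

// How many partitions the DataFrame currently has.
println(df.rdd.getNumPartitions)

// Rough row count per partition, to spot skew.
df.rdd
  .mapPartitionsWithIndex { (idx, rows) => Iterator((idx, rows.size)) }
  .collect()
  .foreach { case (idx, count) => println(s"partition $idx: $count rows") }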
Yes, but when I increased my executor memory, the Spark job halts after
running a few steps, even though the executor isn't dying.
Data - 60,000 data-points, 230 columns (60 MB data).
Any input on why it behaves like that?