Increase or decrease the number of data partitions: since a data partition
represents the quantum of data to be processed together by a single Spark
task, there can be situations:
(a) where the existing number of data partitions is not sufficient to
maximize the usage of available resources
A much better one-liner (easier to understand in the UI because it will be
one simple job with two stages):
```
spark.read.text("README.md").repartition(2).take(1)
```
Attila Zsolt Piros wrote:

No, it won't be reused.
You should reuse the dataframe for reusing the shuffle blocks (and cached
data).
I know this because the two actions will build two separate DAGs, but I
will show you how you can check this on your own (with a small, simple
Spark application).
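For intuition while waiting for that check: a toy model (this is not Spark code or Spark's actual API, just an illustration I am adding) of a cache keyed by object identity rather than structural equality, which mimics the behaviour Attila describes — running the same code twice builds equal but distinct objects, so nothing is shared:

```python
# Toy model of identity-based re-use; none of these names exist in Spark.
shuffle_cache = {}  # keyed by id() of the "dataframe", i.e. object identity

def run_with_shuffle(df, compute):
    """Pretend to run an action; re-use "shuffle files" only for the same object."""
    key = id(df)
    if key in shuffle_cache:
        return shuffle_cache[key], True   # stages skipped
    result = compute(df)
    shuffle_cache[key] = result
    return result, False                  # full recompute

# Two structurally identical "dataframes" built by running the same code twice:
df1 = [3, 1, 2]
df2 = [3, 1, 2]
assert df1 == df2 and df1 is not df2      # equal, but distinct objects

_, reused = run_with_shuffle(df1, sorted)
print(reused)  # False: the first action always computes
_, reused = run_with_shuffle(df2, sorted)
print(reused)  # False: equal but distinct object, nothing is shared
_, reused = run_with_shuffle(df1, sorted)
print(reused)  # True: same object re-used, so the "stages" are skipped
```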
For this
Hi,

An interesting question that I must admit I'm not sure how to answer
myself, actually :)

Off the top of my head, I'd **guess** that unless you cache the first
query, these two queries would share nothing. With caching, there's a phase
in query execution when a canonicalized version of a query is used
is shuffle file re-use based on identity or equality of the dataframe?
for example if I run the exact same code twice to load data and do transforms
(joins, aggregations, etc.) but without re-using any actual dataframes,
will i still see skipped stages thanks to shuffle file re-use?
thanks!
koert