There is a trade-off involved here. If you have a Spark application with a 
complicated logical graph, you can either cache data at certain points in the 
DAG or not cache it at all. The cost of caching is higher memory usage; the 
cost of not caching is higher CPU usage and, possibly, more IO. Ultimately, 
you can add both memory and CPU by adding more workers to your cluster, and 
workers cost money. So your caching choices are reflected in the overall cost 
of running your application, and you need to do some analysis to determine 
the caching configuration that will result in the lowest cost. Usually, being 
selective about which DataFrames to cache strikes a good balance between 
memory usage and CPU usage.
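
For example, here is a minimal sketch of that selective approach (the input 
paths, column names, and the join itself are made up for illustration): only 
the expensive DataFrame that several downstream actions reuse gets persisted, 
and it is released once the outputs are written.

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.storage.StorageLevel

  val spark = SparkSession.builder().appName("selective-caching").getOrCreate()

  // Hypothetical inputs; neither is worth caching on its own.
  val orders    = spark.read.parquet("s3a://my-bucket/orders")
  val customers = spark.read.parquet("s3a://my-bucket/customers")

  // The join is the expensive branch point that two outputs reuse,
  // so it is the one DataFrame worth persisting.
  val enriched = orders.join(customers, "customer_id")
    .persist(StorageLevel.MEMORY_AND_DISK)

  val byCountry = enriched.groupBy("country").count()  // reuse #1
  val byMonth   = enriched.groupBy("month").count()    // reuse #2
  byCountry.write.parquet("s3a://my-bucket/out/by_country")
  byMonth.write.parquet("s3a://my-bucket/out/by_month")

  // Free the memory once the reused result is no longer needed.
  enriched.unpersist()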

I would not make a practice of writing data back to S3 and reading it back in. 
Essentially, you would be using S3 as a "cache". However, reading and writing 
from S3 is not a scalable solution because it results in higher IO, and IO 
doesn't scale up as easily as CPU and memory. The only time I would use S3 as 
a cache is when my cached data is in the terabyte+ range. If you are caching 
gigabytes of data, you are better off caching in memory. This is 2018. Memory 
is cheap but limited.
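
As a rough sketch of that distinction (the DataFrame names and the bucket path 
are placeholders, and the size threshold is a rule of thumb, not a measurement):

  import org.apache.spark.storage.StorageLevel

  // Gigabyte-scale intermediate: keep it on the cluster, spilling to disk if needed.
  val hot = someExpensiveDf.persist(StorageLevel.MEMORY_AND_DISK)
  hot.count()  // materialize once; later actions reuse the persisted blocks

  // Terabyte-scale intermediate: using S3 as a "cache" by writing it out and reading it back.
  someHugeDf.write.mode("overwrite").parquet("s3a://my-bucket/tmp/huge_stage")
  val hugeReloaded = spark.read.parquet("s3a://my-bucket/tmp/huge_stage")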

From: Valery Khamenya <khame...@gmail.com>
Date: Tuesday, May 1, 2018 at 9:17 AM
To: "user@spark.apache.org" <user@spark.apache.org>
Subject: smarter way to "forget" DataFrame definition and stick to its values

hi all

a short example before the long story:

  var accumulatedDataFrame = ... // initialize

  for (i <- 1 to 100) {
    val myTinyNewData = ... // my slowly calculated new data portion in tiny amounts
    accumulatedDataFrame = accumulatedDataFrame.union(myTinyNewData)
    // how to stick here to the values of accumulatedDataFrame only and forget definitions?!
  }

this kind of stuff is likely to get slower and slower on each iteration, even 
if myTinyNewData is quite compact. Usually I write accumulatedDataFrame to S3 
and then re-load it back to clear the definition history. It makes the code 
ugly, though. Is there any smarter way?
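
For reference, the workaround described above looks roughly like this (the 
bucket path, the every-10-iterations cadence, and computeNextPortion are only 
illustrative):

  for (i <- 1 to 100) {
    val myTinyNewData = computeNextPortion(i)  // placeholder for the slow calculation
    accumulatedDataFrame = accumulatedDataFrame.union(myTinyNewData)

    if (i % 10 == 0) {
      // Write the accumulated result out and read it back in, so the re-loaded
      // DataFrame no longer carries the whole chain of union definitions.
      val tmpPath = s"s3a://my-bucket/tmp/accumulated_$i"
      accumulatedDataFrame.write.mode("overwrite").parquet(tmpPath)
      accumulatedDataFrame = spark.read.parquet(tmpPath)
    }
  }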

It happens very often that a DataFrame is created via complex definitions. The 
DataFrame is then re-used in several places, and sometimes it gets recalculated, 
triggering a heavy cascade of operations.

Of course one could use the .persist or .cache modifiers, but the result is 
unfortunately not transparent: instead of speeding things up, it can cause 
slowdowns or even lost jobs if storage resources are insufficient.

Any advice?

best regards
--
Valery
