There is a trade-off involved here. If you have a Spark application with a complicated logical graph, you can either cache data at certain points in the DAG, or not cache it. The side effect of caching data is higher memory usage. The side effect of not caching data is higher CPU usage and, possibly, more IO. Ultimately, you can increase both memory and CPU by adding more workers to your cluster, and adding workers costs money. So, your caching choices are reflected in the overall cost of running your application. You need to do some analysis to determine the caching configuration that will result in the lowest cost. Usually, being selective about which DataFrames to cache results in a good balance between memory usage and CPU usage.
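A hedged sketch of what "being selective" can look like: cache only the intermediate DataFrame that multiple downstream branches re-use, and release it when every consumer has run. The input path, column names, and the `spark` session are illustrative assumptions, not from the thread.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder.appName("selective-caching").getOrCreate()
import spark.implicits._

// Hypothetical input; any expensive-to-compute DataFrame works the same way.
val raw = spark.read.parquet("s3://my-bucket/events/")

// This intermediate result feeds several downstream branches, so it is a
// good caching candidate; the raw input is scanned once and is not cached.
val enriched = raw.filter($"status" === "ok").groupBy($"userId").count()
enriched.persist(StorageLevel.MEMORY_AND_DISK)

val topUsers  = enriched.orderBy($"count".desc).limit(100) // branch 1
val histogram = enriched.groupBy($"count").count()         // branch 2

topUsers.show()
histogram.show()

// Release the memory once every consumer of `enriched` has run.
enriched.unpersist()
```

MEMORY_AND_DISK (rather than the default MEMORY_ONLY) lets partitions spill to disk instead of being recomputed when memory is tight, which is one way to soften the memory/CPU trade-off described above.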
I would not write data back to S3 and read it back in as a practice. Essentially, you are using S3 as a "cache". However, reading and writing from S3 is not a scalable solution, because it results in higher IO, and IO doesn't scale up as easily as CPU and memory. The only time I would use S3 as a cache is when my cached data is in the terabyte+ range. If you are caching gigabytes of data, then you are better off caching in memory. This is 2018. Memory is cheap but limited.

From: Valery Khamenya <khame...@gmail.com>
Date: Tuesday, May 1, 2018 at 9:17 AM
To: "user@spark.apache.org" <user@spark.apache.org>
Subject: smarter way to "forget" DataFrame definition and stick to its values

hi all,

a short example before the long story:

    var accumulatedDataFrame = ... // initialize
    for (i <- 1 to 100) {
      val myTinyNewData = ... // my slowly calculated new data portion in tiny amounts
      accumulatedDataFrame = accumulatedDataFrame.union(myTinyNewData)
      // how to stick here to the values of accumulatedDataFrame only and forget definitions?!
    }

This kind of stuff is likely to get slower and slower on each iteration, even if myTinyNewData is quite compact. Usually I write accumulatedDataFrame to S3 and then re-load it back to clear the definition history. It makes the code ugly, though. Is there any smarter way?

It happens very often that a DataFrame is created via complex definitions. The DataFrame is then re-used in several places, and sometimes it gets recalculated, triggering a heavy cascade of operations. Of course one could use the .persist or .cache modifiers, but the result is unfortunately not transparent, and instead of speeding things up it results in slow-downs or even lost jobs if storage resources are not enough.

Any advice?
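One in-memory alternative to the S3 round trip described in the question is Spark's checkpointing, which materializes the values and truncates the accumulated lineage so the query plan stops growing. The sketch below is an assumption-laden illustration of the question's loop (the toy `Seq`-based data stands in for the "slowly calculated" portion); `localCheckpoint` stores the data on the executors, while the reliable `checkpoint` variant would additionally need `spark.sparkContext.setCheckpointDir(...)`.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder.appName("lineage-truncation").getOrCreate()
import spark.implicits._

var accumulated: DataFrame = Seq.empty[(Int, String)].toDF("id", "value")

for (i <- 1 to 100) {
  // Stand-in for the slowly calculated tiny data portion from the question.
  val myTinyNewData = Seq((i, s"row-$i")).toDF("id", "value")
  accumulated = accumulated.union(myTinyNewData)

  // Every few iterations, materialize the values and drop the union
  // lineage built up so far, keeping plan size (and planning time) bounded.
  if (i % 10 == 0) {
    accumulated = accumulated.localCheckpoint()
  }
}
```

Unlike .cache, which keeps the full definition history alongside the cached blocks, checkpointing replaces the plan with the materialized data, which is closer to "sticking to the values" as the question puts it.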
best regards
--
Valery