When you cache a DataFrame, you actually cache its logical plan. That's why re-creating the DataFrame doesn't help: Spark sees that the logical plan is already cached and serves the cached data.
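For example, roughly (an untested sketch, reusing the Delta path from your query and assuming a spark-shell session):

import org.apache.spark.sql.functions.col

// Cache the aggregated DataFrame and materialize the cache.
val df1 = spark.read.format("delta").load("/data").groupBy(col("event_hour")).count
df1.cache()
df1.count()

// Building the "same" DataFrame again yields an identical logical plan,
// so Spark's cache manager matches it and serves the cached rows instead
// of re-reading the files on disk.
val df2 = spark.read.format("delta").load("/data").groupBy(col("event_hour")).count
df2.show()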
You need to uncache the DataFrame, or go back to the SQL way:

df.createTempView("abc")
spark.table("abc").cache()
df.show                  // returns latest data
spark.table("abc").show  // returns cached data

On Mon, May 20, 2019 at 3:33 AM Tomas Bartalos <tomas.barta...@gmail.com> wrote:

> I'm trying to re-read, however I'm getting cached data (which is a bit
> confusing). For the re-read I'm issuing:
> spark.read.format("delta").load("/data").groupBy(col("event_hour")).count
>
> The cache seems to be global, influencing also new dataframes.
>
> So the question is: how should I re-read without losing the cached data
> (without using unpersist)?
>
> As I mentioned, with SQL it's possible - I can create a cached view, so when
> I access the original table I get live data, and when I access the view I
> get cached data.
>
> BR,
> Tomas
>
> On Fri, 17 May 2019, 8:57 pm Sean Owen, <sro...@gmail.com> wrote:
>
>> A cached DataFrame isn't supposed to change, by definition.
>> You can re-read each time, or consider setting up a streaming source on
>> the table, which provides a result that updates as new data comes in.
>>
>> On Fri, May 17, 2019 at 1:44 PM Tomas Bartalos <tomas.barta...@gmail.com>
>> wrote:
>> >
>> > Hello,
>> >
>> > I have a cached dataframe:
>> >
>> > spark.read.format("delta").load("/data").groupBy(col("event_hour")).count.cache
>> >
>> > I would like to access the "live" data for this data frame without
>> > deleting the cache (using unpersist()). Whatever I do, I always get the
>> > cached data on subsequent queries. Even adding a new column to the query
>> > doesn't help:
>> >
>> > spark.read.format("delta").load("/data").groupBy(col("event_hour")).count.withColumn("dummy", lit("dummy"))
>> >
>> > I'm able to work around this using a cached SQL view, but I couldn't find
>> > a pure DataFrame solution.
>> >
>> > Thank you,
>> > Tomas