[ https://issues.apache.org/jira/browse/SPARK-20559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean Owen resolved SPARK-20559.
-------------------------------
Resolution: Invalid
This should go to [email protected]
> Refreshing a cached RDD without restarting the Spark application
> ----------------------------------------------------------------
>
> Key: SPARK-20559
> URL: https://issues.apache.org/jira/browse/SPARK-20559
> Project: Spark
> Issue Type: Question
> Components: Spark Core
> Affects Versions: 2.1.0
> Reporter: Jayesh lalwani
>
> We have a Structured Streaming application that gets accounts from Kafka into
> a streaming data frame. We have a blacklist of accounts stored in S3 and we
> want to filter out all the accounts that are blacklisted. So, we are loading
> the blacklisted accounts into a batch data frame and joining it with the
> streaming data frame to filter out the bad accounts.
> The blacklist doesn't change very often, once a week at most. So we
> wanted to cache the blacklist data frame to avoid going out to S3
> every time. Since the blacklist might change, we want to be able to refresh
> the cache at a regular cadence without restarting the whole app.
> To begin with, we wrote a simple app that caches and refreshes a simple
> data frame. The steps we followed are:
> * Create a CSV file.
> * Load the CSV into a DF: df = spark.read.csv(filename)
> * Persist the data frame: df.persist
> * Call df.show; we see the contents of the CSV.
> * Change the CSV and call df.show again; the old contents are still
> displayed, showing that the df is cached.
> * df.unpersist
> * df.persist
> * df.show
> * What we see is that the rows that were modified in the CSV are reloaded,
> but new rows aren't.
> Is this expected behavior? Is there a better way to refresh cached data
> without restarting the Spark application?
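
For reference, one possible way to approach the refresh described above is to build a new data frame from the source on every refresh instead of calling unpersist and persist on the same df object. The following is only a minimal sketch under that assumption, not a confirmed answer from this thread, and it has not been checked against Spark 2.1.0. The names RefreshableFrame, load, max_age_seconds and clock are hypothetical helpers introduced for illustration. The only calls it makes on the frame are persist() and unpersist(). It covers the simple batch experiment only; how a running streaming query would pick up the replacement frame is not addressed here.

```python
import time
from typing import Any, Callable


class RefreshableFrame:
    """Hypothetical helper: keeps one persisted frame and rebuilds it when stale.

    `load` is any zero-argument callable that returns a new frame object
    exposing persist() and unpersist(), for example a function that calls
    spark.read.csv(filename) again.
    """

    def __init__(self, load: Callable[[], Any], max_age_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._load = load
        self._max_age = max_age_seconds
        self._clock = clock
        self._frame = None
        self._loaded_at = 0.0

    def get(self) -> Any:
        """Return the cached frame, reloading it first if it is missing or too old."""
        if self._frame is None or self._clock() - self._loaded_at >= self._max_age:
            self.refresh()
        return self._frame

    def refresh(self) -> Any:
        """Build a new frame from the source, persist it, then release the old one."""
        fresh = self._load().persist()
        old, self._frame = self._frame, fresh
        self._loaded_at = self._clock()
        if old is not None:
            old.unpersist()
        return fresh
```

With an existing SparkSession named spark, as in the steps above, this might be used as RefreshableFrame(lambda: spark.read.csv(filename), max_age_seconds=7 * 24 * 3600), followed by .get().show(). The design choice is that the old frame is released only after its replacement has been created, so callers always have a frame to work with.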