[ 
https://issues.apache.org/jira/browse/SPARK-20559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-20559.
-------------------------------
    Resolution: Invalid

This should go to [email protected]

> Refreshing a cached RDD without restarting the Spark application
> ----------------------------------------------------------------
>
>                 Key: SPARK-20559
>                 URL: https://issues.apache.org/jira/browse/SPARK-20559
>             Project: Spark
>          Issue Type: Question
>          Components: Spark Core
>    Affects Versions: 2.1.0
>            Reporter: Jayesh lalwani
>
> We have a Structured Streaming application that gets accounts from Kafka into 
> a streaming data frame. We have a blacklist of accounts stored in S3 and we 
> want to filter out all the accounts that are blacklisted. So, we are loading 
> the blacklisted accounts into a batch data frame and joining it with the 
> streaming data frame to filter out the bad accounts.
> Now, the blacklist doesn't change very often: once a week at most. So, we 
> wanted to cache the blacklist data frame to avoid going out to S3 
> every time. Since the blacklist might change, we want to be able to refresh 
> the cache at a cadence, without restarting the whole app.
> So, to begin with, we wrote a simple app that caches and refreshes a simple 
> data frame. The steps we followed are:
> * Create a CSV file
> * load CSV into a DF: df = spark.read.csv(filename)
> * Persist the data frame: df.persist
> * Now when we do df.show, we see the contents of the csv.
> * We change the CSV and call df.show again. The old contents are still 
> displayed, proving that the df is cached
> * df.unpersist
> * df.persist
> * df.show
> * What we see is that the rows that were modified in the CSV are reloaded, 
> but new rows aren't
> Is this expected behavior? Is there a better way to refresh cached data 
> without restarting the Spark application?
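
The refresh-at-a-cadence idea in the description can be sketched as a small holder that rebuilds its value from the source once it is older than a chosen age. This is a minimal sketch under stated assumptions, not a confirmed fix for the behavior reported above: RefreshingCache, loader, release, ttl_seconds and clock are hypothetical names introduced only for illustration, and the Spark wiring appears only in comments so that the block runs without a Spark installation.

```python
import time
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")


class RefreshingCache(Generic[T]):
    """Hold a loaded value and rebuild it once it is older than ttl_seconds.

    Hypothetical helper for illustration; it is not part of Spark.
    """

    def __init__(
        self,
        loader: Callable[[], T],
        ttl_seconds: float,
        release: Optional[Callable[[T], None]] = None,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._loader = loader          # builds a brand-new value from the source
        self._ttl = ttl_seconds        # maximum age before a reload
        self._release = release        # optional cleanup for the replaced value
        self._clock = clock            # injectable so the cadence can be tested
        self._value: Optional[T] = None
        self._loaded_at: Optional[float] = None

    def get(self) -> T:
        now = self._clock()
        expired = self._loaded_at is None or now - self._loaded_at >= self._ttl
        if expired:
            fresh = self._loader()     # load first, so a failure keeps the old value
            old, self._value = self._value, fresh
            self._loaded_at = now
            if old is not None and self._release is not None:
                self._release(old)     # e.g. df.unpersist() for a Spark DataFrame
        return self._value             # type: ignore[return-value]


# Assumed Spark wiring (requires pyspark and an existing `spark` session and
# `filename`; shown as comments because it is not executed here):
#
#   blacklist = RefreshingCache(
#       loader=lambda: spark.read.csv(filename).persist(),
#       ttl_seconds=3600,
#       release=lambda df: df.unpersist(),
#   )
#   df = blacklist.get()   # returns a newly read DataFrame after each hour
```

The design choice here is to create a new data frame on each refresh instead of calling unpersist and persist on the same one, so the refresh does not depend on whatever file metadata the original data frame captured when it was first created. Whether that accounts for the missing new rows is an assumption; this thread does not confirm it.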



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

