I have a SparkSQL DataFrame with a few billion rows that I need to quickly filter down to a few hundred thousand rows, using an operation like this (syntax may not be exact):

    df = df.filter(df.key_col.isin(approved_keys))

I was thinking about serializing the data to Parquet and saving it to S3, but since I want to optimize for filtering speed I'm not sure that's the best option.

-- Stuart Layton
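For concreteness, here is a minimal self-contained sketch of the operation described above, assuming PySpark; `key_col` and `approved_keys` are the names from the question, and the session setup and toy data are purely illustrative:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("key-filter-sketch").getOrCreate()

    # Toy stand-in for the billions-of-rows table from the question.
    df = spark.createDataFrame(
        [(1, "a"), (2, "b"), (3, "c"), (4, "d")],
        ["key_col", "value"],
    )

    # The approved key set; here just a small illustrative list.
    approved_keys = [2, 4]

    # Column-level filter expressed in the DataFrame API. Unlike a Python
    # lambda over rows, this predicate is visible to the Catalyst optimizer,
    # which can push it down into the data source scan (e.g., Parquet
    # row-group filtering), which matters when optimizing for filter speed.
    filtered = df.filter(F.col("key_col").isin(approved_keys))
    filtered.show()

The key design point in this sketch is expressing the membership test as a column expression (`isin`) rather than a row-by-row Python function, so the filter can be evaluated inside the engine rather than in a Python callback.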