I have a Spark SQL DataFrame with a few billion rows that I need to
quickly filter down to a few hundred thousand rows, using an operation
roughly like this:

df = df.filter(df.key_col.isin(approved_keys))
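
Concretely, a minimal runnable sketch of what I have in mind is below (the
SparkSession setup, the key_col column name, and the tiny placeholder data
are just illustrative); I've also seen a broadcast join against a small
keys DataFrame suggested as an alternative to a very large IN list:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, broadcast

spark = SparkSession.builder.appName("key-filter").getOrCreate()

# Placeholder inputs; in reality df has a few billion rows and
# approved_keys has a few hundred thousand entries.
approved_keys = ["k1", "k2", "k3"]
df = spark.createDataFrame([("k1", 1), ("k2", 2), ("k9", 9)],
                           ["key_col", "value"])

# Membership filter with isin(); the key list is embedded in the plan.
filtered = df.filter(col("key_col").isin(approved_keys))

# Alternative: join against a small keys DataFrame and broadcast it,
# which tends to behave better than an enormous IN list.
keys_df = spark.createDataFrame([(k,) for k in approved_keys], ["key_col"])
filtered_via_join = df.join(broadcast(keys_df), on="key_col", how="inner")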

I was thinking about serializing the data as Parquet and saving it to
S3; however, since I want to optimize for filtering speed, I'm not sure
this is the best option.
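
For reference, the Parquet-on-S3 round trip I have in mind would look
roughly like this (the s3a:// bucket and path are placeholders, and I'm
reusing spark, df, and approved_keys from the sketch above); my
understanding is that Parquet's columnar layout means only the needed
columns get scanned, but I don't know whether that makes the key filter
itself any faster:

from pyspark.sql.functions import col

# The bucket and path below are hypothetical placeholders.
df.write.mode("overwrite").parquet("s3a://my-bucket/snapshot/")

snapshot = spark.read.parquet("s3a://my-bucket/snapshot/")
filtered = snapshot.filter(col("key_col").isin(approved_keys))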

-- 
Stuart Layton
