I have a SparkSQL DataFrame with a few billion rows that I need to quickly filter down to a few hundred thousand rows, using an operation like this (syntax may not be exact):

    df = df.filter(df.key_col.isin(approved_keys))

I was thinking about serializing the data to Parquet and saving it to S3, but since I want to optimize for filtering speed I'm not sure that's the best option.

-- Stuart Layton
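For concreteness, here is a minimal self-contained sketch of the operation described above, assuming PySpark; `key_col` and `approved_keys` are the names from the question, and the session setup and toy data are purely illustrative:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("key-filter-sketch").getOrCreate()

    # Toy stand-in for the billions-of-rows table from the question.
    df = spark.createDataFrame(
        [(1, "a"), (2, "b"), (3, "c"), (4, "d")],
        ["key_col", "value"],
    )

    # The approved key set; here just a small illustrative list.
    approved_keys = [2, 4]

    # Column-level filter expressed in the DataFrame API. Unlike a Python
    # lambda over rows, this predicate is visible to the Catalyst optimizer,
    # which can push it down into the data source scan (e.g., Parquet
    # row-group filtering), which matters when optimizing for filter speed.
    filtered = df.filter(F.col("key_col").isin(approved_keys))
    filtered.show()

The key design point in this sketch is expressing the membership test as a column expression (`isin`) rather than a row-by-row Python function, so the filter can be evaluated inside the engine rather than in a Python callback.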