Hi all! I'm having a strange issue with PySpark 1.6.1. I have a DataFrame,

    df = sqlContext.read.parquet('/path/to/data')

whose df.take(10) is really slow, apparently scanning the whole dataset to take the first ten rows. df.first() is fast, as is df.rdd.take(10). I found https://issues.apache.org/jira/browse/SPARK-10731, which should have fixed this in 1.6.0, but apparently it hasn't.

What am I doing wrong here, and how can I fix this?

Cheers,
immerrr
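For reference, a minimal sketch of what I'm seeing (the path is a placeholder, and the context setup assumes a plain Spark 1.6.x install):

```python
# Sketch reproducing the slow take(10) on a parquet-backed DataFrame.
# '/path/to/data' is a placeholder; appName is arbitrary.
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="take-repro")
sqlContext = SQLContext(sc)

df = sqlContext.read.parquet('/path/to/data')

df.first()       # fast: returns a single Row
df.rdd.take(10)  # fast: RDD-level take scans partitions incrementally
df.take(10)      # slow in 1.6.1: appears to scan the whole dataset

# Since df.rdd.take(10) is fast, falling back to it is the obvious
# workaround for now, at the cost of getting RDD rows instead of a
# DataFrame-level operation.
rows = df.rdd.take(10)

sc.stop()
```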