[Spark Core] excessive read/load times on parquet files in 2.2 vs 2.0

Matthew Anthony Fri, 08 Sep 2017 08:46:47 -0700

Hi all -

Since upgrading to 2.2.0, we've noticed a significant increase in the time taken by read.parquet(...) ops. The parquet files are being read from S3. Upon entry at the interactive terminal (pyspark in this case), the terminal will sit "idle" for several minutes (as many as 10) before returning:

"17/09/08 15:34:37 WARN SharedInMemoryCache: Evicting cached table partition metadata from memory due to size constraints (spark.sql.hive.filesourcePartitionFileCacheSize = 2000000000 bytes). This may impact query planning performance."

In the Spark UI, there are no jobs being run during this idle period. Subsequently, a short 1-task job lasting approximately 10 seconds runs, and then another idle period of roughly 2-3 minutes follows before control returns to the terminal/CLI.
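
A minimal sketch of how the delay described above could be timed from the pyspark shell. The helper name `timed` and the S3 path in the usage comment are hypothetical placeholders, not the actual job or dataset; `spark` is assumed to be the session object the pyspark shell provides.

```python
from timeit import default_timer


def timed(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = default_timer()
    result = fn(*args, **kwargs)
    return result, default_timer() - start


# Hypothetical usage in the pyspark shell (placeholder path, not the real data):
#   df, secs = timed(spark.read.parquet, "s3://example-bucket/path/to/table/")
#   print("read.parquet returned after %.1f s" % secs)
```

This measures the whole call from the driver's side, so it includes the idle periods during which the Spark UI shows no running jobs.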

Can someone explain what is happening here in the background? Is there a misconfiguration we should be looking for? We are using Hive metastore on the EMR cluster.


