Michael Armbrust created SPARK-2119:
---------------------------------------

Summary: Reading Parquet InputSplits dominates query execution time when reading off S3
Key: SPARK-2119
URL: https://issues.apache.org/jira/browse/SPARK-2119
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.0.0
Reporter: Michael Armbrust
Priority: Critical

Here's the relevant stack trace where things are hanging:

{code}
at org.apache.hadoop.fs.s3native.NativeS3FileSystem.getFileStatus(NativeS3FileSystem.java:326)
at parquet.hadoop.ParquetInputFormat.getSplits(ParquetInputFormat.java:370)
at parquet.hadoop.ParquetInputFormat.getSplits(ParquetInputFormat.java:344)
at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:90)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:201)
{code}

We should parallelize these getFileStatus calls or cache their results.
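
As a rough illustration of the "parallelize or cache" idea (not a patch against Spark or Parquet), the sketch below issues slow per-path lookups from a thread pool and memoizes the results. The class name CachedParallelStatusLookup, the lookup function, and the parallelism parameter are all hypothetical; the lookup function only stands in for a slow call such as getFileStatus against S3.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

/** Hypothetical helper: runs slow per-path lookups concurrently and caches the results. */
final class CachedParallelStatusLookup<S> {
    // Cache keyed by path; computeIfAbsent runs the lookup at most once per key.
    private final Map<String, S> cache = new ConcurrentHashMap<>();
    // Assumption: stand-in for a slow remote call such as getFileStatus. Must not return null.
    private final Function<String, S> lookup;
    private final int parallelism;

    CachedParallelStatusLookup(Function<String, S> lookup, int parallelism) {
        this.lookup = lookup;
        this.parallelism = parallelism;
    }

    /** Returns one status per input path, in the same order as the input list. */
    List<S> statuses(List<String> paths) {
        ExecutorService pool = Executors.newFixedThreadPool(parallelism);
        try {
            // Submit every lookup first so they can overlap instead of running one by one.
            List<Future<S>> futures = new ArrayList<>();
            for (String path : paths) {
                Callable<S> task = () -> cache.computeIfAbsent(path, lookup);
                futures.add(pool.submit(task));
            }
            // Collect results in submission order.
            List<S> result = new ArrayList<>();
            for (Future<S> future : futures) {
                result.add(future.get());
            }
            return result;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException(e);
        } catch (ExecutionException e) {
            throw new RuntimeException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

With a high-latency store, the wall-clock cost would then be closer to the number of paths divided by the parallelism than to their sum, and repeated calls for the same path are served from the cache. Whether a cache is safe depends on how often the underlying files change, which this sketch does not address.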