[ https://issues.apache.org/jira/browse/SPARK-2119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Michael Armbrust updated SPARK-2119:
------------------------------------
Target Version/s: 1.1.0
> Reading Parquet InputSplits dominates query execution time when reading off S3
> ------------------------------------------------------------------------------
>
> Key: SPARK-2119
> URL: https://issues.apache.org/jira/browse/SPARK-2119
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.0.0
> Reporter: Michael Armbrust
> Assignee: Cheng Lian
> Priority: Critical
>
> Here's the relevant stack trace where things are hanging:
> {code}
> at org.apache.hadoop.fs.s3native.NativeS3FileSystem.getFileStatus(NativeS3FileSystem.java:326)
> at parquet.hadoop.ParquetInputFormat.getSplits(ParquetInputFormat.java:370)
> at parquet.hadoop.ParquetInputFormat.getSplits(ParquetInputFormat.java:344)
> at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:90)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:201)
> {code}
> We should parallelize the getFileStatus calls made from getSplits, or cache their results, or something along those lines.
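
For illustration only, here is a minimal sketch of the "parallelize or cache" idea from the description. It is not the fix that went into Spark or Parquet and it uses no Hadoop classes. The class `ParallelStatusLookup`, its method `lookupAll`, and the `slowLookup` function are hypothetical names; `slowLookup` stands in for a slow per-path metadata call such as the `getFileStatus` frame in the trace above, and the `Long` result is an arbitrary placeholder for whatever status object a real call would return.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

/** Hypothetical helper; not part of Spark, Parquet, or Hadoop. */
public class ParallelStatusLookup {
    // Assumption: stand-in for a slow remote metadata call (one round trip per path).
    private final Function<String, Long> slowLookup;
    private final int threads;
    // Results from earlier lookups, so a repeated path costs no extra round trip.
    private final Map<String, Long> cache = new ConcurrentHashMap<>();

    public ParallelStatusLookup(Function<String, Long> slowLookup, int threads) {
        this.slowLookup = slowLookup;
        this.threads = threads;
    }

    /** Looks up every distinct path, running uncached lookups concurrently. */
    public Map<String, Long> lookupAll(List<String> paths)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Submit one task per distinct path; duplicates in the input are skipped.
            Map<String, Future<Long>> pending = new LinkedHashMap<>();
            for (String path : paths) {
                if (pending.containsKey(path)) {
                    continue;
                }
                Callable<Long> task = () -> {
                    Long cached = cache.get(path);
                    if (cached != null) {
                        return cached;
                    }
                    Long fresh = slowLookup.apply(path);
                    cache.put(path, fresh);
                    return fresh;
                };
                pending.put(path, pool.submit(task));
            }
            // Collect results in first-seen order, waiting for each task to finish.
            Map<String, Long> result = new LinkedHashMap<>();
            for (Map.Entry<String, Future<Long>> entry : pending.entrySet()) {
                result.put(entry.getKey(), entry.getValue().get());
            }
            return result;
        } finally {
            pool.shutdown();
        }
    }
}
```

With a fixed pool of N threads, the wall-clock cost of the lookups drops from roughly one round trip per path to roughly one round trip per N paths, and repeated calls for the same paths are served from the in-memory map. Whether a cache like this is safe depends on how often the underlying files change, which the sketch does not address.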
--
This message was sent by Atlassian JIRA
(v6.2#6252)