Michael Armbrust created SPARK-2119:
---------------------------------------

             Summary: Reading Parquet InputSplits dominates query execution time when reading off S3
                 Key: SPARK-2119
                 URL: https://issues.apache.org/jira/browse/SPARK-2119
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 1.0.0
            Reporter: Michael Armbrust
            Priority: Critical


Here's the relevant stack trace where things are hanging:

{code}
        at org.apache.hadoop.fs.s3native.NativeS3FileSystem.getFileStatus(NativeS3FileSystem.java:326)
        at parquet.hadoop.ParquetInputFormat.getSplits(ParquetInputFormat.java:370)
        at parquet.hadoop.ParquetInputFormat.getSplits(ParquetInputFormat.java:344)
        at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:90)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:201)
{code}

We should parallelize these getFileStatus calls, cache their results, or something along those lines.
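
As a rough illustration of the "parallelize or cache" idea, here is a minimal Java sketch. It is not the actual fix and not Spark, Parquet, or Hadoop code: the class CachedParallelStatusLookup and its lookup function are hypothetical, with the function standing in for a slow per-file metadata call such as getFileStatus. Lookups run on a fixed thread pool, and results are memoized so that a repeated path is not fetched twice.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical helper for illustration only; not part of Spark, Parquet, or Hadoop.
class CachedParallelStatusLookup<S> {
    // Stand-in for a slow per-file call (assumption: it never returns null).
    private final Function<String, S> lookup;
    // Memoized results, keyed by path.
    private final Map<String, S> cache = new ConcurrentHashMap<>();
    private final int parallelism;

    CachedParallelStatusLookup(Function<String, S> lookup, int parallelism) {
        this.lookup = lookup;
        this.parallelism = parallelism;
    }

    // Returns one status per path, in the same order as the input list.
    List<S> statuses(List<String> paths) {
        ExecutorService pool = Executors.newFixedThreadPool(parallelism);
        try {
            List<Future<S>> futures = new ArrayList<>();
            for (String path : paths) {
                // computeIfAbsent invokes the lookup at most once per key.
                Callable<S> task = () -> cache.computeIfAbsent(path, lookup);
                futures.add(pool.submit(task));
            }
            List<S> result = new ArrayList<>();
            for (Future<S> future : futures) {
                result.add(future.get()); // blocks until that lookup finishes
            }
            return result;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException(e);
        } catch (ExecutionException e) {
            throw new RuntimeException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

With a high-latency store, the wall-clock cost of the first call would drop roughly in proportion to the pool size, and later calls for the same paths would be served from the cache. Whether cached statuses can go stale is a separate question this sketch does not address.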



--
This message was sent by Atlassian JIRA
(v6.2#6252)
