[ 
https://issues.apache.org/jira/browse/SPARK-2443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14059722#comment-14059722
 ] 

Zongheng Yang commented on SPARK-2443:
--------------------------------------

I can reproduce this behavior locally with 2 million rows of key-value pairs 
stored as a text file, so the issue is probably not Parquet-specific. Comparing 
a non-partitioned table against the same data partitioned into 1 part, I see a 
more than 10x difference in performance.

In the non-partitioned case, this is the `inputRdd` in HiveTableScan:

MapPartitionsRDD[15] at mapPartitions at TableReader.scala:102 (2 partitions)
  MappedRDD[14] at map at TableReader.scala:222 (2 partitions)
    HadoopRDD[13] at HadoopRDD at TableReader.scala:212 (2 partitions)

In the partitioned case, there's an extra UnionRDD:

UnionRDD[4] at UnionRDD at TableReader.scala:183 (2 partitions)
  MapPartitionsRDD[3] at mapPartitions at TableReader.scala:164 (2 partitions)
    MappedRDD[2] at map at TableReader.scala:222 (2 partitions)
      HadoopRDD[1] at HadoopRDD at TableReader.scala:212 (2 partitions)

Constructing these two RDDs takes about the same time, so the difference in 
performance must lie in actually computing them. Will investigate further.
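
For anyone who wants to repeat this kind of measurement, here is a rough sketch 
of timing the "build" phase and the "compute" phase separately. It is a toy 
model in plain Python, not Spark code: `time_phases`, `flat_scan` and 
`union_scan` are hypothetical names made up for illustration, and the generators 
only stand in for lazily evaluated RDDs. It does not reproduce the slowdown 
itself; it only shows how the two phases can be separated.

```python
import itertools
import time
from typing import Callable, Iterable, Iterator, Tuple


def time_phases(build: Callable[[], Iterable]) -> Tuple[float, float, int]:
    """Return (build_seconds, compute_seconds, row_count) for a lazy pipeline."""
    t0 = time.perf_counter()
    pipeline = build()                 # lazy: no rows are produced yet
    t1 = time.perf_counter()
    rows = sum(1 for _ in pipeline)    # forces evaluation, similar to COUNT(*)
    t2 = time.perf_counter()
    return t1 - t0, t2 - t1, rows


def flat_scan(n: int) -> Iterator[Tuple[str, str]]:
    """Hypothetical stand-in for the non-partitioned scan: one lazy source."""
    return (("key%d" % i, "val%d" % i) for i in range(n))


def union_scan(n: int) -> Iterator[Tuple[str, str]]:
    """Hypothetical stand-in for the partitioned scan: a union over 1 part."""
    return itertools.chain.from_iterable([flat_scan(n)])


if __name__ == "__main__":
    for name, scan in (("flat", flat_scan), ("union", union_scan)):
        build_s, compute_s, rows = time_phases(lambda: scan(2_000_000))
        print("%s: build=%.6fs compute=%.6fs rows=%d"
              % (name, build_s, compute_s, rows))
```

If the build times match and only the compute times differ, the extra cost is 
paid during evaluation rather than during construction of the pipeline.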


> Reading from Partitioned Tables is Slow
> ---------------------------------------
>
>                 Key: SPARK-2443
>                 URL: https://issues.apache.org/jira/browse/SPARK-2443
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>            Reporter: Michael Armbrust
>            Assignee: Zongheng Yang
>
> Here are some numbers; all queries return ~20 million:
> {code}
> SELECT COUNT(*) FROM <non partitioned table>
> 5.496467726 s
> SELECT COUNT(*) FROM <partitioned table stored in parquet>
> 50.266666947 s
> SELECT COUNT(*) FROM <same table as previous but loaded with parquetFile 
> instead of through hive>
> 2s
> {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)
