[
https://issues.apache.org/jira/browse/SPARK-30114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dongjoon Hyun updated SPARK-30114:
----------------------------------
Affects Version/s: (was: 3.0.0)
3.1.0
> Improve LIMIT only query by partial listing files
> -------------------------------------------------
>
> Key: SPARK-30114
> URL: https://issues.apache.org/jira/browse/SPARK-30114
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.1.0
> Reporter: Lantao Jin
> Priority: Major
>
> We use Spark as ad-hoc query engine. Most of users' SELECT queries with LIMIT
> operation like
> 1) SELECT * FROM TABLE_A LIMIT N
> 2) SELECT colA FROM TABLE_A LIMIT N
> 3) CREATE TAB_B as SELECT * FROM TABLE_A LIMIT N
> If the TABLE_A is a large table (a RDD with thousands and thousands of
> partitions), the execution time would be very big since it has to list all
> files to build a RDD before execution. But almost time, the N is just like
> 10, 100, 1000, not very big. We don't need to scan all files. This
> optimization will create a *SinglePartitionReadRDD* to address it.
> In our production result, this optimization benefits a lot. The duration time
> of simple query with LIMIT could reduce 5~10 times. For example, before this
> optimization, a query on a table which has about one hundred thousands files
> would run over 30 seconds, after applying this optimization, the time
> decreased to 5 seconds.
> Should support both Spark DataSource Table and Hive Table which can be
> converted to DataSource table.
> Should support bucket table, partition table, normal table.
> Should support different file formats like parquet, orc.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]