[ 
https://issues.apache.org/jira/browse/SPARK-30114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30114:
----------------------------------
    Affects Version/s:     (was: 3.0.0)
                       3.1.0

> Improve LIMIT only query by partial listing files
> -------------------------------------------------
>
>                 Key: SPARK-30114
>                 URL: https://issues.apache.org/jira/browse/SPARK-30114
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.1.0
>            Reporter: Lantao Jin
>            Priority: Major
>
> We use Spark as ad-hoc query engine. Most of users' SELECT queries with LIMIT 
> operation like
> 1) SELECT * FROM TABLE_A LIMIT N
> 2) SELECT colA FROM TABLE_A LIMIT N
> 3) CREATE TAB_B as SELECT * FROM TABLE_A LIMIT N
> If the TABLE_A is a large table (a RDD with thousands and thousands of 
> partitions), the execution time would be very big since it has to list all 
> files to build a RDD before execution. But almost time, the N is just like 
> 10, 100, 1000, not very big. We don't need to scan all files. This 
> optimization will create a *SinglePartitionReadRDD* to address it.
> In our production result, this optimization benefits a lot. The duration time 
> of simple query with LIMIT could reduce 5~10 times. For example, before this 
> optimization, a query on a table which has about one hundred thousands files 
> would run over 30 seconds, after applying this optimization, the time 
> decreased to 5 seconds.
> Should support both Spark DataSource Table and Hive Table which can be 
> converted to DataSource table.
> Should support bucket table, partition table, normal table.
> Should support different file formats like parquet, orc.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to