We can try to add this as part of some Hive refactoring we are doing for 1.4. I've created a JIRA: https://issues.apache.org/jira/browse/SPARK-6910
On Tue, Apr 14, 2015 at 9:58 AM, Cheolsoo Park <piaozhe...@gmail.com> wrote:

> Is there a plan to fix this? I also ran into this issue with a
> "select * from tbl where ... limit 10" query. Spark SQL is 100x slower
> than Presto in the worst case (a 1.6M-partition table). This is a serious
> blocker for us, since we have many tables with near (and over) 1M
> partitions, and any query against these big tables wastes 5 minutes
> fetching the full partition info.
>
> I briefly looked at the code, and it looks like resolving metastore
> relations is the first thing the analyzer does, prior to any other
> optimization rules such as partition pruning. So in the Hive metastore
> client, it ends up calling getAllPartitions() with no filter expression.
> I am wondering how much work would be involved in fixing this issue. Can
> you please advise on what you think should be done?
>
> On Mon, Apr 13, 2015 at 3:27 PM, Michael Armbrust <mich...@databricks.com>
> wrote:
>
>> Yeah, we don't currently push down predicates into the metastore.
>> Though we do prune partitions based on predicates (so we don't read
>> the data).
>>
>> On Mon, Apr 13, 2015 at 2:53 PM, Tom Graves <tgraves...@yahoo.com.invalid>
>> wrote:
>>
>>> Hey,
>>>
>>> I was trying out Spark SQL using the HiveContext and doing a select on
>>> a partitioned table with lots of partitions (16,000+). It took over 6
>>> minutes before it even started the job. It looks like it was querying
>>> the Hive metastore and got a good chunk of data back, which I'm
>>> guessing is info on the partitions. Running the same query using Hive
>>> takes 45 seconds for the entire job.
>>>
>>> I know Spark SQL doesn't support all the Hive optimizations. Is this a
>>> known limitation currently?
>>>
>>> Thanks,
>>> Tom
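For anyone following along, here is a minimal illustrative sketch (Python, not Spark's actual code; all names are hypothetical) of the difference the thread is describing: fetching metadata for every partition and pruning on the client, versus pushing the filter into the metastore lookup. Both paths return the same answer, but the unfiltered path ships the whole catalog over the wire first.

```python
# Illustrative only: models the cost difference between fetching all
# partition metadata (analogue of getAllPartitions() with no filter)
# and pushing the predicate into the metastore lookup. A real fix would
# happen in the Hive metastore client, not in application code like this.

def get_all_partitions(catalog):
    """Analogue of getAllPartitions(): returns every partition's metadata."""
    return list(catalog)

def get_partitions_by_filter(catalog, predicate):
    """Analogue of a filtered metastore call: only matching partitions
    are materialized and returned to the client."""
    return [p for p in catalog if predicate(p)]

# A hypothetical table partitioned by date (336 partitions here; a real
# problem table would have hundreds of thousands or millions).
catalog = [{"ds": f"2015-{m:02d}-{d:02d}"}
           for m in range(1, 13) for d in range(1, 29)]

pred = lambda p: p["ds"] == "2015-04-13"

# Unfiltered path: pull metadata for every partition, then prune locally.
all_parts = get_all_partitions(catalog)
pruned = [p for p in all_parts if pred(p)]

# Filtered path: only the matching partition's metadata is fetched.
filtered = get_partitions_by_filter(catalog, pred)

assert pruned == filtered      # same result either way
assert len(all_parts) == 336   # but far more metadata fetched unfiltered
assert len(filtered) == 1
```

The point of the sketch is that pruning after the fetch (what the analyzer effectively does today) gives a correct plan but pays the full metadata cost up front, which is where the multi-minute delays on 1M-partition tables come from.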