We can try to add this as part of some Hive refactoring we are doing for 1.4. I've created a JIRA: https://issues.apache.org/jira/browse/SPARK-6910
On Tue, Apr 14, 2015 at 9:58 AM, Cheolsoo Park <piaozhe...@gmail.com> wrote:

> Is there a plan to fix this? I also ran into this issue with a
> "select * from tbl where ... limit 10" query. Spark SQL is 100x slower
> than Presto in the worst case (a 1.6M-partition table). This is a serious
> blocker for us, since we have many tables with near (and over) 1M
> partitions, and any query against these big tables wastes 5 minutes
> fetching the full partition info.
>
> I briefly looked at the code, and it looks like resolving metastore
> relations is the first thing the analyzer does, prior to any other
> optimization rules such as partition pruning. So in the Hive metastore
> client, it ends up calling getAllPartitions() with no filter expression.
> I am wondering how much work would be involved in fixing this issue. Can
> you please advise on what you think should be done?
>
> On Mon, Apr 13, 2015 at 3:27 PM, Michael Armbrust <mich...@databricks.com>
> wrote:
>
>> Yeah, we don't currently push down predicates into the metastore.
>> Though we do prune partitions based on predicates (so we don't read
>> the data).
>>
>> On Mon, Apr 13, 2015 at 2:53 PM, Tom Graves <tgraves...@yahoo.com.invalid>
>> wrote:
>>
>>> Hey,
>>>
>>> I was trying out Spark SQL using the HiveContext and doing a select on
>>> a partitioned table with lots of partitions (16,000+). It took over 6
>>> minutes before it even started the job. It looks like it was querying
>>> the Hive metastore and got a good chunk of data back, which I'm
>>> guessing is info on the partitions. Running the same query using Hive
>>> takes 45 seconds for the entire job.
>>>
>>> I know Spark SQL doesn't support all the Hive optimizations. Is this a
>>> known limitation currently?
>>>
>>> Thanks,
>>> Tom
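For anyone following along, here is a minimal illustrative sketch (Python, not Spark's actual code; all names are hypothetical) of the difference the thread is describing: fetching metadata for every partition and pruning on the client, versus pushing the filter into the metastore lookup. Both paths return the same answer, but the unfiltered path ships the whole catalog over the wire first.

```python
# Illustrative only: models the cost difference between fetching all
# partition metadata (analogue of getAllPartitions() with no filter)
# and pushing the predicate into the metastore lookup. A real fix would
# happen in the Hive metastore client, not in application code like this.

def get_all_partitions(catalog):
    """Analogue of getAllPartitions(): returns every partition's metadata."""
    return list(catalog)

def get_partitions_by_filter(catalog, predicate):
    """Analogue of a filtered metastore call: only matching partitions
    are materialized and returned to the client."""
    return [p for p in catalog if predicate(p)]

# A hypothetical table partitioned by date (336 partitions here; a real
# problem table would have hundreds of thousands or millions).
catalog = [{"ds": f"2015-{m:02d}-{d:02d}"}
           for m in range(1, 13) for d in range(1, 29)]

pred = lambda p: p["ds"] == "2015-04-13"

# Unfiltered path: pull metadata for every partition, then prune locally.
all_parts = get_all_partitions(catalog)
pruned = [p for p in all_parts if pred(p)]

# Filtered path: only the matching partition's metadata is fetched.
filtered = get_partitions_by_filter(catalog, pred)

assert pruned == filtered      # same result either way
assert len(all_parts) == 336   # but far more metadata fetched unfiltered
assert len(filtered) == 1
```

The point of the sketch is that pruning after the fetch (what the analyzer effectively does today) gives a correct plan but pays the full metadata cost up front, which is where the multi-minute delays on 1M-partition tables come from.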