[ https://issues.apache.org/jira/browse/HIVE-10891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14569286#comment-14569286 ]

Christian Dietze commented on HIVE-10891:
-----------------------------------------

It seems that the
[SimpleFetchOptimizer|https://github.com/apache/hive/blob/branch-1.1/ql/src/java/org/apache/hadoop/hive/ql/optimizer/SimpleFetchOptimizer.java]
acts a bit too aggressively here. From my understanding of the code,
there is a check whether the filter only affects columns that are partition keys;
in that case the threshold check is bypassed (see [line 147 of
SimpleFetchOptimizer|https://github.com/apache/hive/blob/branch-1.1/ql/src/java/org/apache/hadoop/hive/ql/optimizer/SimpleFetchOptimizer.java#L147]).
In the query above we filter on a non-partition column, yet the threshold check
is still bypassed because of [these
lines|https://github.com/apache/hive/blob/branch-1.1/ql/src/java/org/apache/hadoop/hive/ql/optimizer/SimpleFetchOptimizer.java#L200]:
{code:java}
if (PartitionPruner.onlyContainsPartnCols(table, pruner)) {
    bypassFilter = !pctx.getPrunedPartitions(alias, ts).hasUnknownPartitions();
}
{code}
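The effect of those lines can be illustrated with a simplified, hypothetical sketch (the class and method names below are illustrative, not Hive's actual API): the size threshold is skipped whenever the pruning expression references only partition columns and pruning left no "unknown" partitions. Since a predicate on a non-partition column is not part of the pruning expression at all, the partition-columns-only check passes trivially and the threshold is bypassed:

```java
import java.util.Set;

public class FetchBypassSketch {
    /** Hypothetical stand-in for PartitionPruner.onlyContainsPartnCols():
     *  true iff every column the pruner expression references is a partition column. */
    static boolean onlyContainsPartnCols(Set<String> prunerCols, Set<String> partnCols) {
        return partnCols.containsAll(prunerCols);
    }

    /** Mirrors the quoted lines: bypass the size-threshold check when
     *  pruning was exact (no unknown partitions). */
    static boolean bypassFilter(Set<String> prunerCols, Set<String> partnCols,
                                boolean hasUnknownPartitions) {
        if (onlyContainsPartnCols(prunerCols, partnCols)) {
            return !hasUnknownPartitions;
        }
        return false;
    }

    public static void main(String[] args) {
        Set<String> partnCols = Set.of("partition_key_column");
        // WHERE not_the_partition_key_column = 'xyz': the filter column is
        // invisible to the pruner, so the pruner's column set is empty and the
        // partition-columns-only check passes trivially.
        Set<String> prunerCols = Set.of();
        System.out.println(bypassFilter(prunerCols, partnCols, false)); // true
    }
}
```

With an empty pruner column set the bypass fires even though the actual WHERE clause touches a non-partition column, which matches the behavior reported in this issue.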

A workaround seems to be to put the optimizer on a leash by setting

{code:xml}
<property>
    <name>hive.fetch.task.conversion</name>
    <value>minimal</value>
</property>
{code}
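For ad-hoc testing, the same setting can also be applied per session from the Hive CLI or Beeline instead of hive-site.xml:

{code:sql}
-- Session-level equivalent of the hive-site.xml property above
SET hive.fetch.task.conversion=minimal;
{code}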

> Limited fetch on partitioned table can eat up all heap
> ------------------------------------------------------
>
>                 Key: HIVE-10891
>                 URL: https://issues.apache.org/jira/browse/HIVE-10891
>             Project: Hive
>          Issue Type: Bug
>          Components: Physical Optimizer
>    Affects Versions: 1.1.0
>            Reporter: Christoph Lipka
>
> When doing a query like 
> {code}
> select *
> from partitioned_table
> where not_the_partition_key_column = "xyz"
> limit 100
> {code}
> it is executed in memory. For all but the smallest tables this behavior
> quickly consumes the complete heap and crashes the server.
> If the limit clause is omitted, an MR job is started and the query executes
> without memory issues. One can also work around the problem by extending the
> query to also select the partition_key like 
> {code}
> select *
> from partitioned_table a
> where a.not_the_partition_key_column = "xyz"
> and a.partition_key_column = (select b.partition_key_column from 
> partitioned_table b)
> limit 100
> {code}
> In this case Hive also creates an MR job.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)