[ https://issues.apache.org/jira/browse/HIVE-1694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12970058#action_12970058 ]

John Sichi commented on HIVE-1694:
----------------------------------

I talked to Namit, and he thinks there should be no relevant dependencies on 
the QB once optimization starts, so letting it get out of sync with the 
operator DAG may not be an issue.  (I scanned the code in the optimizer, and it 
seems a few dependencies have crept in, but only for special cases like 
ANALYZE.)

For issue #1, you are proposing what I'll call the "internal SQL" approach, 
which is to construct an internal SQL expression (either in string or ASTNode 
form) and then partially analyze that (via SemanticAnalyzer), producing an 
operator DAG to be spliced into the main one.  For this approach, we would need 
to modularize the relevant phases of SemanticAnalyzer so they can be invoked 
separately.
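
Very roughly, I imagine the internal SQL approach looking something like the
untested sketch below; the generated query text and the final splice step are
just placeholders, and it assumes the current parse/analyze entry points:

import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hadoop.hive.ql.Context;
import org.apache.hadoop.hive.ql.parse.ASTNode;
import org.apache.hadoop.hive.ql.parse.ParseDriver;
import org.apache.hadoop.hive.ql.parse.SemanticAnalyzer;

public class InternalSqlRewriteSketch {
  // indexQuery is the generated internal SQL, e.g. a scan of the index
  // table that produces the filtered block list.
  public static void rewrite(HiveConf conf, String indexQuery) throws Exception {
    Context ctx = new Context(conf);

    // Parse the generated SQL into an ASTNode using the normal parser.
    ParseDriver pd = new ParseDriver();
    ASTNode tree = pd.parse(indexQuery, ctx);

    // Run semantic analysis to get an operator DAG / task tree for the
    // sub-query.  This is where we would want only the relevant phases of
    // SemanticAnalyzer to be modularized and invocable.
    SemanticAnalyzer sem = new SemanticAnalyzer(conf);
    sem.analyze(tree, ctx);

    // sem.getRootTasks() now holds the self-contained job; splicing it into
    // (or chaining it before) the main plan is the open question.
  }
}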

Alternately, the "direct construction" approach would be to attempt to 
construct the new operator subgraph directly via custom code targeted to the 
specific patterns you generate, and then splice that in.
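
For comparison, a direct construction sketch (again untested; the descriptor
contents are elided and the names are illustrative) would build the new
operators with OperatorFactory and rewire the parent/child lists by hand:

import org.apache.hadoop.hive.ql.exec.Operator;
import org.apache.hadoop.hive.ql.exec.OperatorFactory;
import org.apache.hadoop.hive.ql.exec.RowSchema;
import org.apache.hadoop.hive.ql.exec.TableScanOperator;
import org.apache.hadoop.hive.ql.plan.ExprNodeDesc;
import org.apache.hadoop.hive.ql.plan.FilterDesc;

public class DirectConstructionSketch {
  // Build a filter over the index table scan; getAndMakeChild wires up the
  // parent/child lists for us.
  static Operator<FilterDesc> addIndexFilter(TableScanOperator indexScan,
      ExprNodeDesc indexPredicate) {
    Operator<FilterDesc> filter = OperatorFactory.getAndMakeChild(
        new FilterDesc(indexPredicate, false),
        new RowSchema(indexScan.getSchema().getSignature()),
        indexScan);
    // Further operators (select of the bucket/offset columns, etc.) would be
    // chained the same way; splicing into the main DAG then means rewiring
    // the downstream operator's parent list by hand.
    return filter;
  }
}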

I'm not sure which approach is better; Namit, any opinions?  The internal SQL 
approach definitely seems the most appropriate for the WHERE clause work being 
done by the Harvey Mudd team, since it yields a self-contained job that 
produces the temp table containing the filtered block list.  But for GROUP BY, 
the direct construction approach may be cleaner.

For issue #2, it seems like this would happen automatically for the internal 
SQL approach (but this could also pollute the SemanticAnalyzer state to some 
extent).  The direct construction approach is the opposite:  it avoids 
polluting SemanticAnalyzer, but still might require modularizing some 
SemanticAnalyzer calls, e.g. for generating and registering the necessary 
aliases for index tables.

Regarding issue #3, that's already true for other optimizations such as 
projection pushdown (ColumnPruner), which modifies operator row 
schemas/resolvers; see for example ColumnPrunerProcFactory.pruneJoinOperator.  
So there shouldn't be anything new here.

Regarding the need to run your transformation first, it would be best to avoid 
this, since a more advanced optimizer may want the freedom to reorder 
transformations.  So instead of relying on information from the QB, analyze the 
relevant operator subgraph to decide whether your transformation is applicable; 
this is the approach we expect to require for cost-based optimization.
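
To illustrate, the applicability check could be driven off the operator graph 
with the same rule/walker machinery the other Transforms use, along the lines 
of the sketch below (the rule string and processor body are just placeholders):

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Stack;
import org.apache.hadoop.hive.ql.lib.DefaultGraphWalker;
import org.apache.hadoop.hive.ql.lib.DefaultRuleDispatcher;
import org.apache.hadoop.hive.ql.lib.Dispatcher;
import org.apache.hadoop.hive.ql.lib.GraphWalker;
import org.apache.hadoop.hive.ql.lib.Node;
import org.apache.hadoop.hive.ql.lib.NodeProcessor;
import org.apache.hadoop.hive.ql.lib.NodeProcessorCtx;
import org.apache.hadoop.hive.ql.lib.Rule;
import org.apache.hadoop.hive.ql.lib.RuleRegExp;
import org.apache.hadoop.hive.ql.optimizer.Transform;
import org.apache.hadoop.hive.ql.parse.ParseContext;
import org.apache.hadoop.hive.ql.parse.SemanticException;

public class IndexWhereOptimizerSketch implements Transform {
  public ParseContext transform(ParseContext pctx) throws SemanticException {
    // Look for TableScan followed by Filter; the processor then decides from
    // the matched subgraph (not the QB) whether an index applies.
    Map<Rule, NodeProcessor> rules = new LinkedHashMap<Rule, NodeProcessor>();
    rules.put(new RuleRegExp("R1", "TS%FIL%"), new NodeProcessor() {
      public Object process(Node nd, Stack<Node> stack, NodeProcessorCtx ctx,
          Object... nodeOutputs) throws SemanticException {
        // Check index metadata / filter columns here and record the
        // candidate rewrite in a processor context.
        return null;
      }
    });

    Dispatcher disp = new DefaultRuleDispatcher(null, rules, null);
    GraphWalker walker = new DefaultGraphWalker(disp);
    ArrayList<Node> topNodes = new ArrayList<Node>(pctx.getTopOps().values());
    walker.startWalking(topNodes, null);
    return pctx;
  }
}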

Also, note that from a lineage perspective, it makes more sense for lineage to 
be derived before the index transformation rather than after it.  Someone 
examining the lineage associated with an ETL job would typically be more 
interested in the logical source table from which it pulls than in a physical 
index.


> Accelerate query execution using indexes
> ----------------------------------------
>
>                 Key: HIVE-1694
>                 URL: https://issues.apache.org/jira/browse/HIVE-1694
>             Project: Hive
>          Issue Type: New Feature
>          Components: Indexing, Query Processor
>    Affects Versions: 0.7.0
>            Reporter: Nikhil Deshpande
>            Assignee: Nikhil Deshpande
>         Attachments: demo_q1.hql, demo_q2.hql, HIVE-1694_2010-10-28.diff
>
>
> The index building patch (HIVE-417) is checked into trunk; this JIRA issue 
> tracks support for indexes in the Hive compiler & execution engine for SELECT 
> queries.
> This is in ref. to John's comment at
> https://issues.apache.org/jira/browse/HIVE-417?focusedCommentId=12884869&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12884869
> on creating separate JIRA issue for tracking index usage in optimizer & query 
> execution.
> The aim of this effort is to use indexes to accelerate query execution (for 
> a certain class of queries), e.g.:
> - Filters and range scans (already being worked on by He Yongqiang as part of 
> HIVE-417?)
> - Joins (index based joins)
> - Group By, Order By and other misc cases
> The proposal is multi-step:
> 1. Building index-based operators, compiler and execution engine changes
> 2. Optimizer enhancements (e.g. cost-based optimizer to compare and choose 
> between index scans, full table scans etc.)
> This JIRA initially focuses on the first step, and is expected to hold the 
> information about index-based plans & operator implementations for the 
> above-mentioned cases. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
