[
https://issues.apache.org/jira/browse/HIVE-1694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13000145#comment-13000145
]
Prajakta Kalmegh commented on HIVE-1694:
----------------------------------------
We are currently working on implementing a new index type to get a correct
COUNT for group-by queries that are re-written to use index table instead of
base table. We have three implementation options as listed below:
1. Create a new index handler called 'AggregationIndexHandler' which
pre-computes the count on indexed columns
In this appraoch, we plan to implement a new index type that stores the
COUNT of the indexed column per index entry as suggested by John in November
1st 2010's comment. The 'AggregationIndexHandler' will override the default
implementation for 'analyzeIndexDefinition(...)' and
'generateIndexBuildTaskList(...)' methods. The 'analyzeIndexDefinition(...)'
method implementation will add a FieldSchema object for COUNT rather than
'offsets' in the StorageDescriptor for the index table. The
'generateIndexBuildTaskList(...)' method will be the same but there will be a
change in the implementation of the 'getIndexBuilderMapRedTask(...)' method
which is invoked within this method.
Currently, for the index rebuild on the following index creation,
CREATE INDEX lineitem_lshipdate_idx ON TABLE lineitem(l_shipdate) AS
'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler' WITH DEFERRED
REBUILD;
ALTER INDEX lineitem_lshipdate_idx ON lineitem REBUILD;
the 'getIndexBuilderMapRedTask(...)' method creates a command 'INSERT OVERWRITE
TABLE `default__lineitem_lineitem_lshipdate_idx__` SELECT
`l_shipdate`,INPUT__FILE__NAME, collect_set (BLOCK__OFFSET__INSIDE__FILE) FROM
`lineitem` GROUP BY `l_shipdate`, INPUT__FILE__NAME' in the
CompactIndexHandler. This command is later passed to the driver to compile and
rebuild the index which collects data from base table and stores it in index
table.
With the 'AggregationIndexHandler', we plan to create a command like ' INSERT
OVERWRITE TABLE `default__lineitem_lineitem_lshipdate_idx__` SELECT
`l_shipdate`,INPUT__FILE__NAME, COUNT(`l_shipdate`) FROM `lineitem` GROUP BY
`l_shipdate`, INPUT__FILE__NAME' which will count the number of unique indexed
column entries.
In the current implementation of our optimizer, we are doing a
'size(_offsets)' to replace the COUNT in group-by queries. We need to change
this implementation if we use the AggregationIndexHandler. This will still give
a comparable performance improvement on the original queries that will,
otherwise, scan full tables for a COUNT of the indexed columns.
Note: We plan to go ahead with creating a new HiveIndexHandler type rather than
changing the implementation of the CompactIndexHandler to let in some
flexibility in the use of AggregationIndexHandler only for cases where it is
required.
2. Implement a 'CompactRowIndexHandler' similar to 'CompactIndexHandler' but it
stores the ROW_OFFSET_INSIDE_FILE instead of BLOCK_OFFSET_INSIDE_FILE
We create a new index handler called 'CompactRowIndexHandler'. This
will give us a ROW_OFFSET_INSIDE_FILE array<bigint> and we can compute the
COUNT for group-by queries by doing a s'ize(ROW_OFFSET_INSIDE_FILE)' on it.
The command will look like,
' INSERT OVERWRITE TABLE `default__lineitem_lineitem_lshipdate_idx__` SELECT
`l_shipdate`,INPUT__FILE__NAME, collect_set (ROW_OFFSET_INSIDE_FILE) FROM
`lineitem` GROUP BY `l_shipdate`, INPUT__FILE__NAME' which will give us an
array of all row offsets in the input file grouped by indexed column.
We had a look at the HIVE-1803 patch. If we re-use the implementation changes
done in Hive1803 to classes HiveContextAwareRecordReader.java, IOContext.java,
VirtualColumn.java and MapOperator.java, we can implement
'CompactRowIndexHandler.java' to support the above index type.
3. Use BitMapIndexHandler without creating a new index handler
Another approach is to use the Hive1803 BitMapIndexHandler to get a
count of all those bits in the bitmap array for which the row_offset is ON.
This will indeed give us a correct count for all the rows that contain the
indexed column. However, we will have multiple counts per block entry. For
this, we may need to do the SUM again to get a correct count per indexed column
entry. Although this is do-able, it will be somewhat complex to implement.
Please let us know what you think of the proposed solutions. If you have a
better implementation approach in mind, please do share it with us.
> Accelerate GROUP BY execution using indexes
> -------------------------------------------
>
> Key: HIVE-1694
> URL: https://issues.apache.org/jira/browse/HIVE-1694
> Project: Hive
> Issue Type: New Feature
> Components: Indexing, Query Processor
> Affects Versions: 0.7.0
> Reporter: Nikhil Deshpande
> Assignee: Nikhil Deshpande
> Attachments: HIVE-1694.1.patch.txt, HIVE-1694_2010-10-28.diff,
> demo_q1.hql, demo_q2.hql
>
>
> The index building patch (Hive-417) is checked into trunk, this JIRA issue
> tracks supporting indexes in Hive compiler & execution engine for SELECT
> queries.
> This is in ref. to John's comment at
> https://issues.apache.org/jira/browse/HIVE-417?focusedCommentId=12884869&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12884869
> on creating separate JIRA issue for tracking index usage in optimizer & query
> execution.
> The aim of this effort is to use indexes to accelerate query execution (for
> certain class of queries). E.g.
> - Filters and range scans (already being worked on by He Yongqiang as part of
> HIVE-417?)
> - Joins (index based joins)
> - Group By, Order By and other misc cases
> The proposal is multi-step:
> 1. Building index based operators, compiler and execution engine changes
> 2. Optimizer enhancements (e.g. cost-based optimizer to compare and choose
> between index scans, full table scans etc.)
> This JIRA initially focuses on the first step. This JIRA is expected to hold
> the information about index based plans & operator implementations for above
> mentioned cases.
--
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira