[ 
https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12891274#action_12891274
 ] 

Ning Zhang commented on HIVE-417:
---------------------------------

Based on some internal discussions, below are some comments about the design doc:

1) The staleness (inconsistency) between the index and the base table should be 
addressed more precisely. 
   Since the current implementation allows the user to query the index table 
directly, we should guarantee that the index is consistent with the base table 
at query time. This means that at the query START time, the index was built 
completely from the data stored in the base table. The current design does not 
satisfy this criterion because it only records the last_modification_time 
(LMT) of the base table and of the index table, and checks whether the latter 
is larger than the former. The following example breaks that check:

timestamp0: last update of partition P1
timestamp1: start create index on partition P1
timestamp2: start insert overwrite P1
timestamp3: finish insert overwrite P1
timestamp4: finish index creation on P1
timestamp5: query on P1

The LMTs of the index and the base table are timestamp4 and timestamp3 
respectively, so the optimizer will conclude that the index is consistent with 
the base table. However, the index was built from the data as of timestamp0, 
which is stale by the time of the query at timestamp5, so the index should not 
be used. 

Instead of recording the LMT of the index table, we should probably record the 
LMT of the base table in the index metadata at the beginning of the index 
creation. In the above example, the timestamp recorded in the index metadata 
would be timestamp0, meaning the index was created from the base table as of 
timestamp0. At query time, we would check timestamp0 against timestamp3 (the 
current LMT of the base table), which correctly concludes that the index is 
stale. 
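
To make the difference between the two checks concrete, here is a minimal 
sketch in Python that replays the timeline above. The function names are 
hypothetical and are not part of Hive, and treating "check timestamp0 against 
timestamp3" as an equality test is an assumption of this sketch:

```python
# Timeline from the example, with timestampN represented as the integer N.
T0, T1, T2, T3, T4, T5 = range(6)

def fresh_by_index_lmt(index_lmt, base_lmt):
    """Current design: the index is considered fresh if it was modified
    after the base table was last modified."""
    return index_lmt > base_lmt

def fresh_by_recorded_base_lmt(recorded_base_lmt, base_lmt):
    """Proposed check (assumed here to be an equality test): the index is
    fresh only if the base table has not changed since the LMT that was
    recorded when the index build started."""
    return recorded_base_lmt == base_lmt

# State seen by the query at timestamp5.
base_lmt_at_query = T3    # insert overwrite finished at timestamp3
index_lmt = T4            # index creation finished at timestamp4
recorded_base_lmt = T0    # base table LMT when the index build started

current_design_says_fresh = fresh_by_index_lmt(index_lmt, base_lmt_at_query)
proposed_says_fresh = fresh_by_recorded_base_lmt(recorded_base_lmt,
                                                 base_lmt_at_query)
print(current_design_says_fresh, proposed_says_fresh)
```

Under these assumptions the first check accepts the stale index and the 
second one rejects it.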

BTW, all the timestamps should come from some centralized clock, such as the 
DFS directory update time (from the namenode).

2) The above consistency problem is not only present in the case of "DEFERRED 
REBUILD". Even if the index rebuild starts right after INSERT OVERWRITE, there 
is still a time window during which the index is stale (before the index 
creation is complete). So we need the same mechanism to detect stale indexes. 

3) I think lock-based concurrency control may not be the best choice either. 
If the index creation takes a long time, it delays the availability of the 
base table. If we have the optimizer, we should always query against the base 
tables and let the optimizer figure out whether an index is available and 
fresh. So if an index creation is not finished, we can just use the base 
table; otherwise we can use the index if its cost is lower. 
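
The decision described in this point could look roughly like the following 
Python sketch. The function, its parameters, and the cost values are all 
hypothetical; it only illustrates the rule "fall back to the base table unless 
a finished, fresh index is cheaper":

```python
def choose_access_path(index_built, index_fresh, index_cost, base_cost):
    """Return "index" only when the index build has finished, the index is
    fresh, and using it is estimated to be cheaper; otherwise return
    "base_table". No lock on the base table is needed for this decision."""
    if index_built and index_fresh and index_cost < base_cost:
        return "index"
    return "base_table"

# Index creation still running: the query reads the base table.
print(choose_access_path(False, False, index_cost=10, base_cost=100))
# Index finished and fresh, and cheaper than a base table scan.
print(choose_access_path(True, True, index_cost=10, base_cost=100))
# Index finished but stale: the query reads the base table.
print(choose_access_path(True, False, index_cost=10, base_cost=100))
```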

4) Another case is when the index creation has finished and a query is using 
the index, and then a DML operation happens on the base table and finishes 
before the query finishes. Here we only guarantee snapshot consistency (results 
consistent with the data at the beginning of the query, not after the query). 

5) If we have the mechanism to check consistency of the index, then the "index 
rebuild" command could just return if the index is consistent. We can also 
allow a "force" option in case we need to compensate for bad metadata. 
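
A rough Python sketch of such a rebuild command follows. The names are 
hypothetical (this is not the Hive implementation), and it assumes the index 
metadata stores the base table LMT captured at the start of the last build:

```python
def rebuild_index(recorded_base_lmt, current_base_lmt, build, force=False):
    """Rebuild the index unless it is already consistent with the base table.

    recorded_base_lmt: base table LMT stored in the index metadata
                       (None if the index has never been built).
    current_base_lmt:  current LMT of the base table.
    build:             callable that performs the actual index build.
    force:             rebuild even if the metadata says the index is fresh.

    Returns (rebuilt, new_recorded_base_lmt).
    """
    if not force and recorded_base_lmt == current_base_lmt:
        # Index is consistent: nothing to do.
        return False, recorded_base_lmt
    # Capture the base table LMT before the build starts, so a concurrent
    # overwrite during the build makes the index look stale afterwards.
    snapshot = current_base_lmt
    build()
    return True, snapshot

builds = []
print(rebuild_index(3, 3, lambda: builds.append("run")))              # no-op
print(rebuild_index(0, 3, lambda: builds.append("run")))              # stale
print(rebuild_index(3, 3, lambda: builds.append("run"), force=True))  # forced
```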

> Implement Indexing in Hive
> --------------------------
>
>                 Key: HIVE-417
>                 URL: https://issues.apache.org/jira/browse/HIVE-417
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Metastore, Query Processor
>    Affects Versions: 0.3.0, 0.3.1, 0.4.0, 0.6.0
>            Reporter: Prasad Chakka
>            Assignee: He Yongqiang
>         Attachments: hive-417.proto.patch, hive-417-2009-07-18.patch, 
> hive-indexing-8-thrift-metastore-remodel.patch, hive-indexing.3.patch, 
> hive-indexing.5.thrift.patch, idx2.png, 
> indexing_with_ql_rewrites_trunk_953221.patch
>
>
> Implement indexing on Hive so that lookup and range queries are efficient.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
