[ https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12891274#action_12891274 ]
Ning Zhang commented on HIVE-417:
---------------------------------

Based on some internal discussions, below are some comments about the design doc:

1) The staleness (inconsistency) between the index and the base table should be addressed more precisely. Since the current implementation allows the user to query the index table directly, we should guarantee that the index is consistent with the base table at query time. This means that at the query START time, the index was built completely from the data stored in the base table. The current design does not satisfy this criterion: it only records the last_modification_time (LMT) of the base table and of the index table, and checks whether the latter is larger than the former. The following sequence breaks that check:

  timestamp0: last update of partition P1
  timestamp1: start create index on partition P1
  timestamp2: start insert overwrite P1
  timestamp3: finish insert overwrite P1
  timestamp4: finish index creation on P1
  timestamp5: query on P1

The LMTs of the index and the base table are timestamp4 and timestamp3 respectively, so the optimizer will conclude that the index is consistent with the base table. However, the index was built from the data as of timestamp0, which is stale by the time of the query at timestamp5, so the index should not be used.

Instead of recording the LMT of the index table, we should probably record the LMT of the base table in the index metadata at the beginning of the index creation. In the above example, the timestamp recorded in the index metadata would be timestamp0, meaning the index was created from the base table as of timestamp0. At query time, we should check timestamp0 against timestamp3, which correctly concludes that the index is stale. BTW, all the timestamps should come from some centralized clock, such as the DFS directory update time (from the namenode).

2) The above consistency problem is not limited to the "DEFERRED REBUILD" case.
Even if the index rebuild starts right after INSERT OVERWRITE, there is still a time window during which the index is stale (before the index creation is complete). So we need the same mechanism to detect stale indexes.

3) I think lock-based concurrency may not be the best choice either. If the index creation takes a long time, it delays the availability of the base table. If we have the optimizer, we should always query against the base tables and let the optimizer figure out whether an index is available and fresh. So if an index creation is not finished, we can just use the base table; otherwise we can use the index if doing so costs less.

4) Another case: the index creation has finished and a query is using the index, and then a DML operation on the base table starts and finishes before the query finishes. Here we only guarantee snapshot consistency (results consistent with the data at the beginning of the query, not after the query).

5) If we have a mechanism to check the consistency of the index, then the "index rebuild" command could just return if the index is consistent. We can also allow a "force" option in case we need to compensate for bad metadata.

> Implement Indexing in Hive
> --------------------------
>
> Key: HIVE-417
> URL: https://issues.apache.org/jira/browse/HIVE-417
> Project: Hadoop Hive
> Issue Type: New Feature
> Components: Metastore, Query Processor
> Affects Versions: 0.3.0, 0.3.1, 0.4.0, 0.6.0
> Reporter: Prasad Chakka
> Assignee: He Yongqiang
> Attachments: hive-417.proto.patch, hive-417-2009-07-18.patch,
> hive-indexing-8-thrift-metastore-remodel.patch, hive-indexing.3.patch,
> hive-indexing.5.thrift.patch, idx2.png,
> indexing_with_ql_rewrites_trunk_953221.patch
>
>
> Implement indexing on Hive so that lookup and range queries are efficient.
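
To illustrate the staleness check discussed in point 1 of the comment, here is a minimal Python sketch that replays the timestamp0..timestamp5 example against both checks. It is not Hive code: the `IndexMetadata` class, its field names, and both functions are hypothetical, timestamps are modelled as plain integers, and treating "check timestamp0 against timestamp3" as an equality test is an assumption.

```python
from dataclasses import dataclass


@dataclass
class IndexMetadata:
    # Hypothetical fields for illustration; not Hive's metastore schema.
    index_lmt: int           # LMT of the index table (recorded by the current design)
    base_lmt_at_build: int   # base table LMT captured when the build started (proposal)


def lmt_comparison_says_fresh(index: IndexMetadata, base_lmt: int) -> bool:
    """Current design: trust the index if its LMT is larger than the base table's."""
    return index.index_lmt > base_lmt


def recorded_base_lmt_says_fresh(index: IndexMetadata, base_lmt: int) -> bool:
    """Proposal: trust the index only if the base table LMT is unchanged
    since the index build started (equality is an assumption here)."""
    return index.base_lmt_at_build == base_lmt


# Timeline from the comment: the build starts after timestamp0 and finishes at
# timestamp4; the insert overwrite finishes at timestamp3.
index = IndexMetadata(index_lmt=4, base_lmt_at_build=0)
base_lmt_at_query = 3

current_design_result = lmt_comparison_says_fresh(index, base_lmt_at_query)
proposed_result = recorded_base_lmt_says_fresh(index, base_lmt_at_query)

print("current design trusts index:", current_design_result)
print("proposed check trusts index:", proposed_result)
```

In this sketch the LMT comparison accepts the index because 4 > 3, while the recorded-base-LMT check rejects it because the base table moved from 0 to 3 after the build started.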