[ 
https://issues.apache.org/jira/browse/PHOENIX-5018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16734512#comment-16734512
 ] 

Kadir OZDEMIR commented on PHOENIX-5018:
----------------------------------------

*Background*

There are two ways to fully build an index: synchronously as part of the index 
create command, and asynchronously using IndexTool. The synchronous full build 
is done by using an UPSERT SELECT statement where data is read from the data 
table and inserted into the index table. The UPSERT SELECT code path does not 
include any logic specific to index rebuild, and thus the HBase timestamps for 
the insert operation are taken from the current wall clock. This is the root 
cause of the incorrect timestamps on index mutations generated by UPSERT 
SELECT.

IndexTool is implemented using MapReduce. There are two types of mappers for 
IndexTool: PhoenixIndexImportMapper and PhoenixIndexImportDirectMapper. The 
former is used for the bulk loading option and the latter for the direct option 
of IndexTool. For both of these options, PhoenixIndexDBWritable is used to 
read and write table rows. Reading rows is done through the ResultSet interface 
and writing is done through the PreparedStatement interface. Here, scanning the 
data table (the SELECT statement) and inserting new rows into the index table 
(the UPSERT statement) are executed separately. Since the PreparedStatement 
interface, and thus the PhoenixPreparedStatement class, does not have a method 
to specify the timestamps for individual columns, the index table rows get 
their timestamps from the current wall clock. This is the root cause of the 
incorrect timestamps on index mutations generated by an UPSERT statement.
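
To make the effect concrete, here is a minimal sketch in plain Java. It does 
not use any HBase or Phoenix classes; SketchCell and IndexBuildSketch are 
hypothetical stand-ins that I made up for illustration, and the TTL check is 
only an approximation of how a store would expire cells. It contrasts stamping 
index cells with the build-time clock against inheriting the timestamp of the 
source cell.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for an HBase cell: row key, value, version timestamp. */
final class SketchCell {
    final String row;
    final String value;
    final long timestamp;

    SketchCell(String row, String value, long timestamp) {
        this.row = row;
        this.value = value;
        this.timestamp = timestamp;
    }
}

final class IndexBuildSketch {
    /** Models the current behavior: every index cell gets the build-time clock. */
    static List<SketchCell> buildWithWallClock(List<SketchCell> dataCells, long now) {
        List<SketchCell> index = new ArrayList<>();
        for (SketchCell c : dataCells) {
            // Index row key is the indexed value followed by the data row key.
            index.add(new SketchCell(c.value + "|" + c.row, "", now));
        }
        return index;
    }

    /** Models the desired behavior: each index cell inherits its source timestamp. */
    static List<SketchCell> buildPreservingTimestamps(List<SketchCell> dataCells) {
        List<SketchCell> index = new ArrayList<>();
        for (SketchCell c : dataCells) {
            index.add(new SketchCell(c.value + "|" + c.row, "", c.timestamp));
        }
        return index;
    }

    /** Simplified TTL rule: a cell is expired once its age exceeds the TTL. */
    static boolean isExpired(SketchCell c, long now, long ttlMillis) {
        return now - c.timestamp > ttlMillis;
    }
}
```

With a data cell written at time 1000, a build at time 10000, and a TTL of 
5000, the data cell and the timestamp-preserving index cell are both expired, 
while the wall-clock index cell is not. This is the kind of mismatch described 
in the issue.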

In addition to building an index fully, there is also a way to rebuild an index 
partially to recover from index write failures, which can happen during data 
table updates. When such a failure happens on an index table, the index is 
disabled by setting the INDEX_STATE column of the row corresponding to the 
index table in SYSTEM.CATALOG. The timestamp associated with the failed write 
is also recorded in the INDEX_DISABLE_TIMESTAMP column of the same row. This 
is done in the UpdateIndexState method of IndexUtil. 

MetaDataRegionObserver runs a periodic task to go through SYSTEM.CATALOG to 
identify index tables that are ready for index rebuild. Using the 
INDEX_DISABLE_TIMESTAMP value, the rows of the data table to be replayed to 
rebuild the index are identified. During these replay writes, the timestamp 
values in the data table are correctly passed to the index table. In other 
words, the partial index rebuild does not have the timestamp problem that 
exists in full index rebuilds. 

*Alternative Solutions*

There are two main approaches to solving the timestamp problem. 

The first one is to enhance the PhoenixIndexDBWritable class and the UPSERT 
and UPSERT SELECT code paths with the ability to retrieve and set HBase cell 
timestamps. This requires adding a new hint (e.g., TIMESTAMP) to UPSERT and 
UPSERT SELECT statements. For UPSERT, the hint implies that the provided 
timestamp values should be used instead of the current wall clock. For UPSERT 
SELECT, the hint means that cell timestamps should be retrieved by the SELECT 
statement and passed to the UPSERT statement. The following changes have been 
identified after a high-level inspection of the code. 

PhoenixIndexDBWritable:
 * Change the type of the rowTs property from long to List<Long>
 * Change the methods accessing rowTs (i.e., write and getRowTs) accordingly
 * Add the timestamp hint to the UPSERT statement
 * Pass the timestamps to the PreparedStatement argument. To do that, unwrap it 
as PhoenixPreparedStatement and call setParameterTimestamp(int parameterIndex, 
long timestamp), which would be introduced as part of this solution alternative

PhoenixPreparedStatement:
 * Add a method (e.g., void setParameterTimestamp(int parameterIndex, long 
timestamp))
 * Add a property called timestamps (List<Long> timestamps)

PhoenixSQL grammar:
 * Add a new hint for preserving timestamps when populating an index from a 
data table, e.g., TIMESTAMP

HintNode:
 * Add the new hint for timestamps

UpsertCompiler:
 * Various parts of the UpsertCompiler code need to be changed when the 
timestamp hint is given (1) to retrieve cell timestamps and create mutations 
with these timestamps for UPSERT SELECT statements and (2) to use the 
parameter timestamps in PhoenixPreparedStatement objects to create the 
mutations for UPSERT statements.

This list may not be complete and more changes may be required. As can be seen 
from the identified changes, the first alternative requires surgical changes in 
the very core of Phoenix. It also requires a change to the SQL grammar (i.e., 
the external API) that may be used solely for internal purposes.
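
As a rough sketch of the per-parameter timestamp bookkeeping that the first 
alternative would add, consider the following. TimestampedParameters is a 
hypothetical helper written for illustration only; it is not an existing 
Phoenix class. The method name setParameterTimestamp follows the proposal 
above, while resolveTimestamp and the use of a map instead of a List<Long> 
are my own simplifications. Parameter indexes are 1-based, as in JDBC.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical holder for the per-parameter timestamps of one statement. */
final class TimestampedParameters {
    private final int parameterCount;
    // Maps a 1-based parameter index to the cell timestamp requested for it.
    private final Map<Integer, Long> timestamps = new HashMap<>();

    TimestampedParameters(int parameterCount) {
        this.parameterCount = parameterCount;
    }

    /** Records the timestamp to use for the cell written from this parameter. */
    void setParameterTimestamp(int parameterIndex, long timestamp) {
        if (parameterIndex < 1 || parameterIndex > parameterCount) {
            throw new IllegalArgumentException("Bad parameter index: " + parameterIndex);
        }
        if (timestamp < 0) {
            throw new IllegalArgumentException("Negative timestamp: " + timestamp);
        }
        timestamps.put(parameterIndex, timestamp);
    }

    /** Returns the explicit timestamp if one was set, else the wall clock value. */
    long resolveTimestamp(int parameterIndex, long wallClock) {
        Long explicit = timestamps.get(parameterIndex);
        return explicit != null ? explicit : wallClock;
    }
}
```

The fallback to the wall clock keeps the existing behavior for statements that 
never set a timestamp, so only callers that opt in (here, the index build) 
would see a difference.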

The second alternative solution is to use MetaDataRegionObserver for both full 
index builds and partial index rebuilds. This solution implies abandoning 
UPSERT SELECT for synchronous full index builds and MapReduce for asynchronous 
full index builds. In fact, with this alternative, we do not need to 
distinguish between partial and full index builds, or between asynchronous and 
synchronous builds, since one code path will be used for all purposes. This 
solution requires setting INDEX_DISABLE_TIMESTAMP to a predefined value (or 
the data table creation time) after an index table is created so that all the 
rows of the data table will be considered by MetaDataRegionObserver (i.e., the 
partial rebuild will be turned into a full index build). This also implies 
that all full index builds will be asynchronous, i.e., the full build will not 
happen in the context of index table creation. 
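
The idea of turning the partial rebuild into a full build can be sketched as 
follows. This is plain Java with hypothetical names (DataRow, RebuildSketch, 
rowsToReplay), not the actual MetaDataRegionObserver code, and it assumes that 
a row is replayed when its timestamp is at or after INDEX_DISABLE_TIMESTAMP. 
The exact comparison used by Phoenix may differ.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a data table row with its cell timestamp. */
final class DataRow {
    final String key;
    final long timestamp;

    DataRow(String key, long timestamp) {
        this.key = key;
        this.timestamp = timestamp;
    }
}

final class RebuildSketch {
    /**
     * Selects the rows to replay into the index. The rows are returned as is,
     * so their original timestamps travel with them to the index writes.
     * Assumption: rows at or after the disable timestamp are replayed.
     */
    static List<DataRow> rowsToReplay(List<DataRow> dataRows, long indexDisableTimestamp) {
        List<DataRow> replay = new ArrayList<>();
        for (DataRow row : dataRows) {
            if (row.timestamp >= indexDisableTimestamp) {
                replay.add(row);
            }
        }
        return replay;
    }
}
```

With a disable timestamp taken from a failed write, only the rows written 
since then are selected. With a predefined value of 0, every row qualifies, 
so the same code path performs a full build.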

The second solution requires fewer code changes and simplifies the current 
index implementation by eliminating the IndexTool code and its bugs. The main 
drawback of the second solution is that the current MetaDataRegionObserver is 
single threaded and rebuilds one index at a time. However, it can be enhanced 
to be multithreaded, which will benefit not only full index builds but also 
partial index rebuilds. Thus, I am inclined to implement the second 
alternative but would like to hear comments, as the implications of the second 
alternative go beyond fixing this issue. I also want to make sure I did not 
miss something important in my analysis, [~gjacoby], [~vincentpoon].

> Index mutations created by IndexTool will have wrong timestamps
> ---------------------------------------------------------------
>
>                 Key: PHOENIX-5018
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-5018
>             Project: Phoenix
>          Issue Type: Bug
>    Affects Versions: 4.14.0, 5.0.0
>            Reporter: Geoffrey Jacoby
>            Assignee: Kadir OZDEMIR
>            Priority: Major
>
> When doing a full rebuild (or initial async build) on an index using the 
> IndexTool and PhoenixIndexImportDirectMapper, we generate the index mutations 
> by creating an UPSERT SELECT query from the base table to the index, then 
> taking the Mutations from it and inserting it directly into the index via an 
> HBase HTable. 
> The timestamps of the Mutations use the default HBase behavior, which is to 
> take the current wall clock. However, the timestamp of an index KeyValue 
> should use the timestamp of the initial KeyValue in the base table.
> Having base table and index timestamps out of sync can cause all sorts of 
> weird side effects, such as if the base table has data with an expired TTL 
> that isn't expired in the index yet. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
