[
https://issues.apache.org/jira/browse/PHOENIX-5018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16734512#comment-16734512
]
Kadir OZDEMIR commented on PHOENIX-5018:
----------------------------------------
*Background*
There are two ways to fully build an index: synchronously as part of the index
create command, and asynchronously using IndexTool. The synchronous full build
is done by using an UPSERT SELECT statement where data is read from the data
table and inserted into the index table. The UPSERT SELECT code path does not
include any logic special to index rebuild, and thus, the values for HBase
timestamps for the insert operation are taken from the current wall clock. This
is the root cause of the incorrect timestamps when index mutations are
generated by UPSERT SELECT.
IndexTool is implemented using MapReduce. There are two types of mapper for
IndexTool: PhoenixIndexImportMapper and PhoenixIndexImportDirectMapper. The
former is used for the bulk loading option and the latter for the direct option
of IndexTool. For both of these options, PhoenixIndexDBWritable is used to
read and write table rows. Reading rows is done through the ResultSet interface
and writing is done through the PreparedStatement interface. Here, scanning
the data table (the SELECT statement) and inserting new rows into the
index table (the UPSERT statement) are executed separately. Since the
PreparedStatement interface and thus the PhoenixPreparedStatement class do not
have a method to specify the timestamps for individual columns, the index table
rows get their timestamps from the current wall clock. This is the root cause
of the incorrect timestamps when index mutations are generated by using an
UPSERT statement.
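The effect of the two code paths above can be illustrated with a small,
self-contained Java model. It is only a sketch: it does not use any Phoenix or
HBase classes, and the names Cell, buildIndexCellWithWallClock,
buildIndexCellPreservingTimestamp and isExpired are hypothetical. It assumes a
simple TTL rule (a cell expires once its age exceeds the TTL) to show how a
wall-clock timestamp lets an index cell outlive its data cell.

```java
// Toy model only: no Phoenix or HBase classes are used here.
final class TimestampDriftSketch {

    /** Hypothetical stand-in for an HBase cell: a value plus a write timestamp (ms). */
    static final class Cell {
        final String value;
        final long timestamp;

        Cell(String value, long timestamp) {
            this.value = value;
            this.timestamp = timestamp;
        }
    }

    /** Models the current full build paths: the index cell is stamped with the wall clock. */
    static Cell buildIndexCellWithWallClock(Cell dataCell, long nowMillis) {
        return new Cell(dataCell.value, nowMillis);
    }

    /** Models the desired behavior: the index cell inherits the data cell's timestamp. */
    static Cell buildIndexCellPreservingTimestamp(Cell dataCell) {
        return new Cell(dataCell.value, dataCell.timestamp);
    }

    /** Assumed TTL rule for this sketch: a cell is expired once its age exceeds the TTL. */
    static boolean isExpired(Cell cell, long nowMillis, long ttlMillis) {
        return nowMillis - cell.timestamp > ttlMillis;
    }

    public static void main(String[] args) {
        long ttl = 1_000L;
        Cell data = new Cell("v", 100L);   // data row written at t=100
        long buildTime = 900L;             // index built at t=900
        Cell drifted = buildIndexCellWithWallClock(data, buildTime);
        Cell aligned = buildIndexCellPreservingTimestamp(data);

        long readTime = 1_500L;            // the data cell is 1400 ms old at this point
        System.out.println("data expired:    " + isExpired(data, readTime, ttl));
        System.out.println("drifted expired: " + isExpired(drifted, readTime, ttl));
        System.out.println("aligned expired: " + isExpired(aligned, readTime, ttl));
    }
}
```

In this model the data cell and the timestamp-preserving index cell expire
together, while the wall-clock index cell is still live at read time.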
In addition to building an index fully, there is also a way to rebuild an index
partially to recover from index write failures, which can happen during data
table updates. When such a failure happens on an index table, the index is
disabled by setting the INDEX_STATE column of the row corresponding to the
index table in SYSTEM.CATALOG. Also the timestamp associated with the failed
write is recorded in the INDEX_DISABLE_TIMESTAMP column of the same row. This
is done in the UpdateIndexState method of IndexUtil.
MetaDataRegionObserver runs a periodic task to go through SYSTEM.CATALOG to
identify index tables that are ready for index rebuild. Using the
INDEX_DISABLE_TIMESTAMP value, the rows of the data table that need to be
replayed to rebuild the index are identified. During these replay writes, the
timestamp values in the data table are correctly passed to the index table. In
other words, the partial index rebuild does not have the timestamp problem that
exists in full index rebuilds.
*Alternative Solutions*
There are two main approaches to solving the timestamp problem.
The first one is to enhance the PhoenixIndexDBWritable class and the UPSERT
and UPSERT SELECT code paths with the ability to retrieve and set HBase cell
timestamps. This requires adding a new hint (e.g., TIMESTAMP) to UPSERT and
UPSERT SELECT statements. This hint for UPSERT implies that the provided
timestamp values should be used instead of using the current wall clock. For
UPSERT SELECT, the hint means that cell timestamps should be retrieved by the
SELECT statement and passed to the UPSERT statement. The following changes have
been identified after a high level inspection of the code.
PhoenixIndexDBWritable:
* Change the type of the rowTs property from long to List<Long>
* Change the methods accessing rowTs (i.e., write and getRowTs) accordingly
* Add the timestamp hint to the UPSERT statement
* Pass the timestamps to the PreparedStatement argument. To do that, unwrap it
as PhoenixPreparedStatement and call setParameterTimestamp(int parameterIndex,
long timestamp), which would be introduced as part of this solution alternative
PhoenixPreparedStatement:
* Add a method (e.g., void setParameterTimestamp(int parameterIndex, long
timestamp))
* Add a property called timestamps (List<Long> timestamps)
PhoenixSQL grammar:
* Add a new hint for preserving timestamps when populating an index from a
data table, e.g., TIMESTAMP
HintNode:
* Add the new hint for timestamps
UpsertCompiler:
* Various parts of the UpsertCompiler code need to be changed when the
timestamp hint is given (1) to retrieve cell timestamps and create mutations
with these timestamps for the UPSERT SELECT statements and (2) to use the
parameter timestamps in PhoenixPreparedStatement objects to create the mutations
for UPSERT statements.
This list may not be complete and more changes may be required. As can be seen
from the identified changes, the first alternative requires surgical changes in
the very core of Phoenix. It also requires a change in the SQL grammar (i.e.,
in the external API) for a hint that may be used solely for internal purposes.
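To make the PhoenixPreparedStatement part of the first alternative more
concrete, here is a minimal sketch of what a per-parameter timestamp could look
like. It is a toy model, not the real PhoenixPreparedStatement: the class
TimestampedStatementSketch, the USE_WALL_CLOCK sentinel and resolveTimestamp
are hypothetical, and setParameterTimestamp is the method proposed above, which
does not exist in Phoenix today.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the proposal. This is NOT the real PhoenixPreparedStatement.
final class TimestampedStatementSketch {
    // Hypothetical sentinel meaning "no explicit timestamp was supplied".
    static final long USE_WALL_CLOCK = -1L;

    private final List<Object> parameters = new ArrayList<>();
    // Mirrors the proposed "timestamps" property (List<Long>).
    private final List<Long> timestamps = new ArrayList<>();

    TimestampedStatementSketch(int parameterCount) {
        for (int i = 0; i < parameterCount; i++) {
            parameters.add(null);
            timestamps.add(USE_WALL_CLOCK);
        }
    }

    // Parameter indexes are 1-based, as in JDBC.
    void setObject(int parameterIndex, Object value) {
        parameters.set(parameterIndex - 1, value);
    }

    Object getObject(int parameterIndex) {
        return parameters.get(parameterIndex - 1);
    }

    // The proposed method: attach a cell timestamp to one bound parameter.
    void setParameterTimestamp(int parameterIndex, long timestamp) {
        timestamps.set(parameterIndex - 1, timestamp);
    }

    // What a compiler could do when building the mutation: use the supplied
    // timestamp if present, otherwise fall back to the wall clock.
    long resolveTimestamp(int parameterIndex, long nowMillis) {
        long ts = timestamps.get(parameterIndex - 1);
        return ts == USE_WALL_CLOCK ? nowMillis : ts;
    }
}
```

A caller such as PhoenixIndexDBWritable would then set each value and its
timestamp together, and parameters without an explicit timestamp would keep
the current wall-clock behavior.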
The second alternative solution is to use MetaDataRegionObserver for both full
index builds and partial index rebuilds. This solution implies abandoning
UPSERT SELECT for synchronous full index builds and MapReduce for asynchronous
full index builds. Actually, with this alternative, we do not need to make a
distinction between partial and full index builds or asynchronous or
synchronous builds since there will be one code path that is used for all
purposes. This solution requires setting INDEX_DISABLE_TIMESTAMP to a
predefined value (or the data table creation time) after an index table is
created so that all the rows of the data table will be considered by
MetaDataRegionObserver (i.e., the partial rebuild will turn into a full index
build). This also implies that all full index builds will be asynchronous,
i.e., the full build will not happen in the context of index table creation.
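The idea of turning a partial rebuild into a full build can be sketched as
follows. This is a simplified model under stated assumptions, not the
MetaDataRegionObserver code: Row, rowsToReplay and replay are hypothetical
names, and the sketch assumes that a rebuild replays every data row whose
timestamp is at or after INDEX_DISABLE_TIMESTAMP, using 0 as the predefined
value.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model only: no Phoenix or HBase classes are used here.
final class RebuildReplaySketch {

    /** Hypothetical data table row: a row key plus its write timestamp (ms). */
    static final class Row {
        final String key;
        final long timestamp;

        Row(String key, long timestamp) {
            this.key = key;
            this.timestamp = timestamp;
        }
    }

    /** Assumed selection rule: replay rows written at or after the disable timestamp. */
    static List<Row> rowsToReplay(List<Row> dataRows, long indexDisableTimestamp) {
        List<Row> selected = new ArrayList<>();
        for (Row row : dataRows) {
            if (row.timestamp >= indexDisableTimestamp) {
                selected.add(row);
            }
        }
        return selected;
    }

    /** Replay writes carry the data row timestamp over to the index entry. */
    static Map<String, Long> replay(List<Row> rows) {
        Map<String, Long> index = new LinkedHashMap<>();
        for (Row row : rows) {
            index.put(row.key, row.timestamp);
        }
        return index;
    }

    public static void main(String[] args) {
        List<Row> data = new ArrayList<>();
        data.add(new Row("r1", 100L));
        data.add(new Row("r2", 200L));
        data.add(new Row("r3", 300L));

        // Partial rebuild after a write failure recorded at t=200.
        System.out.println("partial: " + replay(rowsToReplay(data, 200L)).keySet());
        // Full build: the disable timestamp is set to the predefined value 0.
        System.out.println("full:    " + replay(rowsToReplay(data, 0L)).keySet());
    }
}
```

With the disable timestamp at 0, the same replay path covers every row, and
each index entry keeps the timestamp of its data row.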
The second solution requires fewer code changes and simplifies the current index
implementation by eliminating the IndexTool code and its bugs. The main
drawback of the second solution is that the current MetaDataRegionObserver is
single-threaded and rebuilds one index at a time. However, it can be enhanced
to be multithreaded, which will benefit not only full index builds but also
index rebuilds. Thus, I am inclined to implement the second alternative but
would like to hear comments, as the implications of the second alternative go
beyond fixing this issue. I also want to make sure I did not miss something
important in my analysis, [~gjacoby], [~vincentpoon].
> Index mutations created by IndexTool will have wrong timestamps
> ---------------------------------------------------------------
>
> Key: PHOENIX-5018
> URL: https://issues.apache.org/jira/browse/PHOENIX-5018
> Project: Phoenix
> Issue Type: Bug
> Affects Versions: 4.14.0, 5.0.0
> Reporter: Geoffrey Jacoby
> Assignee: Kadir OZDEMIR
> Priority: Major
>
> When doing a full rebuild (or initial async build) on an index using the
> IndexTool and PhoenixIndexImportDirectMapper, we generate the index mutations
> by creating an UPSERT SELECT query from the base table to the index, then
> taking the Mutations from it and inserting it directly into the index via an
> HBase HTable.
> The timestamps of the Mutations use the default HBase behavior, which is to
> take the current wall clock. However, the timestamp of an index KeyValue
> should use the timestamp of the initial KeyValue in the base table.
> Having base table and index timestamps out of sync can cause all sorts of
> weird side effects, such as if the base table has data with an expired TTL
> that isn't expired in the index yet.