Sanjeet Malhotra created PHOENIX-7619:
-----------------------------------------
Summary: Excess HFiles are being read to look for more than required column versions
Key: PHOENIX-7619
URL: https://issues.apache.org/jira/browse/PHOENIX-7619
Project: Phoenix
Issue Type: Bug
Reporter: Sanjeet Malhotra
Assignee: Sanjeet Malhotra
Steps to reproduce:
* Create a table with one column family.
{code:java}
CREATE TABLE TEST.HBASE_READS (
    ID1 VARCHAR NOT NULL,
    ID2 VARCHAR,
    VAL1 VARCHAR
    CONSTRAINT PK PRIMARY KEY (ID1)
) BLOOMFILTER = NONE;{code}
* Write some data to the table and flush it, so that there is at least
1 HFile. (During my testing I ensured there were 3 HFiles per region.)
* Write some more data to the table, but this time don't flush, so this
data stays in the memstore.
* Query a single row whose data is still in the memstore and not yet in any
HFile. The row should be served purely from the memstore, without even
needing to read an HFile.
Expectation: The queried row should come from the memstore, with no need to
read any HFile.
Actual: The memstore along with all HFiles was scanned to get the Result back
to the client.
Reason:
In HBase, when the [StoreScanner is
initialized|https://github.com/apache/hbase/blob/efa228ef446c0e63bbe2915a48d3324efab79ccc/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/StoreScanner.java#L266],
we go for a lazy seek, because the Scan object coming from Phoenix specifies
the column qualifiers to be queried. If the StoreFile on which we are doing
the lazy seek has no deleteFamily or deleteFamilyVersion markers, then [this
line|https://github.com/apache/hbase/blob/b21ba71f73881336345fd5dd7d647910b3058e05/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/StoreFileScanner.java#L438]
will be hit. The same is done for all StoreFileScanners, while the head of the
memstore scanner (SegmentScanner) will be at the first column of the given row.
Next, [this
line|https://github.com/apache/hbase/blob/b21ba71f73881336345fd5dd7d647910b3058e05/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/querymatcher/ScanQueryMatcher.java#L192]
will be hit until the memstore scanner is the topmost scanner in the priority
queue of all the scanners (3 StoreFile scanners and 1 memstore scanner). Once
the memstore scanner is the topmost scanner, the first column being queried
will be read from the memstore and [this line will be
hit|https://github.com/apache/hbase/blob/b21ba71f73881336345fd5dd7d647910b3058e05/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/querymatcher/ExplicitColumnTracker.java#L167]
after a successful column match. Here, if {{maxVersions}} versions have been
found, we skip to the next column, which again will be read from the memstore.
But if {{maxVersions}} versions have not been found yet, we go on to read the
next version, i.e. the next cell, which leads to scanning all the StoreFiles.
For "User" scans {{maxVersions}} should have been {{1}} for us, so we should
have skipped to the next column once we found the latest version of the
current column in the memstore. But for "User" scans {{maxVersions}} is
{{INT_MAX}} for us, leading to reading all the StoreFiles.
We should have [hit this
line|https://github.com/apache/hbase/blob/efa228ef446c0e63bbe2915a48d3324efab79ccc/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/StoreScanner.java#L746],
but we end up [hitting this
line|https://github.com/apache/hbase/blob/efa228ef446c0e63bbe2915a48d3324efab79ccc/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/StoreScanner.java#L704].
{{maxVersions}} is {{INT_MAX}} for us because we override it
[here|https://github.com/apache/phoenix/blob/9cb48832a7e9b9a972d682535179ab6a2fd0cb16/phoenix-core-server/src/main/java/org/apache/phoenix/coprocessor/BaseScannerRegionObserver.java#L432-L435].
The {{preStoreScannerOpen}} hook is called for "User" scans, so we are
penalizing all "User" scans.
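The version-counting behaviour described above can be sketched with a toy model. This is plain Java, not HBase code: the class, method names, and HFile count are illustrative only. The point it shows is that with {{maxVersions}} = 1 the newest version from the memstore satisfies the query and no HFile is touched, while with {{maxVersions}} = {{INT_MAX}} the tracker keeps looking for older versions and every HFile gets read.
{code:java}
// Toy model of the ExplicitColumnTracker behaviour described above.
// NOT HBase code: names and numbers are illustrative. One row, one column;
// the newest version sits in the memstore, older versions sit in HFiles.
public class ColumnTrackerModel {

    /** Returns how many HFiles are touched for the given maxVersions. */
    static int hfilesTouched(int maxVersions, int hfileCount) {
        int versionsFound = 0;
        int hfilesRead = 0;

        // The newest version is served from the memstore scanner.
        versionsFound++;

        // Mimics the "keep reading older versions until maxVersions is
        // reached" loop: each older version forces a read of another HFile.
        while (versionsFound < maxVersions && hfilesRead < hfileCount) {
            hfilesRead++;
            versionsFound++;
        }
        return hfilesRead;
    }

    public static void main(String[] args) {
        // maxVersions = 1 (what "User" scans should use): memstore alone
        // answers the query, no HFile is read.
        System.out.println(hfilesTouched(1, 3));                 // 0
        // maxVersions = INT_MAX (the overridden value): every HFile is
        // scanned looking for more versions.
        System.out.println(hfilesTouched(Integer.MAX_VALUE, 3)); // 3
    }
}
{code}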
Fix for the {{preStoreScannerOpen()}} hook:
* Don't override {{MIN_VERSIONS}} and {{VERSIONS}}.
* Set TTL to {{Long.MAX_VALUE}} instead of {{HConstants.FOREVER}}. This is
needed because {{HConstants.FOREVER}} is {{INT_MAX}} and the TTL overridden as
part of ScanOptions is interpreted in milliseconds by HBase. {{INT_MAX}} ms is
equivalent to a little less than 25 days, so HBase will treat even the latest
version of a column qualifier as expired if it is older than 25 days. This can
cause rows to expire partially. Currently, rows are not expiring partially
only because we set {{MIN_VERSIONS}} in this hook to {{INT_MAX}}; once we stop
overriding {{MIN_VERSIONS}}, we need to set TTL to {{Long.MAX_VALUE}} (the
TTL's data type is long, so this fits). Verified this via an IT.
* Continue overriding {{KeepDeletedCells}} to {{TTL}}. If we stop doing this,
SCN queries will be impacted. Scenario: we keep {{KeepDeletedCells}} as
{{FALSE}}. Say at timestamp T1 I wrote a row, and at T2 > T1 I deleted the
row. Now suppose I set my SCN value to a timestamp between T1 and T2; the
expectation is that I should see the inserted row, but I won't, because to see
past delete markers when a custom time range is specified in the scan, I need
to set {{KeepDeletedCells}} to a value other than {{FALSE}}. I verified this
via an IT.
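To spell out the arithmetic behind the TTL bullet (plain Java; the only fact assumed from HBase is that {{HConstants.FOREVER}} equals {{Integer.MAX_VALUE}}):
{code:java}
import java.util.concurrent.TimeUnit;

// Why HConstants.FOREVER is unsafe as a ScanOptions TTL override: the value
// is Integer.MAX_VALUE, and ScanOptions interprets the TTL in milliseconds.
public class TtlArithmetic {
    public static void main(String[] args) {
        long foreverMs = Integer.MAX_VALUE; // value of HConstants.FOREVER
        // ~24 full days: "a little less than 25 days" as noted above.
        System.out.println(TimeUnit.MILLISECONDS.toDays(foreverMs)); // 24
        // Long.MAX_VALUE ms is hundreds of millions of years, i.e.
        // effectively no expiry, which is the intent of the fix.
        System.out.println(TimeUnit.MILLISECONDS.toDays(Long.MAX_VALUE) / 365L);
    }
}
{code}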
--
This message was sent by Atlassian Jira
(v8.20.10#820010)