Sanjeet Malhotra created PHOENIX-7619:
-----------------------------------------
             Summary: Excess HFiles are being read to look for more than required column versions
                 Key: PHOENIX-7619
                 URL: https://issues.apache.org/jira/browse/PHOENIX-7619
             Project: Phoenix
          Issue Type: Bug
            Reporter: Sanjeet Malhotra
            Assignee: Sanjeet Malhotra


Steps to reproduce (a runnable sketch follows the list):
* Create a table with one column family.
{code:java}
CREATE TABLE TEST.HBASE_READS(
    ID1 VARCHAR NOT NULL,
    ID2 VARCHAR,
    VAL1 VARCHAR
    CONSTRAINT PK PRIMARY KEY (ID1))
    BLOOMFILTER = NONE;{code}
* Write some data to the table and flush it, so that there is at least one HFile. (During my testing I ensured there were 3 HFiles per region.)
* Write some more data to the table, but this time don't flush, so this data stays in the memstore.
* Query a single row that is present only in the memstore and not in any HFile, so the result should come purely from the memstore without even needing to read an HFile.
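For concreteness, a minimal repro sketch over the Phoenix JDBC driver and the HBase Admin API. The JDBC URL, the row values, and the assumption that namespace mapping is disabled (so the HBase table name is {{TEST.HBASE_READS}}) are illustrative, not part of the original report:
{code:java}
import java.sql.DriverManager;
import java.sql.ResultSet;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class MemstoreReadRepro {
    public static void main(String[] args) throws Exception {
        try (org.apache.hadoop.hbase.client.Connection hbaseConn =
                     ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = hbaseConn.getAdmin();
             java.sql.Connection conn =
                     DriverManager.getConnection("jdbc:phoenix:localhost")) {
            // Step 2: write a row and flush, so it ends up in an HFile.
            conn.createStatement().execute(
                    "UPSERT INTO TEST.HBASE_READS VALUES ('row1', 'a', 'v1')");
            conn.commit();
            admin.flush(TableName.valueOf("TEST.HBASE_READS"));

            // Step 3: write another row but do NOT flush; it stays in the memstore.
            conn.createStatement().execute(
                    "UPSERT INTO TEST.HBASE_READS VALUES ('row2', 'b', 'v2')");
            conn.commit();

            // Step 4: point lookup for the memstore-only row. Expected: served from
            // the memstore alone; observed: all HFiles of the store are read too.
            try (ResultSet rs = conn.createStatement().executeQuery(
                    "SELECT * FROM TEST.HBASE_READS WHERE ID1 = 'row2'")) {
                while (rs.next()) {
                    System.out.println(rs.getString("ID1") + " -> " + rs.getString("VAL1"));
                }
            }
        }
    }
}
{code}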
Expectation: The queried row should come from the memstore, and there should be no need to read the HFiles.

Actual: The memstore and all the HFiles were scanned to get the Result back to the client.

Reason: In HBase, when the [StoreScanner is initialized|https://github.com/apache/hbase/blob/efa228ef446c0e63bbe2915a48d3324efab79ccc/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/StoreScanner.java#L266], we go for a lazy seek because the Scan object coming from Phoenix specifies the column qualifiers to be queried. If the StoreFile on which we are doing the lazy seek has no deleteFamily or deleteFamilyVersion markers, then [this line|https://github.com/apache/hbase/blob/b21ba71f73881336345fd5dd7d647910b3058e05/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/StoreFileScanner.java#L438] is hit. The same is done for all the StoreFileScanners, while the head of the memstore scanner (SegmentScanner) is at the first column of the given row. Next, [this line|https://github.com/apache/hbase/blob/b21ba71f73881336345fd5dd7d647910b3058e05/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/querymatcher/ScanQueryMatcher.java#L192] is hit until the memstore scanner becomes the top-most scanner in the priority queue of all the scanners (3 StoreFile scanners and 1 memstore scanner). Once the memstore scanner is the top-most scanner, the first queried column is read from the memstore and, after a successful column match, [this line is hit|https://github.com/apache/hbase/blob/b21ba71f73881336345fd5dd7d647910b3058e05/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/querymatcher/ExplicitColumnTracker.java#L167]. Here, if {{maxVersions}} versions have already been found, we skip to the next column, which again is read from the memstore. But if {{maxVersions}} versions have not been found, we go on to read the next version, i.e. the next cell, which leads to scanning all the StoreFiles. For "User" scans {{maxVersions}} should have been {{1}} for us, so we should have skipped to the next column once we found the latest version of the current column in the memstore. But for "User" scans {{maxVersions}} is {{INT_MAX}} for us, leading to reading all the StoreFiles.

We should have [hit this line|https://github.com/apache/hbase/blob/efa228ef446c0e63bbe2915a48d3324efab79ccc/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/StoreScanner.java#L746] but we end up [hitting this line|https://github.com/apache/hbase/blob/efa228ef446c0e63bbe2915a48d3324efab79ccc/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/StoreScanner.java#L704]. The {{maxVersions}} is {{INT_MAX}} for us because we override it [here|https://github.com/apache/phoenix/blob/9cb48832a7e9b9a972d682535179ab6a2fd0cb16/phoenix-core-server/src/main/java/org/apache/phoenix/coprocessor/BaseScannerRegionObserver.java#L432-L435]. The {{preStoreScannerOpen}} hook is called for "User" scans, so we are penalizing all "User" scans.

Fix for the {{preStoreScannerOpen()}} hook (sketches of the revised hook and of the SCN scenario follow this list):
* Don't override MIN_VERSIONS and VERSIONS.
* Set TTL to {{Long.MAX_VALUE}} instead of {{HConstants.FOREVER}}. This is needed because {{HConstants.FOREVER}} is INT_MAX, and the TTL overridden as part of ScanOptions is interpreted in milliseconds by HBase. INT_MAX milliseconds is a little less than 25 days (2147483647 ms ≈ 24.86 days), so HBase would treat even the latest version of a column qualifier as expired if it is older than that. This can cause rows to expire partially. Currently, rows are not expiring partially only because we set MIN_VERSIONS in this hook to INT_MAX; once we stop overriding MIN_VERSIONS we need to set TTL to {{Long.MAX_VALUE}}, which works because TTL's data type is long. Verified this via an IT.
* Continue overriding {{KeepDeletedCells}} to {{TTL}}. If we stop doing this then SCN queries will be impacted. Scenario: we leave {{KeepDeletedCells}} as {{FALSE}}. Say at timestamp T1 I wrote a row and at T2 > T1 I deleted it. Now suppose I set my SCN value to a timestamp between T1 and T2; the expectation is that I should see the inserted row, but I won't, because to see past delete markers when a custom time range is specified in the scan, {{KeepDeletedCells}} must be set to a value other than {{FALSE}}. I verified this via an IT.
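For illustration, a minimal sketch of what the revised hook in {{BaseScannerRegionObserver}} could look like, assuming HBase's {{ScanOptions}} setters ({{setTTL}}, {{setKeepDeletedCells}}). This is a sketch of the three bullets above, not the actual patch:
{code:java}
import java.io.IOException;
import org.apache.hadoop.hbase.KeepDeletedCells;
import org.apache.hadoop.hbase.coprocessor.ObserverContext;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
import org.apache.hadoop.hbase.regionserver.ScanOptions;
import org.apache.hadoop.hbase.regionserver.Store;

// Inside BaseScannerRegionObserver (sketch):
@Override
public void preStoreScannerOpen(ObserverContext<RegionCoprocessorEnvironment> ctx,
        Store store, ScanOptions options) throws IOException {
    // 1. No more setMaxVersions()/setMinVersions() calls: leaving maxVersions
    //    at the schema value (1 for user scans) lets ExplicitColumnTracker skip
    //    to the next column as soon as the newest version is found in the
    //    memstore, instead of seeking into every StoreFile for more versions.

    // 2. Long.MAX_VALUE rather than HConstants.FOREVER: FOREVER is an int
    //    (Integer.MAX_VALUE), and the TTL set via ScanOptions is interpreted
    //    in milliseconds, so it would only cover 2147483647 ms ~= 24.86 days
    //    and versions older than that would look expired.
    options.setTTL(Long.MAX_VALUE);

    // 3. Keep this override: without it, SCN queries with a custom time range
    //    cannot see past delete markers (see the scenario above).
    options.setKeepDeletedCells(KeepDeletedCells.TTL);
}
{code}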
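And a minimal sketch of the SCN scenario from the last bullet, using the standard {{CurrentSCN}} connection property ({{PhoenixRuntime.CURRENT_SCN_ATTRIB}}); the JDBC URL and the {{scnBetweenT1AndT2}} value are illustrative:
{code:java}
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.util.Properties;
import org.apache.phoenix.util.PhoenixRuntime;

public class ScnVisibilityCheck {
    public static void main(String[] args) throws Exception {
        // Pick a timestamp after the row was written (T1) and before it was deleted (T2).
        long scnBetweenT1AndT2 = Long.parseLong(args[0]);

        Properties props = new Properties();
        props.setProperty(PhoenixRuntime.CURRENT_SCN_ATTRIB,
                Long.toString(scnBetweenT1AndT2));
        try (Connection scnConn =
                     DriverManager.getConnection("jdbc:phoenix:localhost", props);
             ResultSet rs = scnConn.createStatement().executeQuery(
                     "SELECT * FROM TEST.HBASE_READS WHERE ID1 = 'row1'")) {
            // With KeepDeletedCells overridden to TTL, the scan's time range can
            // look past the T2 delete marker and the row written at T1 is returned;
            // with KeepDeletedCells.FALSE this query comes back empty.
            System.out.println("row visible at SCN: " + rs.next());
        }
    }
}
{code}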