Davis-Zhang-Onehouse commented on PR #13523:
URL: https://github.com/apache/hudi/pull/13523#issuecomment-3049806624

   > Only took a 2 min skim. Extending the expressions sg, but should we do it 
using sth custom? is there a standard relational expression that can used in 
structure and naming
   
   # Do I understand your idea correctly?
   
   ## Problem statement
   We need two abstractions:
   - [Customized seekKey] [nice to have] A customized way of building the 
seekKey for reader.seekTo. To look up SI key "secKey", the seekKey should be 
"secKey$". This improves lookup performance because the reader can skip 
irrelevant records and even whole data blocks, especially when many records 
share a similar prefix.
   - [Customized index record matching] [must have] A customized way of key 
matching. To match SI index record key "secKey$recKey" against lookup key 
"secKey", we must do 
`getUnescapedSecondaryKeyFromSecondaryIndexKey(recordKey).equals(lookupKey)`. 
This is customized logic.
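
To make the two abstractions concrete, here is a minimal, self-contained 
sketch. It is not the Hudi implementation: the class and method names are 
hypothetical, and the escaping scheme (a backslash in front of `$` and `\`) is 
an assumption made only so the example can round-trip keys that contain the 
separator.

```java
final class SecondaryIndexKeySketch {
  static final char SEPARATOR = '$';
  static final char ESCAPE = '\\'; // assumed escape character, for illustration only

  // Escape separator and escape characters so they cannot be mistaken for the delimiter.
  static String escape(String s) {
    StringBuilder sb = new StringBuilder();
    for (char c : s.toCharArray()) {
      if (c == SEPARATOR || c == ESCAPE) {
        sb.append(ESCAPE);
      }
      sb.append(c);
    }
    return sb.toString();
  }

  // Index record key layout discussed above: "secKey$recKey".
  static String toIndexKey(String secKey, String recKey) {
    return escape(secKey) + SEPARATOR + escape(recKey);
  }

  // [Customized seekKey]: append the separator so a seek targets "secKey$".
  static String toSeekKey(String secKey) {
    return escape(secKey) + SEPARATOR;
  }

  // Unescaped text before the first unescaped separator.
  static String secondaryKeyOf(String indexKey) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < indexKey.length(); i++) {
      char c = indexKey.charAt(i);
      if (c == ESCAPE && i + 1 < indexKey.length()) {
        sb.append(indexKey.charAt(++i)); // keep the escaped character literally
      } else if (c == SEPARATOR) {
        return sb.toString();
      } else {
        sb.append(c);
      }
    }
    throw new IllegalArgumentException("No separator in index key: " + indexKey);
  }

  // [Customized index record matching]: compare the extracted secondary key.
  static boolean matches(String indexKey, String lookupKey) {
    return secondaryKeyOf(indexKey).equals(lookupKey);
  }

  public static void main(String[] args) {
    System.out.println(toSeekKey("abc"));
    System.out.println(matches(toIndexKey("abc", "r1"), "abc"));
    System.out.println(matches(toIndexKey("abcd", "r2"), "abc"));
  }
}
```

Note that a plain prefix check on "abc" would also accept "abcd$r2", which is 
why the matching step compares the extracted secondary key instead.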
   
   We need to decide how to provide these abstractions. This is a change at the 
FG reader interface level, so we should discuss it once and implement it only 
once.
   
   You seem to suggest something like the following:
   
   Use Expression Builder:
     Instead of: transformKeysToPredicateByOperator(keys, 
SECONDARY_INDEX_KEY_MATCH)
     Use: Predicates.in(
             Functions.substringBeforeSpliter(column, "$"), 
             keys
           )
   
   1. Remove the custom operator: instead of SECONDARY_INDEX_KEY_MATCH, use a 
combination of:
      - A string transformation function (e.g., SUBSTRING_BEFORE)
      - The standard EQUALS operator (or IN for multiple keys)
   2. Introduce string functions:
      - Add a SUBSTRING_BEFORE_SPLITER(expr, delimiter) function, which is more 
generic and replaces the customized 
getUnescapedSecondaryKeyFromSecondaryIndexKey
      - Add proper handling for escaped delimiters
      - This follows standard SQL function patterns
   3. Transform at the expression level:
      - Current: SECONDARY_INDEX_KEY_MATCH(column, ['key1', 'key2'])
      - New: SUBSTRING_BEFORE_SPLITER(indexRecordKey, '$') IN ['key1', 'key2']
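
The compound form in step 3 could be sketched roughly as follows. This uses 
plain `java.util.function` types as stand-ins; `substringBefore` and `in` are 
hypothetical helpers for illustration, not existing Hudi expression APIs, and 
the backslash escape character is an assumption. Returning the whole unescaped 
string when no delimiter is found is also just one possible choice.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;
import java.util.function.Function;
import java.util.function.Predicate;

final class CompoundPredicateSketch {

  // Generic string function: unescaped text before the first unescaped delimiter.
  static Function<String, String> substringBefore(char delimiter, char escape) {
    return value -> {
      StringBuilder sb = new StringBuilder();
      for (int i = 0; i < value.length(); i++) {
        char c = value.charAt(i);
        if (c == escape && i + 1 < value.length()) {
          sb.append(value.charAt(++i)); // escaped character is kept literally
        } else if (c == delimiter) {
          break;
        } else {
          sb.append(c);
        }
      }
      return sb.toString();
    };
  }

  // Standard IN predicate over the result of any expression.
  static <T> Predicate<String> in(Function<String, T> expr, Collection<T> keys) {
    Set<T> keySet = new HashSet<>(keys);
    return recordKey -> keySet.contains(expr.apply(recordKey));
  }

  public static void main(String[] args) {
    // SUBSTRING_BEFORE_SPLITER(indexRecordKey, '$') IN ['key1', 'key2']
    Predicate<String> predicate =
        in(substringBefore('$', '\\'), Arrays.asList("key1", "key2"));
    System.out.println(predicate.test("key1$rec9"));
    System.out.println(predicate.test("key10$rec9"));
  }
}
```

The point of the composition is that the matching logic no longer needs a 
dedicated operator: the string function and the IN predicate are each generic 
and can be reused independently.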
   
   
   # What I need to think about
   
   - How to abstract "[Customized seekKey]": I need to think more about this; 
no plan yet.
   - How to abstract "[Customized index record matching]": I can explore the 
compound expression approach above.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
