Vinoth Chandar created HUDI-6712:
------------------------------------

             Summary: Implement optimized keyed lookup on parquet files
                 Key: HUDI-6712
                 URL: https://issues.apache.org/jira/browse/HUDI-6712
             Project: Apache Hudi
          Issue Type: New Feature
            Reporter: Vinoth Chandar
            Assignee: Lin Liu
             Fix For: 1.0.0


Parquet performs poorly when looking up specific records by the values of a 
single key column. 

e.g: select * from parquet where key in ("a", "b", "c") (SQL)
e.g: List<Records> lookup(parquetFile, Set<String> keys) (code) 

Let's implement a reader that is optimized for this pattern and scans the least 
amount of data. 

Requirements: 
1. Need to support multiple values for same key. 
2. Can assume the file is sorted by the key/lookup field. 
3. Should handle non-existence of keys.
4. Should leverage parquet metadata (bloom filters, column index, ... ) to 
minimize the data read. 
5. Must make the minimum number of RPC calls to cloud storage.
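
A rough sketch of how requirements 1-5 could fit together in the read-planning 
step, shown only to illustrate the idea. It assumes the per-page min/max keys 
and byte ranges have already been loaded from the file's metadata. PageStats, 
ByteRange and planReads are hypothetical names for this illustration, not 
Parquet or Hudi APIs. Because the file is sorted by key, each key is located by 
binary search, pages that cannot contain a key are skipped, and adjacent 
candidate pages are merged so that each merged range costs one storage request.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

class KeyedLookupSketch {

    /** Hypothetical stand-in for one page's column-index entry. */
    static final class PageStats {
        final String minKey;
        final String maxKey;
        final long offset;
        final long length;

        PageStats(String minKey, String maxKey, long offset, long length) {
            this.minKey = minKey;
            this.maxKey = maxKey;
            this.offset = offset;
            this.length = length;
        }
    }

    /** A contiguous byte range that can be fetched with a single request. */
    static final class ByteRange {
        final long offset;
        final long length;

        ByteRange(long offset, long length) {
            this.offset = offset;
            this.length = length;
        }
    }

    /**
     * Returns the byte ranges to fetch for the given keys.
     * Assumes pages are listed in file order and the file is sorted by key.
     */
    static List<ByteRange> planReads(List<PageStats> pages, Set<String> keys) {
        boolean[] needed = new boolean[pages.size()];
        for (String key : keys) {
            // Binary search for the first page whose maxKey >= key.
            int lo = 0;
            int hi = pages.size();
            while (lo < hi) {
                int mid = (lo + hi) >>> 1;
                if (pages.get(mid).maxKey.compareTo(key) < 0) {
                    lo = mid + 1;
                } else {
                    hi = mid;
                }
            }
            // Walk forward while the page can still hold the key, so that
            // duplicate keys spanning several pages are all covered. A key
            // that falls in a gap between pages marks nothing.
            for (int i = lo; i < pages.size()
                    && pages.get(i).minKey.compareTo(key) <= 0; i++) {
                needed[i] = true;
            }
        }

        // Merge byte-adjacent candidate pages into one range per request.
        List<ByteRange> ranges = new ArrayList<>();
        int i = 0;
        while (i < pages.size()) {
            if (!needed[i]) {
                i++;
                continue;
            }
            long start = pages.get(i).offset;
            long end = start + pages.get(i).length;
            i++;
            while (i < pages.size() && needed[i] && pages.get(i).offset == end) {
                end += pages.get(i).length;
                i++;
            }
            ranges.add(new ByteRange(start, end - start));
        }
        return ranges;
    }

    public static void main(String[] args) {
        List<PageStats> pages = List.of(
                new PageStats("a", "c", 0, 100),
                new PageStats("c", "c", 100, 50),
                new PageStats("d", "f", 150, 80),
                new PageStats("h", "k", 230, 60));
        for (ByteRange r : planReads(pages, Set.of("c", "g"))) {
            System.out.println("fetch offset=" + r.offset + " length=" + r.length);
        }
    }
}
```

A bloom filter check, where available, could run before planReads to drop keys 
that are certainly absent. Min/max statistics alone cannot rule out a missing 
key that falls inside a page's range, so the reader would still have to filter 
the decoded rows.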



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
