[I] Implement optimized keyed lookup on parquet files [hudi]

via GitHub Sun, 30 Nov 2025 00:47:54 -0800


hudi-bot opened a new issue, #16182:
URL: https://github.com/apache/hudi/issues/16182


   Parquet performs poorly when looking up specific records by the values of a 
single key column. 
   
   e.g: select * from parquet where key in ("a", "b", "c") (SQL)
   e.g: List<Records> lookup(parquetFile, Set<String> keys) (code) 
   
   Let's implement a reader that is optimized for this pattern by scanning 
the least amount of data. 
   
   Requirements: 
   1. Need to support multiple values for the same key. 
   2. Can assume the file is sorted by the key/lookup field. 
   3. Should handle non-existence of keys.
   4. Should leverage parquet metadata (bloom filters, column index, ...) to 
minimize the data read. 
   5. Must make the minimum number of RPC calls to cloud storage.
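   
   A minimal sketch of the access pattern these requirements imply, assuming an in-memory stand-in for a key-sorted file: each `Page` carries the min/max key that a Parquet column index would expose, and `pagesRead` stands in for the I/O cost. `KeyedLookupSketch`, `Page`, and `pagesRead` are hypothetical names for illustration only; this is not the Hudi or Parquet reader API, and bloom filters and real storage calls are left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical in-memory model of a key-sorted file; not a Hudi or Parquet API. */
final class KeyedLookupSketch {

    /** One page of rows plus the min/max key a column index would report for it. */
    static final class Page {
        final String minKey;
        final String maxKey;
        final List<String[]> rows; // each row is {key, value}, sorted by key

        Page(List<String[]> rows) {
            this.rows = rows;
            this.minKey = rows.get(0)[0];
            this.maxKey = rows.get(rows.size() - 1)[0];
        }
    }

    /** Number of distinct pages touched so far (a proxy for data read). */
    int pagesRead = 0;

    /** Returns every row whose key is in keys, reading only pages whose key range can match. */
    List<String[]> lookup(List<Page> pages, Set<String> keys) {
        List<String[]> out = new ArrayList<>();
        int lastRead = -1; // highest page index already counted as read
        // Visit keys in sorted order so page accesses only move forward.
        for (String key : new TreeSet<>(keys)) {
            // Binary search for the first page whose maxKey >= key.
            int lo = 0;
            int hi = pages.size();
            while (lo < hi) {
                int mid = (lo + hi) >>> 1;
                if (pages.get(mid).maxKey.compareTo(key) < 0) {
                    lo = mid + 1;
                } else {
                    hi = mid;
                }
            }
            // Duplicate keys may span pages, so continue while minKey <= key.
            // A missing key falls between page ranges and reads nothing.
            for (int i = lo; i < pages.size()
                    && pages.get(i).minKey.compareTo(key) <= 0; i++) {
                if (i > lastRead) {
                    pagesRead++;
                    lastRead = i;
                }
                for (String[] row : pages.get(i).rows) {
                    if (row[0].equals(key)) {
                        out.add(row);
                    }
                }
            }
        }
        return out;
    }
}
```

   In this sketch a key that falls between two page ranges costs no page read, and a key duplicated across a page boundary is collected from both pages. A real reader would additionally need to coalesce the selected byte ranges into as few storage requests as possible.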
   
   ## JIRA info
   
   - Link: https://issues.apache.org/jira/browse/HUDI-6712
   - Type: New Feature
   - Epic: https://issues.apache.org/jira/browse/HUDI-6242
   - Fix version(s):
     - 1.1.0
   
   
   ---
   
   
   ## Comments
   
   21/Aug/23 16:26;linliu;[~rmahindra] sent some PRs to review for context. 
Will finish reading them and start writing the design doc today.;;;
   
   ---
   
   21/Aug/23 23:56;linliu;Based on these PRs,
   
   , will update the corresponding logic and do experiments.;;;
   
   ---
   
   25/Aug/23 01:30;linliu;While moving the lake_plumber code into 
hudi, we found that the parquet version in lake_plumber is 1.13.1, but in hudi 
it is 1.10.1 for spark2 and 1.12.2 for spark3. Though we can skip compiling 
for spark2 for now, I have done a few checks for spark3:
    # ParquetRewriter compiles against parquet 1.12.2, and its 
benchmark has been run on a file in 1.10.1 without any issues (the benchmark 
finished successfully).
    # ParqueKeyedLookup compiles against parquet 1.12.2; however, its 
benchmark fails to compile; after commenting out the failing part, the 
benchmark threw a NullPointerException during execution. After checking, the 
error is related to the page index. Will dig deeper.;;;
   
   ---
   
   25/Aug/23 20:14;linliu;Compared the metadata of two parquet files, one 
in 1.10.1 and one in 1.12.3; both report format version 1.0. I assume 
the "format version" means the file layout, so I am now focusing on fixing the 
problems between 1.13.1 and 1.12.2, and will then migrate the code to Hudi.;;;
   
   ---
   
   12/Sep/23 01:55;vinoth;I actually had a bunch of comments on tests, docs, .. 
;;;
   
   ---
   
   20/Sep/23 18:12;linliu;Have fixed the unit test failures caused by the 
version conflicts of the Parquet format. Re-pushed and waiting to see whether 
the unit tests pass.;;;


