hudi-bot opened a new issue, #16182:
URL: https://github.com/apache/hudi/issues/16182
Parquet performs poorly when looking up specific records by a single key
lookup column.
e.g: select * from parquet where key in ("a", "b", "c") (SQL)
e.g: List<Records> lookup(parquetFile, Set<String> keys) (code)
Let's implement a reader that is optimized for this pattern by scanning the
least amount of data.
Requirements:
1. Need to support multiple values for the same key.
2. Can assume the file is sorted by the key/lookup field.
3. Should handle non-existence of keys.
4. Should leverage parquet metadata (bloom filters, column index, ... ) to
minimize the data read.
5. Must make the minimum number of RPC calls to cloud storage.
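
To make requirements 1-3 concrete, here is a minimal sketch of the page-pruning idea on a key-sorted file. It is an in-memory model only and does not use the parquet-mr API: the `Page` class, its `min()`/`max()` methods (standing in for column-index entries), and the `pagesRead` counter (standing in for storage reads) are all hypothetical names introduced for illustration. Bloom filters and request coalescing for cloud storage are not modeled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical in-memory model of a key-sorted file split into pages. */
public class PageIndexLookupSketch {

    /** One page of sorted keys with parallel values (assumed non-empty). */
    public static final class Page {
        final String[] keys;
        final String[] values;

        public Page(String[] keys, String[] values) {
            this.keys = keys;
            this.values = values;
        }

        // min/max stand in for the per-page statistics of a column index.
        String min() { return keys[0]; }
        String max() { return keys[keys.length - 1]; }
    }

    /** Matching records as "key=value" plus the number of distinct pages read. */
    public static final class Result {
        public final List<String> records = new ArrayList<>();
        public int pagesRead = 0;
    }

    public static Result lookup(List<Page> pages, Set<String> keys) {
        Result result = new Result();
        boolean[] read = new boolean[pages.size()];
        // Visit keys in sorted order so pages are touched front to back.
        for (String key : new TreeSet<>(keys)) {
            // Binary search for the first page whose max is >= key.
            int lo = 0;
            int hi = pages.size();
            while (lo < hi) {
                int mid = (lo + hi) >>> 1;
                if (pages.get(mid).max().compareTo(key) < 0) {
                    lo = mid + 1;
                } else {
                    hi = mid;
                }
            }
            // Duplicate keys may span pages: keep going while min <= key.
            for (int p = lo; p < pages.size()
                    && pages.get(p).min().compareTo(key) <= 0; p++) {
                Page page = pages.get(p);
                if (!read[p]) {
                    read[p] = true;
                    result.pagesRead++;
                }
                for (int i = 0; i < page.keys.length; i++) {
                    if (page.keys[i].equals(key)) {
                        result.records.add(key + "=" + page.values[i]);
                    }
                }
            }
            // A key outside every page's [min, max] range reads no page at all.
        }
        return result;
    }

    public static void main(String[] args) {
        List<Page> pages = Arrays.asList(
            new Page(new String[] {"a", "a", "b"}, new String[] {"1", "2", "3"}),
            new Page(new String[] {"b", "d", "e"}, new String[] {"4", "5", "6"}),
            new Page(new String[] {"f", "g", "h"}, new String[] {"7", "8", "9"}));
        Result r = lookup(pages, new TreeSet<>(Arrays.asList("b", "c", "z")));
        System.out.println(r.records + " pagesRead=" + r.pagesRead);
    }
}
```

In this model the key "b" spans two pages and both values come back, "z" is rejected from the min/max statistics alone, and "c" falls inside a page's range without being present, which is the case where a bloom filter could avoid the read.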
## JIRA info
- Link: https://issues.apache.org/jira/browse/HUDI-6712
- Type: New Feature
- Epic: https://issues.apache.org/jira/browse/HUDI-6242
- Fix version(s):
- 1.1.0
---
## Comments
21/Aug/23 16:26;linliu;[~rmahindra] sent some PRs to review for context.
Will finish reading them and start writing the design doc today.;;;
---
21/Aug/23 23:56;linliu;Based on these PRs,
, will update the corresponding logic and do experiments.;;;
---
25/Aug/23 01:30;linliu;While moving the lake_plumber code into hudi, we
found that the parquet version in lake_plumber is 1.13.1, but in hudi it is
1.10.1 for spark2 and 1.12.2 for spark3. Though we can skip compiling for
spark2 for now, I have done a few checks for spark3:
# ParquetRewriter can be compiled for parquet 1.12.2 version, and its
benchmark has been run on a file in 1.10.1 without any issues (benchmark
finished successfully.)
# ParquetKeyedLookup can be compiled for parquet 1.12.2; however, its
benchmark fails to compile. After commenting out the failing part, the
benchmark threw a NullPointerException during execution. After checking, the
error is related to the page index. Will dig deeper.;;;
---
25/Aug/23 20:14;linliu;Compared the metadata of two parquet files, one in
1.10.1 and one in 1.12.3; both report format version 1.0. I assume the
"format version" means the file layout, so I am now focusing on fixing the
problems between 1.13.1 and 1.12.2, and will migrate the code to Hudi after that.;;;
---
12/Sep/23 01:55;vinoth;I actually had a bunch of comments on tests, docs, ..
;;;
---
20/Sep/23 18:12;linliu;Have fixed the unit test failures caused by the
version conflicts of the Parquet format. Re-pushed and waiting to see whether
the unit tests pass.;;;