[
https://issues.apache.org/jira/browse/HUDI-6712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17757003#comment-17757003
]
Lin Liu commented on HUDI-6712:
-------------------------------
[~rmahindra] sent some PRs to review for context. I will finish reading them
and start writing the design doc today.
> Implement optimized keyed lookup on parquet files
> -------------------------------------------------
>
> Key: HUDI-6712
> URL: https://issues.apache.org/jira/browse/HUDI-6712
> Project: Apache Hudi
> Issue Type: New Feature
> Reporter: Vinoth Chandar
> Assignee: Lin Liu
> Priority: Major
> Fix For: 1.0.0
>
>
> Parquet performs poorly when looking up specific records based on a single
> key (lookup) column.
> e.g.: select * from parquet where key in ("a", "b", "c") (SQL)
> e.g.: List<Records> lookup(parquetFile, Set<String> keys) (code)
> Let's implement a reader that is optimized for this pattern by scanning the
> least amount of data.
> Requirements:
> 1. Need to support multiple values for the same key.
> 2. Can assume the file is sorted by the key/lookup field.
> 3. Should handle non-existence of keys.
> 4. Should leverage parquet metadata (bloom filters, column index, ... ) to
> minimize the amount of data read.
> 5. Must make the minimum number of RPC calls to cloud storage.
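
To make the requirements above concrete, here is a minimal sketch of the lookup pattern. It does not use the Parquet API and is not the design proposed for this issue. `KeyedLookupSketch`, `Rec` and `RowGroup` are hypothetical in-memory stand-ins: each `RowGroup` models a row group (or page) carrying min/max key statistics, as Parquet metadata does. The sketch assumes the file is sorted by key (requirement 2), skips any group whose key range cannot contain a requested key, and binary-searches within the groups it does read.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical in-memory model; a real reader would work from Parquet footer
// metadata and fetch only the byte ranges of the surviving row groups/pages.
class KeyedLookupSketch {

    /** Stand-in for a record: a lookup key plus an opaque payload. */
    static final class Rec {
        final String key;
        final String value;
        Rec(String key, String value) { this.key = key; this.value = value; }
    }

    /** Stand-in for a row group with min/max statistics on the key column. */
    static final class RowGroup {
        final String minKey;
        final String maxKey;
        final List<Rec> rows; // sorted by key; assumes minKey <= maxKey
        RowGroup(String minKey, String maxKey, List<Rec> rows) {
            this.minKey = minKey;
            this.maxKey = maxKey;
            this.rows = rows;
        }
    }

    /** Returns every record whose key is in {@code keys}, in file order. */
    static List<Rec> lookup(List<RowGroup> file, Set<String> keys) {
        List<Rec> out = new ArrayList<>();
        NavigableSet<String> sortedKeys = new TreeSet<>(keys);
        for (RowGroup group : file) {
            // Only keys inside [minKey, maxKey] can possibly be in this group.
            NavigableSet<String> candidates =
                sortedKeys.subSet(group.minKey, true, group.maxKey, true);
            if (candidates.isEmpty()) {
                continue; // pruned by statistics: the group is never read
            }
            for (String key : candidates) {
                // Find the first row with this key, then collect all duplicates.
                int i = lowerBound(group.rows, key);
                while (i < group.rows.size() && group.rows.get(i).key.equals(key)) {
                    out.add(group.rows.get(i));
                    i++;
                }
                // A key that is absent simply contributes no rows.
            }
        }
        return out;
    }

    /** Index of the first row whose key is >= {@code key} (binary search). */
    static int lowerBound(List<Rec> rows, String key) {
        int lo = 0;
        int hi = rows.size();
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (rows.get(mid).key.compareTo(key) < 0) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        return lo;
    }
}
```

Because a key can repeat across a group boundary, each group is checked against the full key set rather than stopping at the first match. Bloom filters and batching of storage reads (requirements 4 and 5) are out of scope for this sketch.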
--
This message was sent by Atlassian Jira
(v8.20.10#820010)