Aitozi opened a new issue, #7876: URL: https://github.com/apache/paimon/issues/7876
### Search before asking

- [x] I searched in the [issues](https://github.com/apache/paimon/issues) and found nothing similar.

### Motivation

## Motivation

Many tables store semi-structured attributes as `MAP<STRING, STRING>`, for example HTTP headers, request parameters, or extensible entity properties. This model is flexible, but it can be inefficient when a small set of keys is present in most rows and is frequently used by queries:

- common keys such as `content-type`, `user-agent`, `locale`, or `campaign` are repeatedly stored in the map branch;
- filters and projections frequently access a small set of these keys;
- a standard Parquet map stores keys and values in the repeated `key_value` branch, so readers must scan the map payload even when a query only needs one common key.

This proposal adds a Paimon-aware Parquet encoding for selected map columns. The logical table schema remains `MAP<STRING, STRING>`, while the physical Parquet file stores configured hot keys as independent sidecar columns.

## Goals

- Keep the Paimon logical type unchanged. Users still read and write a normal map column.
- Avoid introducing a custom Parquet logical type.
- Reduce storage by removing promoted hot keys from the residual map branch, and improve the compression ratio by storing their values as independent columns.
- Enable column pruning and future predicate pushdown for expressions such as `headers['content-type']`.
- Preserve compatibility for files written without map shredding.

## Non-Goals

- This proposal does not define a new Parquet standard logical type.
- The first implementation only supports top-level `MAP<STRING, STRING>` columns.
- Dynamic key discovery is left for a follow-up. The initial version uses static keys configured in table options.

## Physical Layout

For a logical column:

```text
headers MAP<STRING, STRING>
```

Paimon may write the following Parquet physical layout:

```text
headers                         // standard Parquet MAP, containing only residual keys
__paimon_map_shred_headers_0    // value for configured hot key 0
__paimon_map_shred_headers_1    // value for configured hot key 1
...
```

For example, if the configured keys are `content-type,user-agent`, a row:

```text
headers = {
  "content-type": "application/json",
  "user-agent": "mobile-app",
  "x-request-id": "req-001"
}
```

is written as:

```text
headers = {
  "x-request-id": "req-001"
}
__paimon_map_shred_headers_0 = "application/json"
__paimon_map_shred_headers_1 = "mobile-app"
```

When Paimon reads the file, it reconstructs the logical map by merging the residual map with the sidecar values.

## Metadata

The encoding is a Paimon physical optimization. Paimon records the configured hot keys in the Parquet file metadata, under keys of the form:

```text
paimon.map.shredding.<column>.keys
```

Table options define the writer and reader contract:

```text
parquet.map.shredding.columns = headers
parquet.map.shredding.headers.keys = content-type,user-agent,locale
```

If a file does not contain sidecar columns, Paimon reads the map as a normal Parquet map.

## Semantics

- The residual map is a valid standard Parquet map.
- Promoted non-null hot-key values are removed from the residual map and written to sidecar columns.
- Null hot-key values remain in the residual map. This preserves the distinction between an explicit null value and an absent key.
- External Parquet readers that are not Paimon-aware will see only the residual map and sidecar columns. They cannot reconstruct the original logical map without Paimon metadata and semantics.

## Query Benefits

The encoding makes it possible to rewrite:

```sql
headers['content-type']
```

to the corresponding sidecar column when the key is configured as hot.
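
The write-side split, read-side merge, and hot-key lookup described above can be sketched in a few lines of Python. This is only an illustrative model of the semantics, not Paimon code: the function names `shred`, `reconstruct`, and `lookup_hot` are hypothetical, rows are modeled as plain dictionaries, and a null logical map is out of scope.

```python
from typing import Dict, List, Optional, Tuple

# Assumed static hot-key configuration, mirroring the example above.
HOT_KEYS: List[str] = ["content-type", "user-agent"]

LogicalMap = Dict[str, Optional[str]]


def shred(row_map: LogicalMap, hot_keys: List[str]) -> Tuple[LogicalMap, List[Optional[str]]]:
    """Split a logical map into a residual map and one sidecar value per hot key."""
    residual = dict(row_map)
    sidecars: List[Optional[str]] = []
    for key in hot_keys:
        if residual.get(key) is not None:
            # Non-null hot-key value: promote it and drop it from the residual map.
            sidecars.append(residual.pop(key))
        else:
            # Absent key, or explicit null that stays in the residual map.
            sidecars.append(None)
    return residual, sidecars


def reconstruct(residual: LogicalMap, sidecars: List[Optional[str]], hot_keys: List[str]) -> LogicalMap:
    """Merge the residual map with the non-null sidecar values."""
    merged = dict(residual)
    for key, value in zip(hot_keys, sidecars):
        if value is not None:
            merged[key] = value
    return merged


def lookup_hot(sidecars: List[Optional[str]], hot_keys: List[str], key: str) -> Optional[str]:
    """Answer headers[key] for a configured hot key from its sidecar alone."""
    return sidecars[hot_keys.index(key)]


row = {
    "content-type": "application/json",
    "user-agent": "mobile-app",
    "x-request-id": "req-001",
}
residual, sidecars = shred(row, HOT_KEYS)
assert reconstruct(residual, sidecars, HOT_KEYS) == row
```

Note that `lookup_hot` never touches the residual map. That is sound only under the assumption that map element access returns null for a missing key: a null sidecar then covers both an absent key and an explicit null, and either way the lookup result is null.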

This rewrite enables:

- reading only the sidecar column when a query projects a hot key;
- skipping the repeated map branch when a query filters on a hot key;
- using Parquet column statistics and page indexes for these predicates in a follow-up.

## Storage Benefits

The encoding avoids repeating common key strings in every map entry and lets promoted values be compressed as normal scalar columns. This is especially useful when the same keys appear in a large fraction of rows.

## Compatibility

This is not a general Parquet MAP representation. It is a Paimon-aware residual encoding.

The advantage over a custom Parquet logical type is that the file still uses standard Parquet columns and does not require changes in parquet-format or parquet-mr.

The trade-off is that old Paimon versions and generic Parquet readers cannot reconstruct the complete logical map. This is acceptable only when the table explicitly enables the feature.

## Future Work

- Support nested map columns.
- Infer hot keys from buffered rows, similar to Variant shredding schema inference.
- Push down hot-key projection and predicates from Spark/Flink planners to the Parquet reader.

### Solution

_No response_

### Anything else?

_No response_

### Are you willing to submit a PR?

- [x] I'm willing to submit a PR!

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
